Regular Expressions in Ruby

January 17, 2012 · Posted in Development 

 

Ruby’s regular expression syntax is largely Perl-compatible, so if you’re familiar with the preg_* functions in PHP, regular expressions in Ruby will be easy to learn.

Build a regular expression

Regular expressions in Ruby can be created using different syntaxes.
The most common is by enclosing the pattern in forward-slashes. 

%r{} is usually used when the pattern contains a lot of forward-slashes (such as a filepath).

Regular expressions can also be explicitly instantiated using the Regexp class.

/[a-z0-9]+\s/mi
%r{/path/to/gif\.gif}mi
Regexp.new('[a-z0-9]+\s', Regexp::IGNORECASE | Regexp::MULTILINE) # single quotes: in a double-quoted string "\s" is a plain space
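
As a quick check that the three forms above behave the same way, here is a minimal sketch. The sample strings and variable names are made up for illustration.

```ruby
# Three ways to build a regexp with the same flags
# (i = ignore case, m = dot matches newline).
literal = /[a-z0-9]+\s/mi
built   = Regexp.new('[a-z0-9]+\s', Regexp::IGNORECASE | Regexp::MULTILINE)
path_re = %r{/path/to/gif\.gif}mi

# The literal and the Regexp.new form carry the same pattern text.
same_source = (literal.source == built.source)

# Both match an alphanumeric run followed by whitespace, ignoring case.
literal_hit = literal.match?("ABC123 rest")
built_hit   = built.match?("ABC123 rest")

# %r{} avoids escaping every slash in a path; /i makes it case-insensitive.
path_hit = path_re.match?('/PATH/TO/GIF.GIF')

puts same_source, literal_hit, built_hit, path_hit
```

Note that the pattern passed to Regexp.new is single-quoted, so the backslash in \s reaches the regexp engine unchanged.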

Regexp.new

Constructs a new regular expression from pattern, which can be either a String or a Regexp (in which case that regexp’s options are propagated, and new options may not be specified; this is a change as of Ruby 1.8).

r1 = Regexp.new('^a-z+:\\s+\w+') #=> /^a-z+:\s+\w+/
r2 = Regexp.new('cat', true) #=> /cat/i
r3 = Regexp.new('dog', Regexp::EXTENDED) #=> /dog/x
r4 = Regexp.new(r2) #=> /cat/i

options:
If options is a Fixnum (an Integer since Ruby 2.4), it should be one or more of the following constants, or-ed together:

Regexp::EXTENDED – /x – extended mode – whitespace and comments in the pattern are ignored

Regexp::IGNORECASE – /i – case insensitive

Regexp::MULTILINE – /m – multiline mode – '.' will match newline

Otherwise, if options is not nil, the regexp will be case insensitive.
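
The effect of each flag is easier to see side by side. The following sketch uses made-up sample strings; only the Regexp constants and methods are standard Ruby.

```ruby
# EXTENDED: whitespace inside the pattern is ignored, so "c a t" means "cat".
extended = Regexp.new('c a t', Regexp::EXTENDED)
extended_hit = extended.match?('cat')            # true

# IGNORECASE: passing true is shorthand for Regexp::IGNORECASE.
nocase = Regexp.new('cat', true)
nocase_hit  = nocase.match?('CAT')               # true
nocase_flag = nocase.casefold?                   # true

# MULTILINE: without it '.' stops at a newline, with it '.' matches "\n".
plain_dot = Regexp.new('a.b').match?("a\nb")                     # false
multi_dot = Regexp.new('a.b', Regexp::MULTILINE).match?("a\nb")  # true

# Flags are combined with bitwise OR.
combined = Regexp.new('c a t', Regexp::EXTENDED | Regexp::IGNORECASE)
combined_hit = combined.match?('CAT')            # true
```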


The lang parameter enables multibyte support for the regexp:

'n', 'N' = none,

'e', 'E' = EUC,

's', 'S' = SJIS,

'u', 'U' = UTF-8.

 

Read more about Regexp.new: http://corelib.rubyonrails.org/classes/Regexp.html#M001571

 

Use variables in regular expressions

foo = '[\.\d]+' # a string which is variable
pattern = "referer:#{foo}"
reg1 = Regexp.new(pattern, Regexp::IGNORECASE | Regexp::MULTILINE)
reg2 = /referer:#{foo}/mi
reg3 = /referer:[\.\d]+/mi

All three evaluate to the same expression: /referer:[\.\d]+/mi

 

If you need to escape the string in the variable foo, use Regexp.escape:

foo = '192.168.1.5' # a string which is variable
reg1 = /referer:#{Regexp.escape(foo)}/mi

Evaluates to /referer:192\.168\.1\.5/mi
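
To see why escaping matters, the sketch below compares an unescaped and an escaped interpolation. The names loose and strict and the test strings are illustrative only.

```ruby
foo = '192.168.1.5'

# Without escaping, each '.' is a wildcard and matches any character.
loose  = /referer:#{foo}/i
# With Regexp.escape, each '.' only matches a literal dot.
strict = /referer:#{Regexp.escape(foo)}/i

escaped = Regexp.escape(foo)                                  # 192\.168\.1\.5

loose_false_positive  = loose.match?('referer:192x168y1z5')   # true
strict_false_positive = strict.match?('referer:192x168y1z5')  # false
strict_hit            = strict.match?('Referer:192.168.1.5')  # true
```

Escaping is worth doing whenever the interpolated value comes from user input or configuration rather than from a pattern you wrote yourself.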

 

# Use the regular expression

string = "Here is some text referer:0.0.0.0"
string.match(reg1) # returns a MatchData object, or nil if nothing matches
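
A short sketch of what you can do with the result of a match. The capture group in the pattern is an addition for illustration and is not part of the expressions above.

```ruby
string = "Here is some text referer:0.0.0.0"
reg = /referer:([\.\d]+)/i   # parentheses added here to capture the address

m = string.match(reg)        # MatchData, or nil when nothing matches
if m
  whole   = m[0]             # "referer:0.0.0.0"
  address = m[1]             # "0.0.0.0"
end

# =~ returns the index where the match starts (or nil), handy in conditions.
position = (string =~ reg)   # 18

# match? only answers yes/no and does not build a MatchData.
found = string.match?(reg)   # true
```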

 

Regular Expressions and UTF-8 strings

Working with multibyte strings in regular expressions using \uNNNN:

pattern = '[\u0000-\u002F]+'
reg = /#{pattern}/
# or
reg = Regexp.new pattern, nil
# or with options
reg = Regexp.new pattern, Regexp::IGNORECASE | Regexp::MULTILINE

Both the strings used in the regular expression and the string you are searching in must be UTF-8 encoded.

Otherwise you may get errors like

‘incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)’.
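
The following sketch reproduces this kind of error and one way around it. The ISO-8859-1 input is a hypothetical example of a string that arrives in another encoding, so the encodings named in the message differ from the one quoted above.

```ruby
# A pattern with a non-ASCII character is fixed to UTF-8.
re = Regexp.new("caf\u00e9")

# The same text, transcoded to ISO-8859-1 (a hypothetical legacy input).
latin1 = "caf\u00e9".encode('ISO-8859-1')

error_message = nil
begin
  re.match(latin1)
rescue Encoding::CompatibilityError => e
  error_message = e.message   # an "incompatible encoding regexp match" message
end

# Transcoding the string to UTF-8 first makes the match succeed.
fixed_hit = re.match?(latin1.encode('UTF-8'))
```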

 

To ensure this, you can call the .encode method on a string:

pattern = '[\u0000-\u002F]+'.encode('UTF-8')
reg = /#{pattern}/
# or
reg = Regexp.new pattern.encode('UTF-8'), nil
# or with options
reg = Regexp.new pattern.encode('UTF-8'), Regexp::IGNORECASE | Regexp::MULTILINE
puts reg.encoding # UTF-8
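
When debugging encoding problems it helps to inspect what a regexp actually reports. This is a minimal sketch with made-up patterns; it uses a non-ASCII character to force a UTF-8 regexp, which is an assumption of the example rather than something the text above requires.

```ruby
ascii_re = Regexp.new('[a-z]+')         # ASCII-only pattern
utf8_re  = Regexp.new("\u00e9+")        # pattern containing a non-ASCII character

ascii_encoding = ascii_re.encoding      # US-ASCII
utf8_encoding  = utf8_re.encoding       # UTF-8

# An ASCII-only regexp is not tied to one encoding...
ascii_fixed = ascii_re.fixed_encoding?  # false
# ...while a regexp with non-ASCII characters is.
utf8_fixed  = utf8_re.fixed_encoding?   # true

utf8_text   = "caf\u00e9"
latin1_text = utf8_text.encode('ISO-8859-1')

# The ASCII-only regexp can search strings in either encoding.
hits = [ascii_re.match?(utf8_text), ascii_re.match?(latin1_text)]  # [true, true]
```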

 

If you use the option 'n', the Regexp object will be in ASCII-8BIT or US-ASCII encoding even if the pattern string is in UTF-8. It should still work for searching in UTF-8 strings.

# 1
reg = Regexp.new '[a-zA-Z]+'.encode('UTF-8'), Regexp::IGNORECASE | Regexp::MULTILINE, 'n'
puts reg.encoding # US-ASCII
# 2
reg = Regexp.new '[\x80-\xFF]+'.encode('UTF-8'), Regexp::IGNORECASE | Regexp::MULTILINE, 'n'
puts reg.encoding # ASCII-8BIT
# 3
reg = Regexp.new '[\u0000-\u002F]+'.encode('UTF-8'), Regexp::IGNORECASE | Regexp::MULTILINE, 'n'
puts reg.encoding # US-ASCII

 

Working with multibyte strings in regular expressions using \xNN:

If you use byte escapes like \xNN in your regular expressions, you may get an error like “invalid multibyte escape” in Ruby 1.9.x. This happens with bytes of \x80 and above, which are not valid characters on their own in a UTF-8 or US-ASCII pattern.

For example, the following regular expression gives an error.

pattern = '[\x80-\xFF]+'
reg1 = Regexp.new(pattern, Regexp::IGNORECASE | Regexp::MULTILINE)
# error: invalid multibyte escape

 

To avoid this problem you should use the following syntax to create a Regexp object:

pattern = '[\x80-\xFF]+'
reg1 = Regexp.new pattern, nil, 'n'
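
A sketch of the same error and a workaround, using high bytes (\x80-\xFF), which are the values that cannot stand alone in a UTF-8 pattern. Because the third lang argument is not accepted by every Ruby version, this example uses the /n flag on a regexp literal instead; the sample data is made up.

```ruby
# Building the pattern without a binary flag fails for bytes >= 0x80.
build_error = nil
begin
  Regexp.new('[\x80-\xFF]+')
rescue RegexpError => e
  build_error = e.message      # mentions "invalid multibyte escape"
end

# The /n flag treats the pattern as raw bytes (ASCII-8BIT).
binary_re = /[\x80-\xFF]+/n

# String#b returns a copy of the string tagged as ASCII-8BIT (binary).
data = "abc\xE9\xFFdef".b
m = binary_re.match(data)
high_bytes = m[0].bytes        # [233, 255]
```

Matching against a binary copy of the data keeps the regexp and the string in the same encoding, so no compatibility error is raised.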

 

Find more discussions about this issue:

- http://www.ruby-forum.com/topic/183413

 

 

 

Read more:

- Basic Regexp methods in Ruby and comparison with PHP functions

- Encodings in Ruby

- Details on encodings and regexp
