Regular Expressions in Ruby

Ruby's regular expression syntax is largely Perl-compatible, so if you’re familiar with the preg_* functions in PHP, regular expressions in Ruby will be easy to learn.

Build a regular expression

Regular expressions in Ruby can be created using different syntaxes.
The most common is by enclosing the pattern in forward-slashes.

%r{} is usually used when the pattern contains a lot of forward-slashes (such as a filepath).

Regular expressions can also be explicitly instantiated using the Regexp class.

Source code

/[a-z0-9]+\s/mi
%r{/path/to/gif\.gif}mi
Regexp.new('[a-z0-9]+\s', Regexp::IGNORECASE | Regexp::MULTILINE) # single quotes: in a double-quoted string "\s" is a space
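As a minimal sketch, the three forms above can be compared side by side. The variable names are illustrative only. The pattern string passed to Regexp.new is single-quoted here because a double-quoted Ruby string would turn \s into a literal space.

```ruby
# Three ways to build the same pattern: one or more of a-z/0-9 followed by whitespace.
literal = /[a-z0-9]+\s/mi
percent = %r{[a-z0-9]+\s}mi
built   = Regexp.new('[a-z0-9]+\s', Regexp::IGNORECASE | Regexp::MULTILINE)

# Regexp#== compares the source text and the options.
puts literal == percent   # true
puts literal == built     # true

# %r{} avoids having to escape forward slashes in paths.
path = %r{/path/to/gif\.gif}
puts path.match?('/path/to/gif.gif')   # true
puts path.match?('/path/to/gifxgif')   # false
```

Which form to use is mostly a readability choice: literals for fixed patterns, %r{} for slash-heavy ones, and Regexp.new when the pattern arrives as a string.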

Regexp.new

Constructs a new regular expression from pattern, which can be either a String or a Regexp. If pattern is a Regexp, that regexp’s options are propagated and new options may not be specified (a change as of Ruby 1.8).

r1 = Regexp.new('^a-z+:\\s+\w+') #=> /^a-z+:\s+\w+/
r2 = Regexp.new('cat', true) #=> /cat/i
r3 = Regexp.new('dog', Regexp::EXTENDED) #=> /dog/x
r4 = Regexp.new(r2) #=> /cat/i

options:
If options is an Integer (a Fixnum in Ruby versions before 2.4), it should be one or more of the following constants, or-ed together:

Regexp::EXTENDED – /x – extended mode – whitespace and comments in the pattern are ignored

Regexp::IGNORECASE – /i – case insensitive

Regexp::MULTILINE – /m – multiline mode – ‘.’ will match newline

Otherwise, if options is neither nil nor false, the regexp will be case insensitive.
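The option handling described above can be sketched as follows. The patterns and variable names are made up for illustration; only the Regexp constants and methods come from core Ruby.

```ruby
# Combine option constants with bitwise OR; Regexp#options reports the same bits.
opts = Regexp::IGNORECASE | Regexp::EXTENDED
re = Regexp.new('c a t  # spaces and this comment are ignored', opts)

ignore_case = (re.options & Regexp::IGNORECASE) != 0
multiline   = (re.options & Regexp::MULTILINE) != 0
puts re.match?('CAT')   # true
puts ignore_case        # true
puts multiline          # false

# With MULTILINE, '.' also matches a newline.
dot_plain = Regexp.new('a.b').match?("a\nb")
dot_multi = Regexp.new('a.b', Regexp::MULTILINE).match?("a\nb")
puts dot_plain          # false
puts dot_multi          # true

# Passing true instead of an Integer just means "ignore case".
puts Regexp.new('cat', true).match?('Cat')   # true
```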

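The optional third parameter, lang, enables multibyte support for the regexp (from Ruby 1.9 on, only 'n' has any effect; other values are ignored with a warning):

'n', 'N' = none,

'e', 'E' = EUC,

's', 'S' = SJIS,

'u', 'U' = UTF-8.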

Read more about Regexp.new: http://corelib.rubyonrails.org/classes/Regexp.html#M001571

Use variables in regular expressions

Source code

foo = '[\.\d]+' # a string which is variable
pattern = "referer:#{foo}"
reg1 = Regexp.new(pattern, Regexp::IGNORECASE | Regexp::MULTILINE)
reg2 = /referer:#{foo}/mi
reg3 = /referer:[\.\d]+/mi

Each expression evaluates to the same regular expression, /referer:[\.\d]+/mi.

If you need to escape a string held in the variable foo:

Source code

foo = '192.168.1.5' # a string which is variable
reg1 = /referer:#{Regexp.escape(foo)}/mi

Evaluates to /referer:192\.168\.1\.5/mi
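To see why escaping matters, here is a small sketch that compares an interpolated value with and without Regexp.escape. The variable ip and the test strings are hypothetical.

```ruby
ip = '192.168.1.5'   # hypothetical value, e.g. taken from user input

unescaped = /referer:#{ip}/
escaped   = /referer:#{Regexp.escape(ip)}/

puts Regexp.escape(ip)                        # 192\.168\.1\.5

# Without escaping, each '.' matches any character.
puts unescaped.match?('referer:192x168y1z5')  # true
puts escaped.match?('referer:192x168y1z5')    # false
puts escaped.match?('referer:192.168.1.5')    # true
```

Escaping is generally worth doing whenever the interpolated text should be matched literally rather than interpreted as a pattern.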

Use the regular expression

Source code

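string = "Here is some text referer:0.0.0.0"
# reg1 here is /referer:[\.\d]+/mi from the first example above
string.match(reg1) # returns a MatchData object, or nil if nothing matches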
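A slightly fuller sketch of working with the match result follows. The capture group and the variable names are additions for illustration and are not part of the original pattern.

```ruby
reg = /referer:([\.\d]+)/i   # same idea as above, with a capture group added
string = 'Here is some text referer:0.0.0.0'

m = string.match(reg)        # MatchData, or nil when nothing matches
puts m[0]                    # referer:0.0.0.0
puts m[1]                    # 0.0.0.0

index = (string =~ reg)      # position where the match starts, or nil
puts index                   # 18

no_match = 'nothing to see here'.match(reg)
puts no_match.nil?           # true
```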

Regular Expressions and UTF-8 strings

Working with multibyte strings in regular expressions using \uNNNN:

Source code

pattern = '[\u0000-\u002F]+'
reg = /#{pattern}/
# or
reg = Regexp.new pattern, nil
# or with options
reg = Regexp.new pattern, Regexp::IGNORECASE | Regexp::MULTILINE

Both the pattern string used to build the regular expression and the string you are searching must be UTF-8 encoded (or at least encoding-compatible).

Otherwise you may get errors like

‘incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)’.
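The following sketch reproduces this kind of error under one assumption: the regexp is a binary one containing high bytes, written here with the /n literal flag, and the searched string is UTF-8 with a non-ASCII character. The exact message wording may differ between Ruby versions.

```ruby
# A binary regexp that contains high bytes has a fixed ASCII-8BIT encoding.
binary_re = /[\x80-\xFF]/n
utf8_text = "h\u00e9llo"   # UTF-8 string with one non-ASCII character

begin
  binary_re.match(utf8_text)
  result = :matched
rescue Encoding::CompatibilityError => e
  result = :incompatible
  puts e.message
end
puts result   # incompatible

# The same regexp works against the raw bytes of the string.
puts binary_re.match?(utf8_text.b)   # true
```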

To ensure this, you can call the .encode method on the string:

Source code

pattern = '[\u0000-\u002F]+'.encode('UTF-8')
reg = /#{pattern}/
# or
reg = Regexp.new pattern.encode('UTF-8'), nil
# or with options
reg = Regexp.new pattern.encode('UTF-8'), Regexp::IGNORECASE | Regexp::MULTILINE
puts reg.encoding # US-ASCII for this ASCII-only pattern (compatible with UTF-8); UTF-8 once the pattern contains non-ASCII characters
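As a rough illustration of how the reported encoding depends on the pattern contents, the sketch below compares an ASCII-only pattern with one that uses non-ASCII \uNNNN escapes. The Cyrillic range is an arbitrary example, not taken from the original text.

```ruby
ascii_re   = Regexp.new('[a-z]+'.encode('UTF-8'))
unicode_re = Regexp.new('[\u0400-\u04FF]+'.encode('UTF-8'))   # Cyrillic block

puts ascii_re.encoding            # US-ASCII
puts ascii_re.fixed_encoding?     # false
puts unicode_re.encoding          # UTF-8
puts unicode_re.fixed_encoding?   # true

# Both can search UTF-8 text.
puts ascii_re.match?("caf\u00e9")        # true
puts unicode_re.match?("\u043F\u0440")   # true
```

An ASCII-only regexp is not tied to one encoding, which is why it reports US-ASCII yet still matches UTF-8 strings.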

If you use the option ‘n’, the Regexp object will be in ASCII-8BIT or US-ASCII encoding even if the pattern string is in UTF-8. In most cases it can still be used to search UTF-8 strings.

Source code

# 1
reg = Regexp.new '[a-zA-Z]+'.encode('UTF-8'), Regexp::IGNORECASE | Regexp::MULTILINE, 'n'
puts reg.encoding # US-ASCII
# 2
reg = Regexp.new '[\x80-\xFF]+'.encode('UTF-8'), Regexp::IGNORECASE | Regexp::MULTILINE, 'n'
puts reg.encoding # ASCII-8BIT
# 3
reg = Regexp.new '[\u0000-\u002F]+'.encode('UTF-8'), Regexp::IGNORECASE | Regexp::MULTILINE, 'n'
puts reg.encoding # US-ASCII
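The three-argument form above belongs to older Ruby versions. As a sketch under the assumption that Regexp::NOENCODING in the options argument plays the same role as 'n', the same two outcomes can be reproduced with only two arguments:

```ruby
# Assumption: Regexp::NOENCODING in the options stands in for the 'n' argument.
opts = Regexp::IGNORECASE | Regexp::MULTILINE | Regexp::NOENCODING

ascii_re  = Regexp.new('[a-zA-Z]+', opts)
binary_re = Regexp.new('[\x80-\xFF]+', opts)

puts ascii_re.encoding    # US-ASCII
puts binary_re.encoding   # the binary encoding (ASCII-8BIT)

# An ASCII-only regexp built this way can still search a UTF-8 string.
word = ascii_re.match('Hello world'.encode('UTF-8'))[0]
puts word                 # Hello
```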

Working with multibyte strings in regular expressions using \xNN:

If you use sequences like \xNN in your regular expressions, you may get an error like “invalid multibyte escape” in Ruby 1.9.x and later. It occurs when the escaped bytes (typically \x80 and above) do not form valid characters in the pattern's encoding.

For example, the following regular expression gives an error.

Source code

pattern = '[\x80-\xFF]+'
reg1 = Regexp.new(pattern, Regexp::IGNORECASE | Regexp::MULTILINE)
# error: invalid multibyte escape

To avoid this problem you should use the following syntax to create a Regexp object:

Source code

pattern = '[\x80-\xFF]+'
reg1 = Regexp.new pattern, nil, 'n'
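A hedged sketch of the failure and one way around it follows. It uses a pattern with high bytes (\x80-\xFF), which is the case that is rejected in a UTF-8 pattern, and assumes Regexp::NOENCODING as a two-argument stand-in for 'n'. The sample byte string is made up.

```ruby
pattern = '[\x80-\xFF]+'   # escaped high bytes, not valid UTF-8 characters on their own

begin
  Regexp.new(pattern)
  outcome = :compiled
rescue RegexpError => e
  outcome = :rejected
  puts e.message
end
puts outcome               # rejected

# Asking for a binary regexp lets the same pattern compile.
binary_re = Regexp.new(pattern, Regexp::NOENCODING)
bytes = "abc\xE9\xFFdef".b # hypothetical binary data
matched = binary_re.match(bytes)[0].bytes
puts matched.inspect       # [233, 255]
```

Matching against a binary copy of the data (String#b) keeps the regexp and the string in the same encoding.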

Find more discussions about this issue:

– http://www.ruby-forum.com/topic/183413

– Encodings in Ruby

– Details on encodings and regexp
