Regular Expressions in Ruby

 

Ruby uses a Perl-style regular expression syntax, so if you’re familiar with the preg_* functions in PHP, regular expressions in Ruby will be easy to learn.

Build a regular expression

Regular expressions in Ruby can be created using several syntaxes.
The most common is to enclose the pattern in forward slashes.

%r{} is usually used when the pattern itself contains many forward slashes (such as a file path).

Regular expressions can also be explicitly instantiated using the Regexp class.

[codesyntax lang=”rails”]

/[a-z0-9]+\s/mi
%r{/path/to/gif\.gif}mi
Regexp.new('[a-z0-9]+\s', Regexp::IGNORECASE | Regexp::MULTILINE) # single quotes: in a double-quoted string "\s" is a space
[/codesyntax]
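As a quick check of the three forms, here is a minimal sketch that compares the literal with the Regexp.new version and shows why the pattern string is safer in single quotes. The variable names are made up for illustration.

```ruby
literal     = /[a-z0-9]+\s/mi
constructed = Regexp.new('[a-z0-9]+\s', Regexp::IGNORECASE | Regexp::MULTILINE)

# Same pattern text and same option bits.
same_source  = literal.source == constructed.source
same_options = literal.options == constructed.options

# In a double-quoted string, "\s" is already a plain space character,
# so the \s character class never reaches the regexp engine.
double_quoted   = "[a-z0-9]+\s"
ends_with_space = double_quoted.end_with?(' ')

# %r{} avoids escaping forward slashes in path-like patterns.
path_re  = %r{/path/to/gif\.gif}i
path_hit = path_re.match?('/PATH/TO/GIF.GIF')
```

With the double-quoted string, the resulting regexp would only accept a literal space after the word, not a tab or newline.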

Regexp.new

Constructs a new regular expression from pattern, which can be either a String or a Regexp. If a Regexp is given, that regexp’s options are propagated and new options may not be specified (a change as of Ruby 1.8).

r1 = Regexp.new('^a-z+:\\s+\w+') #=> /^a-z+:\s+\w+/
r2 = Regexp.new('cat', true) #=> /cat/i
r3 = Regexp.new('dog', Regexp::EXTENDED) #=> /dog/x
r4 = Regexp.new(r2) #=> /cat/i

options:
If options is an Integer (a Fixnum in older Ruby versions), it should be one or more of the following constants, or-ed together:

Regexp::EXTENDED – /x – extended mode – whitespace in the pattern is ignored

Regexp::IGNORECASE – /i – case insensitive

Regexp::MULTILINE – /m – multiline mode – ‘.’ will match newline

Otherwise, if options is not nil, the regexp will be case insensitive.
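The effect of each flag can be seen in a small sketch like the following. The sample patterns and variable names are illustrative only.

```ruby
# Flags are integer bit masks, so they combine with |.
re = Regexp.new('dog', Regexp::EXTENDED | Regexp::IGNORECASE)
has_ignorecase = (re.options & Regexp::IGNORECASE) != 0
has_multiline  = (re.options & Regexp::MULTILINE) != 0

# MULTILINE: '.' also matches "\n".
dot_default   = /a.b/.match?("a\nb")
dot_multiline = /a.b/m.match?("a\nb")

# EXTENDED: unescaped whitespace inside the pattern is ignored.
spaced = /d o g/x.match?('dog')

# IGNORECASE: letter case is not significant.
upper = /dog/i.match?('DOG')
```

Checking `options` with a bit mask is a handy way to confirm which flags a regexp built at runtime actually carries.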


The lang parameter enables multibyte support for the regexp:

'n', 'N' = none,

'e', 'E' = EUC,

's', 'S' = SJIS,

'u', 'U' = UTF-8.

 

Read more about Regexp.new: http://corelib.rubyonrails.org/classes/Regexp.html#M001571

 

Use variables in regular expressions

[codesyntax lang=”rails”]
foo = '[\.\d]+' # a pattern fragment held in a variable

pattern = "referer:#{foo}"
reg1 = Regexp.new(pattern, Regexp::IGNORECASE | Regexp::MULTILINE)

reg2 = /referer:#{foo}/mi

reg3 = /referer:[\.\d]+/mi
[/codesyntax]

All three evaluate to the same regular expression: /referer:[\.\d]+/mi

 

If the string in the variable foo must be matched literally, escape it with Regexp.escape:

[codesyntax lang=”rails”]
foo = '192.168.1.5' # a literal string held in a variable

reg1 = /referer:#{Regexp.escape(foo)}/mi
[/codesyntax]

Evaluates to /referer:192\.168\.1\.5/mi
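To see why the escaping matters, here is a minimal sketch comparing the escaped and unescaped versions against a look-alike string. The `lookalike` value is an invented test input.

```ruby
foo = '192.168.1.5'

unescaped = /referer:#{foo}/mi                  # '.' matches any character
escaped   = /referer:#{Regexp.escape(foo)}/mi   # '.' matches only a literal dot

lookalike  = 'referer:192x168y1z5'
loose_hit  = unescaped.match?(lookalike)
strict_hit = escaped.match?(lookalike)
exact_hit  = escaped.match?('referer:192.168.1.5')
```

Without Regexp.escape the pattern accepts the look-alike string, which is rarely what you want when the variable holds user-supplied or literal text.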

 

Use a regular expression

[codesyntax lang=”rails”]
string = "Here is some text referer:0.0.0.0"

string.match(reg1) # returns a MatchData object, or nil if nothing matches

[/codesyntax]
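The MatchData returned by match can then be inspected. The sketch below assumes a variant of the referer pattern with a capture group added (the parentheses are not in the original pattern) so that the address can be pulled out on its own.

```ruby
reg    = /referer:([\.\d]+)/i   # parentheses added here to capture the address
string = 'Here is some text referer:0.0.0.0'

m = string.match(reg)           # MatchData, or nil when nothing matches
whole   = m[0]                  # the full matched text
address = m[1]                  # the first capture group
before  = m.pre_match           # everything before the match
index   = string =~ reg         # =~ returns the match position instead

missing = 'no header here'.match(reg)
```

Because match returns nil on failure, it is worth checking the result before indexing into it.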

 

Regular Expressions and UTF-8 strings

Working with multibyte strings in regular expressions using \uNNNN:

[codesyntax lang=”rails”]

pattern = '[\u0000-\u002F]+'

reg = /#{pattern}/

# or

reg = Regexp.new pattern, nil

# or with options

reg = Regexp.new pattern, Regexp::IGNORECASE | Regexp::MULTILINE

[/codesyntax]

Both the pattern string and the string you are searching should be UTF-8 encoded.

Otherwise you may get errors like

‘incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)’.

 

To ensure this, you can call the .encode method on the string:

[codesyntax lang=”rails”]

pattern = '[\u0000-\u002F]+'.encode('UTF-8')

reg = /#{pattern}/

# or

reg = Regexp.new pattern.encode('UTF-8'), nil

# or with options

reg = Regexp.new pattern.encode('UTF-8'), Regexp::IGNORECASE | Regexp::MULTILINE

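puts reg.encoding # UTF-8 when the pattern contains non-ASCII characters; an ASCII-only pattern like this one may report US-ASCII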

[/codesyntax]
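The encoding mismatch and its fix can be reproduced with a short sketch. It assumes a pattern containing a non-ASCII character and a subject string in ISO-8859-1; both are made-up sample values.

```ruby
re     = Regexp.new("caf\u00e9")               # non-ASCII pattern, so the regexp is fixed to UTF-8
latin1 = "caf\u00e9".encode('ISO-8859-1')      # same text, different byte encoding

# Matching across incompatible encodings raises an error.
error = begin
  latin1 =~ re
  nil
rescue Encoding::CompatibilityError => e
  e
end

# Transcoding the subject string to UTF-8 makes the match legal.
position = latin1.encode('UTF-8') =~ re
```

Strings that contain only ASCII characters are generally compatible with any ASCII-compatible regexp, so the problem tends to show up only once real multibyte data arrives.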

 

If you use the option ‘n’, the Regexp object will be in ASCII-8BIT or US-ASCII encoding even if the pattern string is in UTF-8. It should still work for searching in UTF-8 strings.

[codesyntax lang=”rails”]

# 1

reg = Regexp.new '[a-zA-Z]+'.encode('UTF-8'), Regexp::IGNORECASE | Regexp::MULTILINE, 'n'

puts reg.encoding # US-ASCII

# 2

reg = Regexp.new '[\x80-\xFF]+'.encode('UTF-8'), Regexp::IGNORECASE | Regexp::MULTILINE, 'n'

puts reg.encoding # ASCII-8BIT

# 3

reg = Regexp.new '[\u0000-\u002F]+'.encode('UTF-8'), Regexp::IGNORECASE | Regexp::MULTILINE, 'n'

puts reg.encoding # US-ASCII

[/codesyntax]

 

Working with multibyte strings in regular expressions using \xNN:

If you use sequences like \xNN in your regular expressions, you may get an error such as “invalid multibyte escape” in Ruby 1.9.x.

For example, the following regular expression gives an error.

[codesyntax lang=”rails”]

pattern = '[\x00-\x2F]+'

reg1 = Regexp.new(pattern, Regexp::IGNORECASE | Regexp::MULTILINE)

# error: invalid multibyte escape

[/codesyntax]

 

To avoid this problem you should use the following syntax to create a Regexp object:

[codesyntax lang=”rails”]

pattern = '[\x00-\x2F]+'

reg1 = Regexp.new pattern, nil, ‘n’

[/codesyntax]
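As an alternative sketch, the same idea can be expressed with the /n flag on a regexp literal rather than the third argument to Regexp.new. This is a swapped-in technique, not the syntax shown above, and the high-byte range \x80-\xFF and the sample string are assumptions chosen to trigger the error on current Ruby versions.

```ruby
# High bytes in a pattern string are rejected when the regexp would be UTF-8.
utf8_error = begin
  Regexp.new('[\x80-\xFF]+')
  nil
rescue RegexpError => e
  e
end

# The /n literal flag builds a binary (ASCII-8BIT) regexp instead.
binary_re = /[\x80-\xFF]+/n

bytes    = "caf\xE9".b          # String#b returns a binary (ASCII-8BIT) copy
position = bytes =~ binary_re   # index of the first high byte
```

A binary regexp like this is meant for byte-oriented data; the subject string should be binary as well, otherwise the encoding compatibility error described earlier can appear.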

 

Find more discussions about this issue:

http://www.ruby-forum.com/topic/183413

 

 

 

Read more:

Basic Regexp methods in Ruby and comparison with PHP functions

Encodings in Ruby

Details on encodings and regexp
