Working with regular expressions (preg_) and UTF-8 strings in PHP

By default, the preg_ functions (such as preg_match, preg_split, and preg_replace) treat the subject as a sequence of bytes, so the patterns \w, \d, and \s will not work as expected for non-Latin letters in a UTF-8 string.

First of all, you must use the /u modifier to work with UTF-8 strings correctly.
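
To see what the /u modifier changes, here is a minimal sketch that counts matches of the dot with and without it. The sample string is my own, chosen because every Cyrillic letter takes two bytes in UTF-8.

```php
<?php
// Six Cyrillic letters, two bytes each in UTF-8.
$s = 'Привет';

// Without /u the dot matches one byte at a time.
$bytes = preg_match_all('/./', $s, $m1);

// With /u the dot matches one whole character (code point) at a time.
$chars = preg_match_all('/./u', $s, $m2);

echo $bytes, ' ', $chars, "\n"; // 12 6
```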

For many common tasks, the best tools are the escapes \p and \P, which match characters by Unicode property, and \X, which matches an extended Unicode sequence.

Let’s start with some examples.

Examples

1. Match an alphanumeric character or a dash (including non-Latin letters and digits):

$s = 'your string in UTF-8 ';
$res = preg_match_all('/[\w\p{L}\p{N}\p{Pd}]/u', $s, $m);
print_r($m);

Here:
\p{L}, \pL – a Unicode letter

\p{N} – a Unicode number

\p{Pd} – a Unicode dash punctuation character
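
To make the three properties easier to tell apart, the sketch below applies each one separately to a mixed string. The sample string and variable names are only illustrative.

```php
<?php
// Latin and Cyrillic words, a number, an en dash and a hyphen-minus.
$s = 'Order 66 – привет-мир';

// \p{L}+ : runs of letters from any script.
preg_match_all('/\p{L}+/u', $s, $letters);

// \p{N}+ : runs of numeric characters.
preg_match_all('/\p{N}+/u', $s, $numbers);

// \p{Pd} : dash punctuation, which covers both "–" and "-".
preg_match_all('/\p{Pd}/u', $s, $dashes);

print_r($letters[0]); // Order, привет, мир
print_r($numbers[0]); // 66
print_r($dashes[0]);  // –, -
```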

 

2. Match a certain UTF-8 character or a range of characters.

2.1. Search for the letter ‘À’ (LATIN CAPITAL LETTER A WITH GRAVE).

A UTF-8 encoding table lists this letter as the bytes C3 80, but in /u mode \x{...} expects the Unicode code point, which is U+00C0:

$s = 'your string in UTF-8 with symbol À inside  ';
$res = preg_match('/\x{C0}/u', $s);

2.2. Search for Latin capital letters A with an accent.
All such letters lie in the code point range U+00C0 to U+00C6 (UTF-8 bytes C3 80 to C3 86); the last one, U+00C6, is the ligature Æ.

$s = 'your string in UTF-8 with symbol À inside  ';
$res = preg_match('/[\x{C0}-\x{C6}]+/u', $s);
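
The UTF-8 bytes of a character and its code point are easy to mix up, so here is a small sketch that prints both for ‘À’ and then builds the pattern from the code point. It assumes the mbstring extension is loaded (mb_ord needs PHP 7.2 or later) and uses the "\u{...}" string escape from PHP 7.

```php
<?php
// LATIN CAPITAL LETTER A WITH GRAVE, written as an escape to avoid encoding surprises.
$ch = "\u{C0}";

// Raw UTF-8 bytes of the character, as hex.
$bytes = bin2hex($ch);                     // c380

// Unicode code point, as hex. This is the value \x{...} wants in /u mode.
$codePoint = dechex(mb_ord($ch, 'UTF-8')); // c0

// Build the pattern /\x{c0}/u from the code point and search with it.
$found = preg_match('/\x{' . $codePoint . '}/u', "symbol {$ch} inside");

echo $bytes, ' ', $codePoint, ' ', $found, "\n"; // c380 c0 1
```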

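3. Match non-Latin (Cyrillic, Arabic, Greek…) characters by script name.

This example, found at http://php.net/manual/en/function.preg-match.php, accepts a string made up only of Latin letters, Cyrillic letters, digits, whitespace, and hyphens: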

preg_match("/^[a-zA-Z\p{Cyrillic}0-9\s\-]+$/u", "ABC abc 1234 АБВ абв");

4. Test for valid UTF-8 and XML/XHTML character range compatibility:

$invalid = preg_match('@[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]@u', $text);

Ref: http://www.w3.org/TR/2000/REC-xml-20001006#charsets
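
One detail worth handling is that preg_match has three possible results here: 0 (nothing outside the XML ranges), 1 (a disallowed character was found), and false (the subject is not valid UTF-8, so the /u pattern could not run). Below is a minimal sketch of a wrapper that folds these into a boolean. The function name isXmlSafeUtf8 is hypothetical and not part of PHP or of the original snippet.

```php
<?php
// Hypothetical helper: true only for valid UTF-8 made of XML 1.0 characters.
function isXmlSafeUtf8(string $text): bool
{
    $pattern = '@[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]@u';
    $result = preg_match($pattern, $text);

    // false: the engine rejected the subject, e.g. malformed UTF-8
    // (preg_last_error() then reports PREG_BAD_UTF8_ERROR).
    if ($result === false) {
        return false;
    }

    // 1: a character outside the allowed XML ranges was found.
    return $result === 0;
}

var_dump(isXmlSafeUtf8("plain text\n"));   // bool(true)
var_dump(isXmlSafeUtf8("bell \x07 char")); // bool(false), control character
var_dump(isXmlSafeUtf8("bad \xC3\x28"));   // bool(false), malformed UTF-8
```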

 

Some theory about PCRE

PCRE – Perl-compatible regular expressions

// from http://www.pcre.org/pcre.txt
By default, in UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. These sequences retain their original meanings from before UTF-8 support was available, mainly for efficiency reasons.
However, if PCRE is compiled with Unicode property support, and the PCRE_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows:

\d – any character that \p{Nd} matches (decimal digit)
\s – any character that \p{Z} matches, plus HT, LF, FF, CR
\w – any character that \p{L} or \p{N} matches, plus underscore
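
A quick way to check how your own PHP build behaves is to test \w against a non-Latin word with and without /u. As far as I know, PHP has turned on PCRE_UCP together with /u since version 5.3.2, so the second call should report a match; treat that as an assumption about your build rather than a guarantee.

```php
<?php
$word = 'привет';

// Byte mode: \w is tested against single bytes, not against letters.
$byteMode = preg_match('/^\w+$/', $word);

// UTF-8 mode: expected to be 1 on builds where /u also enables PCRE_UCP.
$utf8Mode = preg_match('/^\w+$/u', $word);

var_dump($byteMode, $utf8Mode);
```

If the second result is not 1, fall back to the explicit properties, for example [\p{L}\p{N}_] in place of \w.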

 

Unicode character properties

When PCRE is built with Unicode character property support, three additional escape sequences that match characters with specific properties are available:

\p{xx} a character with the xx property
\P{xx} a character without the xx property
\X an extended Unicode sequence

The property names represented by xx above are limited to the Unicode script names, the general category properties, “Any”, which matches any character (including newline), and some special PCRE properties (described in the next section).

For example:

\p{Greek}
\P{Han}
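
Here is a short sketch of all three escapes in use. The sample strings are my own; the second one places a combining acute accent (U+0301) after "e" to show that \X counts the pair as one unit.

```php
<?php
$s = 'αβγ abc 漢字';

// \p{Greek}+ : runs of characters from the Greek script.
preg_match_all('/\p{Greek}+/u', $s, $greek);

// \P{Han} : any character that is NOT Han; removing those leaves only Han.
$onlyHan = preg_replace('/\P{Han}/u', '', $s);

// \X : one extended Unicode sequence (base character plus combining marks).
$t = "e\u{301}a"; // three code points, two visible characters
$clusters = preg_match_all('/\X/u', $t, $m);
$codePoints = preg_match_all('/./u', $t, $n);

print_r($greek[0]);                        // αβγ
echo $onlyHan, "\n";                       // 漢字
echo $clusters, ' ', $codePoints, "\n";    // 2 3
```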

 

Newline sequences
Outside a character class, by default, the escape sequence \R matches any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the following: (?>\r\n|\n|\x0b|\f|\r|\x85)
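
A typical use of \R is splitting text whose line endings are mixed. The sketch below uses a made-up string containing \r\n, \n, and \r.

```php
<?php
// Windows, Unix and old Mac line endings in one string.
$text = "one\r\ntwo\nthree\rfour";

// \R treats \r\n as a single newline, so no empty elements appear.
$lines = preg_split('/\R/u', $text);

print_r($lines); // one, two, three, four
```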

 

Read the full documentation:

http://www.pcre.org/pcre.txt

 

References

// from http://www.pcre.org/pcre.txt

The following general category property codes are supported and can be used in place of xx in \p{xx} and \P{xx}:

C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate

L Letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter

M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark

N Number
Nd Decimal number
Nl Letter number
No Other number

P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation

S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol

Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator
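
To show how the codes from this table are used, here is a minimal sketch that picks a few of them (Lu, Sc, Nd, P) and applies them to a sample string of my own.

```php
<?php
$s = 'Price: 25 € or $30, Ok?';

// Lu : upper case letters.
preg_match_all('/\p{Lu}/u', $s, $upper);

// Sc : currency symbols.
preg_match_all('/\p{Sc}/u', $s, $money);

// Nd : runs of decimal digits.
preg_match_all('/\p{Nd}+/u', $s, $digits);

// P : any punctuation; currency symbols belong to S, so they stay.
$stripped = preg_replace('/\p{P}+/u', '', $s);

print_r($upper[0]);   // P, O
print_r($money[0]);   // €, $
print_r($digits[0]);  // 25, 30
echo $stripped, "\n"; // Price 25 € or $30 Ok
```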