4.8.3 Advanced regexps
If you are confused already, it is probably best that you re-read the last section before continuing - the expressions only get more complicated!
We have gone through basic and novice regexps - now we're onto the powerful stuff. Outside of sets, the characters +, *, ?, { }, $, and ^ all have special meaning in regexps.
The first four affect how many times the previous expression should match, and the last two affect where in the string a match can occur. + means "match one or more of the previous expression", * means "match zero or more of the previous expression", and ? means "match zero or one of the previous expression".
Here are some examples:
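<?php
    // Zero or more letters
    preg_match("/[A-Za-z]*/", $string);
    // An optional minus sign, then one or more digits
    preg_match("/-?[0-9]+/", $string);
    // Single quotes stop PHP from swallowing the backslash before the $
    preg_match('/\$[A-Za-z_][A-Za-z_0-9]*/', $string);
?>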
The first expression will match "", "a", "aaaa", "The sun has got his hat on", and any other string containing uppercase and lowercase letters - the expression can be translated as "match zero or more uppercase and lowercase letters". In fact, because zero letters is an acceptable match and the pattern is not tied to the start or end of the string, it will succeed against any string at all, even "12345". The second regexp will match 1, 100, 324343995, and also -1, -100, -234011, etc - the "-?" means "match exactly 0 or 1 minus symbols".
The last regexp is fairly complicated, but, as always with regexps, complexity == power. As mentioned before, $ is a regexp symbol in its own right, however here we precede it with a backslash, which, unsurprisingly, works as an escape character turning the $ into a standard character and not a regexp symbol. Be careful with quoting here: inside a double-quoted PHP string, \$ is consumed by PHP itself and the regexp engine only sees a bare $, so this pattern should be written in single quotes for the backslash to survive. We then match precisely one symbol from the range A-Z, a-z, and _, then match zero or more symbols from the range A-Z, a-z, underscore, and 0-9. What kind of text would that match? Here are some examples: $A, $B, $C, $foo, $bar, $Test99, $_MyTest, $__Foo__. Look familiar? That's right - that regexp will match PHP variables.
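To see the three expressions in action, here is a minimal sketch that runs them against a few sample strings. The helper name check_patterns() and the sample strings are made up for illustration; only preg_match() itself comes from PHP.

```php
<?php
// Hypothetical helper: reports which of the three patterns match $s.
function check_patterns(string $s): array
{
    return [
        // preg_match() returns 1 on a match and 0 otherwise.
        'letters'  => preg_match("/[A-Za-z]*/", $s),  // zero or more letters
        'number'   => preg_match("/-?[0-9]+/", $s),   // optional minus, then digits
        // Single quotes keep the backslash, so the regexp engine sees \$.
        'variable' => preg_match('/\$[A-Za-z_][A-Za-z_0-9]*/', $s),
    ];
}

// Arbitrary sample strings, chosen only for this sketch.
foreach (["-42", "abc", '$foo = 1;', ""] as $s) {
    echo var_export($s, true), ": ", json_encode(check_patterns($s)), "\n";
}

// The optional third argument receives the text that was matched.
preg_match('/\$[A-Za-z_][A-Za-z_0-9]*/', 'echo $_MyTest;', $m);
echo $m[0], "\n";
```

Notice that the "letters" pattern reports a match for every sample, including the empty string, because * is happy with zero repetitions.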
Opening braces { and closing braces } can be used to define specific repeat counts in three different ways. Firstly, {n}, where n is a positive number, will match n instances of the previous expression. Secondly, {n,} will match a minimum of n instances of the previous expression. Finally, {m,n} will match a minimum of m instances and a maximum of n instances of the previous expression. Note that there are no spaces inside the braces.
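A quick way to get a feel for the three forms is to let preg_match_all() show what each one grabs from the same string. This is only a sketch; the subject string and the variable names $subject, $results, and $found are arbitrary choices for the example.

```php
<?php
$subject = "aaaaaa"; // six letters, an arbitrary sample
$results = [];

foreach (["/a{3}/", "/a{4,}/", "/a{2,4}/"] as $pattern) {
    // preg_match_all() collects every non-overlapping match into $found[0].
    preg_match_all($pattern, $subject, $found);
    $results[$pattern] = $found[0];
    echo $pattern, " => ", implode(", ", $found[0]), "\n";
}
```

Here {3} finds two separate runs of three, {4,} takes all six letters in one go, and {2,4} takes four letters first and then the remaining two.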
Here is a list of advanced regular expressions using braces, with the string each one is tested against, and whether or not a match is made:

| Regexp | String | Result |
|---|---|---|
| /[A-Z]{3}/ | FuZ | No match; the regexp needs three uppercase letters in a row |
| /[A-Z]{3}/i | FuZ | Match; same as above, but case insensitive this time |
| /[0-9]{3}-[0-9]{4}/ | 555-1234 | Match; precisely three numbers, a dash, then precisely four. This will match local US telephone numbers, for example |
| /[a-z]+[0-9]?[a-z]{1}/ | aaa1 | Match; the number is optional and the regexp is not tied to the end of the string, so it matches "aaa" and ignores the 1 |
| /[A-Z]{1,}99/ | 99 | No match; the 99 must follow at least one uppercase letter |
| /[A-Z]{1,5}99/ | FINGERS99 | Match; the regexp is not tied to the start of the string, so it matches "NGERS99". The {1,5} limits how many letters are part of the match, not how many are in the string |
| /[A-Z]{1,5}[0-9]{2}/i | adams42 | Match |
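If you want to check rows like these for yourself, a small loop over preg_match() will do it. The sketch below uses the patterns and strings from the table; the names $rows and $results are just for this example. It prints the part of the string that was matched, which makes it easy to spot when a regexp that is not tied to the start or end of the string has matched only a piece of it.

```php
<?php
// Each pair is [pattern, subject], copied from the table above.
$rows = [
    ['/[A-Z]{3}/',             'FuZ'],
    ['/[A-Z]{3}/i',            'FuZ'],
    ['/[0-9]{3}-[0-9]{4}/',    '555-1234'],
    ['/[a-z]+[0-9]?[a-z]{1}/', 'aaa1'],
    ['/[A-Z]{1,}99/',          '99'],
    ['/[A-Z]{1,5}99/',         'FINGERS99'],
    ['/[A-Z]{1,5}[0-9]{2}/i',  'adams42'],
];

$results = [];
foreach ($rows as [$pattern, $subject]) {
    // $m[0] holds the matched text when preg_match() returns 1.
    $matched = preg_match($pattern, $subject, $m);
    $results[] = $matched ? $m[0] : null;
    printf("%-26s %-10s %s\n", $pattern, $subject, $matched ? "match: {$m[0]}" : "no match");
}
```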
Finally, we have the dollar $ and caret ^ symbols, which mean "end of line" and "start of line" respectively. Consider the following string:
$multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";
As you know, \n means "new line", so what we have there is a string containing the following text:
This is
a long test
to see whether
the dollar
Symbol
and the
caret symbol
work as planned
In order to parse multi-line strings correctly, we need the "m" modifier, so "m" needs to go after the final slash. Without it, ^ and $ are tied to the start and end of the whole string rather than to each line. Here is some PHP code - which expressions do you think will match?
<?php
    preg_match("/is$/m", $multitest);
    preg_match("/the$/m", $multitest);
    preg_match("/^the/m", $multitest);
    preg_match("/^Symbol/m", $multitest);
    preg_match("/^[A-Z][a-z]{1,}/m", $multitest);
?>
The answer is "all of them" - they all match. Line one means "return true if 'is' is at the end of a line", line two is "return true if 'the' is at the end of a line", and line three is "return true if 'the' is at the start of a line". Line four is "return true if 'Symbol' is at the start of a line", and line five is "return true if there is a capital letter followed by one or more lowercase letters at the start of a line".
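To see what the "m" modifier actually changes, it helps to run the same pattern with and without it. This is a small sketch using the $multitest string from above; the variable names $without, $with, and $found are only for illustration.

```php
<?php
$multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";

// Without "m", ^ is tied to the start of the whole string, which begins with "This".
$without = preg_match("/^Symbol/", $multitest);

// With "m", ^ can also match straight after each newline.
$with = preg_match("/^Symbol/m", $multitest);

echo "without m: $without, with m: $with\n";

// preg_match_all() lists every line start that fits the last pattern.
preg_match_all("/^[A-Z][a-z]{1,}/m", $multitest, $found);
print_r($found[0]);
```

Only two lines in the string start with a capital letter, so the final pattern finds "This" and "Symbol" and skips the rest.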
As you can see, matching the beginning and end of a line is simple with the ^ and $ characters, and once you combine them with +, *, ?, and { }, your regular expression-matching ability should rocket upwards.
However, we're not finished yet, grasshopper - if you wish to attain regexp nirvana, you need to understand the last few secrets of regexp wisdom...