Hudzilla.org - the homepage of Paul Hudson

4.8.3     Advanced regexps


If you are confused already, it is probably best that you re-read the last section before continuing - the expressions only get more complicated!

We have gone through basic and novice regexps - now we're onto the powerful stuff. Regexps allow you to use the characters +, *, ?, { }, $, and ^ outside of sets to have special meaning.

The first four control how many times the previous expression should match, and the last two anchor the match to a position. + means "match one or more of the previous expression", * means "match zero or more of the previous expression", and ? means "match 0 or 1 of the previous expression".

Here are some examples:

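<?php
    // Example subject string; any string will do
    $string = 'Set $foo to -42';

    preg_match("/[A-Za-z]*/", $string);

    preg_match("/-?[0-9]+/", $string);

    // Single quotes keep the backslash, so the regexp engine sees \$ (a literal dollar)
    preg_match('/\$[A-Za-z_][A-Za-z_0-9]*/', $string);
?>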

The first expression will match "", "a", "aaaa", "The sun has got his hat on", and in fact any string at all - the expression can be translated as "match zero or more uppercase and lowercase letters", and because zero letters is acceptable, it succeeds even on a string with no letters in it. Note that in "The sun has got his hat on" the part actually matched is just "The", as the space is not in the set. The second regexp will match 1, 100, 324343995, and also -1, -100, -234011, etc - the "-?" means "match exactly 0 or 1 minus symbols".
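
To see exactly which part of a string each expression picks up, preg_match() accepts an optional third argument that receives the matched text. The snippet below is a small sketch of that; the sample strings and the variable names $found and $matches are chosen purely for illustration.

<?php
    // Zero letters is an acceptable match, so this succeeds even on digits only
    $found = preg_match("/[A-Za-z]*/", "12345", $matches);
    print "$found: '{$matches[0]}'\n";   // 1: ''

    // The match stops at the first character that is not a letter
    preg_match("/[A-Za-z]*/", "The sun has got his hat on", $matches);
    print "'{$matches[0]}'\n";           // 'The'

    // The optional minus sign is included when it is present
    preg_match("/-?[0-9]+/", "Balance: -100 dollars", $matches);
    print "'{$matches[0]}'\n";           // '-100'
?>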

The last regexp is fairly complicated, but, as always with regexps, complexity == power. As mentioned before, $ is a regexp symbol in its own right, however here we precede it with a backslash, which, unsurprisingly, works as an escape character, turning the $ into a standard character and not a regexp symbol. Be aware that PHP also treats \$ as an escape inside double-quoted strings, so the backslash only reaches the regexp engine if the pattern is written in single quotes. We then match precisely one character from the set A-Z, a-z, and _, then match zero or more characters from the set A-Z, a-z, underscore, and 0-9. What kind of text would that match? Here are some examples: $A, $B, $C, $foo, $bar, $Test99, $_MyTest, $__Foo__. Look familiar? That's right - that regexp will match PHP variables.
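
As a rough illustration, the same pattern can pull every variable name out of a piece of text. This sketch uses preg_match_all(), a standard PHP function that is not covered in this section and that collects all matches rather than just the first; the sample text in $code is made up for the example.

<?php
    // Single quotes stop PHP from interpreting $ or swallowing the backslash
    $code = 'Set $foo = $bar + $_MyTest; but 50$ is just a price';

    // $matches[0] receives every piece of text that matched the whole pattern
    preg_match_all('/\$[A-Za-z_][A-Za-z_0-9]*/', $code, $matches);

    print implode(", ", $matches[0]) . "\n";   // $foo, $bar, $_MyTest
?>

The dollar in "50$" is skipped because it is not followed by a letter or underscore.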

Opening braces { and closing braces } can be used to define specific repeat counts in three different ways. Firstly, {n}, where n is a positive number, will match n instances of the previous expression. Secondly, {n,} will match a minimum of n instances of the previous expression. Finally, {m,n} will match a minimum of m instances and a maximum of n instances of the previous expression. Note that there are no spaces inside the braces.

Here is a list of advanced regular expressions using braces, with the string each one is tested against, and whether or not a match is made:

Regexp                     String       Result
/[A-Z]{3}/                 FuZ          No match; the regexp needs three consecutive uppercase letters
/[A-Z]{3}/i                FuZ          Match; same as above, but case insensitive this time
/[0-9]{3}-[0-9]{4}/        555-1234     Match; precisely three digits, a dash, then precisely four. This will match local US telephone numbers, for example
/[a-z]+[0-9]?[a-z]{1}/     aaa1         Match; the regexp is not anchored, so it matches "aaa" and ignores the trailing 1
/[A-Z]{1,}99/              99           No match; needs at least one uppercase letter before the 99
/[A-Z]{1,5}99/             FINGERS99    Match; the regexp is not anchored, so it matches "NGERS99", skipping the first two letters
/[A-Z]{1,5}[0-9]{2}/i      adams42      Match
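
None of the patterns in the table above is anchored, so a pattern may match only part of its string. A quick way to check each row is to run it and print the text that matched. The following is a minimal sketch; the $tests and $results arrays are names invented for this example.

<?php
    // Each entry is array(regexp, string), copied from the table above
    $tests = array(
        array('/[A-Z]{3}/',             'FuZ'),
        array('/[A-Z]{3}/i',            'FuZ'),
        array('/[0-9]{3}-[0-9]{4}/',    '555-1234'),
        array('/[a-z]+[0-9]?[a-z]{1}/', 'aaa1'),
        array('/[A-Z]{1,}99/',          '99'),
        array('/[A-Z]{1,5}99/',         'FINGERS99'),
        array('/[A-Z]{1,5}[0-9]{2}/i',  'adams42'),
    );

    $results = array();
    foreach ($tests as $test) {
        list($regexp, $string) = $test;
        // preg_match() returns 1 on a match and 0 otherwise
        if (preg_match($regexp, $string, $matches)) {
            $results[] = "$regexp on $string: matched '{$matches[0]}'";
        } else {
            $results[] = "$regexp on $string: no match";
        }
    }
    print implode("\n", $results) . "\n";
?>

Printing the matched text shows which part of the string was used: for aaa1 the match is "aaa", and for FINGERS99 it is "NGERS99".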

Finally, we have the dollar $ and caret ^ symbols, which mean "end of line" and "start of line" respectively. Consider the following string:

$multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";

As you know, \n means "new line", so what we have there is a string containing the following text:

This is
a long test
to see whether
the dollar
Symbol
and the
caret symbol
work as planned

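By default, $ and ^ refer to the end and start of the whole string rather than of each line inside it. In order to match at every line of a multi-line string, we need the "m" modifier, which goes after the final slash. Here is some PHP code - which expressions do you think will match?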

<?php
    preg_match("/is$/m", $multitest);

    preg_match("/the$/m", $multitest);

    preg_match("/^the/m", $multitest);

    preg_match("/^Symbol/m", $multitest);

    preg_match("/^[A-Z][a-z]{1,}/m", $multitest);
?>

The answer is "all of them" - they all match. Line one means "return true if 'is' is at the end of a line", line two is "return true if 'the' is at the end of a line", and line three is "return true if 'the' is at the start of a line". Line four is "return true if 'Symbol' is at the start of a line", and line five is "return true if there is a capital letter followed by one or more lowercase letters at the start of a line".
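
To see what the "m" modifier contributes, the same pattern can be tried with and without it. This is only a small sketch; $without and $with are names made up for the comparison.

<?php
    $multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";

    // Without "m", ^ only applies at the very start of the whole string
    $without = preg_match("/^Symbol/", $multitest);    // 0

    // With "m", ^ also applies immediately after every line break
    $with = preg_match("/^Symbol/m", $multitest);      // 1

    print "Without m: $without, with m: $with\n";
?>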

As you can see, matching the beginning and end of a line is simple with the $ and ^ characters, but when combined with +, *, ?, and { }, your regular expression-matching ability should rocket upwards.
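
One common combination is to put ^ at the front and $ at the back of a pattern so that the whole string has to fit it, not just a part. The sketch below assumes a hypothetical helper called is_whole_number(), which is not a PHP built-in, wrapped around the integer regexp from earlier in this section.

<?php
    // Hypothetical helper: true only if the whole string is an integer
    function is_whole_number($text) {
        // ^ and $ pin the pattern to both ends of the string
        // (note that $ still tolerates a single trailing newline)
        return preg_match('/^-?[0-9]+$/', $text) === 1;
    }

    var_dump(is_whole_number("-234011"));      // bool(true)
    var_dump(is_whole_number("abc 42 def"));   // bool(false)

    // Without the anchors, the 42 in the middle is enough for a match
    var_dump(preg_match('/-?[0-9]+/', "abc 42 def") === 1);   // bool(true)
?>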

However, we're not finished yet, grasshopper - if you wish to attain regexp nirvana, you need to understand the last few secrets of regexp wisdom...






Comments from other readers
;O_O; - 06 Sep 2008

Take a chill pill. phew, scorched.
:D
G00d b00k xD

hebrego@gmail.com - 06 Sep 2008

I think there are 2 mistakes in the table of regexps: examples 4 and 6 do match. I ran:

if (preg_match("/[a-z]+[0-9]?[a-z]{1}/", "aaa1")) {
print "Got match!5<br>";
}
if (preg_match("/[A-Z]{1,5}99/", "FINGERS99")) {
print "Got match!6<br>";
}

And I got:

Got match!5
Got match!6


I believe that in the first example, [a-z]+ matches the first two a's, the "?" lets [0-9] match nothing, and [a-z]{1} then matches the third a. The digit that follows doesn't matter because the whole pattern has already matched.

And in the second example, the match starts at the third letter, the N, so it matches NGERS99. If you type this:
preg_match("/[0-9][A-Z]{1,5}99/", "3FINGERS99")
it won't match.

Even so, this book is great. I have learnt a lot.

March 12,2006

rmh - 06 Sep 2008

Tom,

Windows uses '\n' as a new-line character as well... it's not a 'Unix' thing, but is instead just an ASCII character corresponding to a new line (it's 10, the line feed; 13 is the carriage return, '\r'). '<br />' is HTML. If you're confused, use a couple of newline characters ('\n') in your PHP code and look at the resulting HTML source. It will have new lines visible in the source.

Tom - 06 Sep 2008

I completely understand all of this so far, however what about us Windows users? We cannot use \n; instead we have to use <br />, and the preg_match() function does not read <br /> as a new line.

trellph@gmail.com - 06 Sep 2008

The answer is "all of them" - they all match. Line one means "return true if 'is' is at the end of a line", line two is "return true if 'the' is at the end of a line", and line three is "return true if 'the' is at the end of a line". Line four is "return true if "Symbol" is at the start of a line", and line five is "return true if there is a capital letter followed by one or more lowercase letters at the start of a line.

That should read for the third instance " return true if 'the' is at the start of a line".

preg_match("/^the/m", $multitest);


