Pattern Matching with Regular Expressions
Although basic string replacements are very effective in some cases, they are simply useless in others. If you know exactly what you want to replace, such as the word “dog”, str_replace is fine. Sometimes, however, you know only the general form the text will take. You then have to “describe” what the text “looks like” so PHP can find it. Regular expressions are a way to write such a description using regular characters along with wildcards: characters that stand for some unknown character or group of characters.
A good example of this is an HTML anchor (<a>) tag. If you have a whole HTML page stored in a variable and want to find all of the links on the page, the functions you’ve learned so far would require that you develop a pretty complex algorithm for extracting this information. However, regular expressions allow you to specify that you know the string is something like this:
<a href="SOME_STRING">SOME_OTHER_STRING</a>
By doing so, you’ve eliminated most of the problem immediately. In addition to being able to find substrings like this, you can do replacements with them, or return the values you find (such as the values where SOME_STRING and SOME_OTHER_STRING appear). In this case, you would be able to parse the URL and text from HTML code (which could have been submitted by a visitor or retrieved from another Web site). However, since you don’t know what the actual text is that you’re looking for, str_replace doesn’t help any.
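As a preview, here is a minimal sketch of that idea. The syntax is explained over the rest of this section; the HTML fragment, the variable names, and the simplified pattern (which assumes href is the only attribute and is double-quoted) are assumptions made for illustration only.

```php
<?php
// Hypothetical HTML fragment; the URL and link text are placeholders.
$html = 'Visit <a href="http://www.example.com">our site</a> today.';

// Each pair of parentheses captures one unknown part of the tag:
// the first stands in for SOME_STRING, the second for SOME_OTHER_STRING.
$pattern = '/<a href="(.*?)">(.*?)<\/a>/';

if (preg_match($pattern, $html, $parts) === 1) {
    $url  = $parts[1];   // what appeared where SOME_STRING is
    $text = $parts[2];   // what appeared where SOME_OTHER_STRING is
    echo "URL:  $url\n";
    echo "Text: $text\n";
}
```

The two captured values are exactly the pieces that str_replace could not have located, because neither was known in advance.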
Pattern matching was created to accomplish this task. Pattern matching is the process of comparing one string (the string in which substrings are to be found) to another string that contains wildcard characters (the “description” of what the substring should “look like”). Wildcard characters are characters that represent one character or a set of characters. An example of a wildcard character is the asterisk; it is used on both Windows and Unix-based systems to indicate “any character(s).”
For PHP, the wildcards are used in regular expressions, a standard for how wildcards and other characters (collectively known as patterns) are written.
NOTE
In this section, we are discussing only PHP’s support for PCRE (Perl-compatible regular expressions). If you have experience with other regular expressions, you may find some of this to be a little different.
All of this new terminology at once is probably a bit confusing. The following example demonstrates a short pattern and the text it matches:
Pattern: “hello”
Matches: “hello”
As you can see, the pattern only matches one string: itself. This is very much like the behavior of str_replace; the only occurrences found are those that are exactly like the one being searched for.
This example can be expanded a little bit to make it more useful. For example, if you wanted to find the word “hello” anywhere in a sentence, you could use a wildcard to specify that it’s okay for “hello” to be bordered by any number of any characters.
The following example uses a regular expression function, preg_match, which is discussed later in this chapter, to determine whether the word “Hello” appears somewhere within a string:
<?php
/* ch05ex04.php - demonstrates simple use of regular expressions */
$string1 = 'Hello, this is string one.';
$string2 = 'This is string two.';
echo "String1 is: $string1\n";
if ( preg_match("/.*Hello.*/", $string1) )
{
    echo "I found 'Hello' in this string.\n";
}
else
{
    echo "I didn't find 'Hello' in this string.\n";
}
echo "String2 is: $string2\n";
if ( preg_match("/.*Hello.*/", $string2) )
{
    echo "I found 'Hello' in this string.\n";
}
else
{
    echo "I didn't find 'Hello' in this string.\n";
}
?>
The output of this program is
String1 is: Hello, this is string one.
I found 'Hello' in this string.
String2 is: This is string two.
I didn't find 'Hello' in this string.
Just as Windows and Unix-based systems use the asterisk to specify any character, regular expressions (sometimes referred to as regexps, which is pronounced “rej-exps”) use the period to indicate “any character” (by default, any single character except a newline). This and other wildcards are known as qualifiers; other references usually call them metacharacters. Table 5.1 shows the qualifiers PHP recognizes in regular expressions:
Table 5.1. These Qualifiers Are Understood in PHP’s Regular Expressions
Qualifier    Meaning
.            Any character
^            The beginning of the string
$            The end of the string
[]           Used to specify character classes
All other characters are also considered to be qualifiers, but they simply stand for themselves; the ones in the table are the special ones.
For example, to specify that a string may contain the word “Hello” followed by any three characters, I could use the expression “/Hello.../”. If I wanted to ensure that the string matched is the only text within the string we’re testing, I could specify that it border the beginning and end using the appropriate qualifiers; “/^Hello...$/” would do the trick.
The following program demonstrates using these two expressions:
<?php
/* ch05ex05.php - uses some more regular expressions */
$string1 = 'Hello---';     // This one matches both expressions
$string2 = 'Hi, Hello---'; // This one isn't at the beginning of the string
$string3 = 'Hello';        // This one doesn't have three characters after Hello
echo "String1 is: $string1\n";
if ( preg_match("/Hello.../", $string1) )
{
    echo "I found 'Hello...' in this string; checking to see if this is all that's in the string... ";
    if ( preg_match("/^Hello...$/", $string1) )
    {
        echo "it is.\n";
    }
    else
    {
        echo "it isn't.\n";
    }
}
else
{
    echo "I didn't find 'Hello...' in this string.\n";
}
echo "String2 is: $string2\n";
if ( preg_match("/Hello.../", $string2) )
{
    echo "I found 'Hello...' in this string; checking to see if this is all that's in the string... ";
    if ( preg_match("/^Hello...$/", $string2) )
    {
        echo "it is.\n";
    }
    else
    {
        echo "it isn't.\n";
    }
}
else
{
    echo "I didn't find 'Hello...' in this string.\n";
}
echo "String3 is: $string3\n";
if ( preg_match("/Hello.../", $string3) )
{
    echo "I found 'Hello...' in this string; checking to see if this is all that's in the string... ";
    if ( preg_match("/^Hello...$/", $string3) )
    {
        echo "it is.\n";
    }
    else
    {
        echo "it isn't.\n";
    }
}
else
{
    echo "I didn't find 'Hello...' in this string.\n";
}
?>
The output of this program is
String1 is: Hello---
I found 'Hello...' in this string; checking to see if this is all that's in the string... it is.
String2 is: Hi, Hello---
I found 'Hello...' in this string; checking to see if this is all that's in the string... it isn't.
String3 is: Hello
I didn't find 'Hello...' in this string.
The last qualifier on the list is the set of square brackets. These are used to define character classes, or certain groups of characters from which any one character may be used. For example, if you wanted to allow only a vowel to be picked, you might use the character class [aeiou], as in “b[aeiou]t”, which would match “bat”, “bet”, “bit”, “bot”, and “but”. Notice that only one character is allowed from the set.
You can also define character ranges within a character class using the hyphen. To match any alphanumeric character, this character class could be used: [a-zA-Z0-9].
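A short sketch of both kinds of character class follows. The word list and variable names are made up for illustration, and the ^ and $ qualifiers are added so that each whole word, not just part of it, must fit the pattern.

```php
<?php
// Hypothetical word list, chosen only to exercise the class [aeiou].
$words = ['bat', 'bet', 'boat', 'bt', 'but'];
$matched = [];
foreach ($words as $word) {
    // b, then exactly one vowel, then t, and nothing else.
    if (preg_match('/^b[aeiou]t$/', $word) === 1) {
        $matched[] = $word;
    }
}
echo implode(', ', $matched), "\n";

// A class built from ranges: exactly one alphanumeric character.
$letterOk = preg_match('/^[a-zA-Z0-9]$/', 'x');   // 1: x is in a-z
$hyphenOk = preg_match('/^[a-zA-Z0-9]$/', '-');   // 0: - is in none of the ranges
```

“boat” is rejected because the class stands for one character only, and “bt” is rejected because that one character is required.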
Unlike the asterisk in Windows and Unix wildcards, however, one dot stands for exactly one character. As the previous example shows, a pattern that has to match an unknown or large number of characters would quickly become confusing if every character needed its own dot. Instead, you specify how many occurrences of something you wish to allow. The following table shows the modifiers used to specify how many occurrences should be matched (and therefore known as quantifiers):
Table 5.2. These Quantifiers Can Be Used to Specify How Many Occurrences of a Certain Character Are to Be Matched
Quantifier   Meaning
*            Any number of occurrences (zero or more)
+            At least one occurrence (one or more)
?            May or may not occur (zero or one)
{x}          Exactly x number of occurrences
{x,y}        At least x but not more than y occurrences
{x,}         At least x occurrences
To use a quantifier, place it directly after a qualifier. The earlier expression “/Hello.../” could be reexpressed as “/Hello.{3}/”.
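The quantifiers can be tried out with a small table of cases. This is only a sketch: the patterns and subject strings below are invented for illustration, and each pattern is anchored with ^ and $ so the quantifier's effect is easy to see.

```php
<?php
// Each entry pairs a pattern with a subject string (all hypothetical).
$cases = [
    ['/^Hello.{3}$/',  'Hello---'],  // {3}: exactly three characters after Hello
    ['/^Hello.{3}$/',  'Hello--'],   // only two characters follow, so no match
    ['/^go+gle$/',     'gooogle'],   // + : one or more "o"
    ['/^colou?r$/',    'color'],     // ? : the "u" may or may not occur
    ['/^[0-9]{2,4}$/', '12345'],     // {2,4}: five digits is one too many
    ['/^ab*c$/',       'ac'],        // * : zero occurrences of "b" is fine
];

$results = [];
foreach ($cases as [$pattern, $subject]) {
    $result = preg_match($pattern, $subject);
    $results[] = $result;
    printf("%-16s %-10s %s\n", $pattern, $subject, $result ? 'match' : 'no match');
}
```

Note that a quantifier applies only to the single qualifier (or character class) directly before it, which is why “go+gle” repeats the “o” and not the whole “go”.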
NOTE
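If you want to use an actual period, question mark, or so forth, precede it with two backslashes (\\).
Just as you must escape quotes within a string, you must escape the special characters in regexps to get their literal meaning. This would normally be done with a single backslash; however, because the regular expressions here are written in double-quoted strings, where the backslash is itself an escape character, the backslash that escapes the special character should itself be escaped. (PHP leaves a backslash alone when it precedes a character it does not recognize as an escape, so "\." also reaches the regular expression engine intact, but doubling the backslash is the safer habit. In single-quoted strings, a single backslash is enough.)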
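The difference an escape makes can be seen in a few lines. This sketch uses made-up subject strings; preg_quote is a standard PCRE helper in PHP that is not otherwise covered in this section, shown here only as a convenience.

```php
<?php
// Unescaped, the period is a wildcard, so "3x14" also matches.
$loose = preg_match('/^3.14$/', '3x14');

// Escaped, only a literal period will do.
$strictMiss = preg_match('/^3\.14$/', '3x14');
$strictHit  = preg_match('/^3\.14$/', '3.14');

// In a double-quoted string the backslash is doubled; PHP hands \. to the engine.
$doubled = preg_match("/^3\\.14$/", '3.14');

// preg_quote() inserts the backslashes for text that should be taken literally.
$quoted = preg_quote('3.14?');
echo $quoted, "\n";
```

Writing patterns in single-quoted strings keeps the number of backslashes down, which is the form used for the e-mail pattern later in this chapter.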
Before you move on, let’s spend a little bit of time practicing and getting used to regular expressions:
“hello.*” matches “hello” followed by anything. The match may include much more text or it may terminate right after the “o”. Examples include “hello, this is regexps 101” and “hello”. (To insist that the string begin with “hello”, add the beginning qualifier: “^hello.*”.)
“.*hello.*” matches any string with the word “hello” in it. It could be the word “hello” alone or any combination of things, as long as “hello” appears somewhere within, such as “Why, hello John!” and “hello”.
“^hello$” matches a string containing only the word “hello”. If other characters are present, the match fails.
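“[a-zA-Z0-9]+” matches a run of one or more alphanumeric characters, such as “John” and “Smith150”. A space is not in the class, so in “John Smith” the match stops at the space; with the beginning and ending qualifiers added (“^[a-zA-Z0-9]+$”), a string matches only if every character in it is alphanumeric.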
“<a.+href[ ]*=[ ]*['\"]?.*['\"]?.*>” matches an opening anchor tag that contains an href attribute. This pattern is somewhat complex, so a more in-depth explanation of it is necessary. An anchor tag that it is designed to match might look like this:
<a href="http://www.quepublishing.com">
The pattern begins with the literal text <a, followed by at least one character (.+) and the word href. Spaces may appear on either side of the equals sign ([ ]*=[ ]*), and the value may open with a single or double quote (['\"]?). Notice that the double quote must be escaped with a backslash to keep from ending the double-quoted string that contains the expression. The href value itself is matched by the .* combination, and the opening quote, if present, is closed, followed by any other attributes and finally the end of the tag (>). This expression will become useful in demonstrating functions later in this chapter. Make sure you understand what each part of it does and why each character appears where it appears.
The asterisk is a particularly tricky quantifier; it is referred to as a greedy quantifier because it will match the biggest string it can. This can create problems. Consider an anchor tag that isn’t just a simple two-component tag but has a third component for class, and that is followed by the link text This is a test. and a closing </a> tag. The regular expression formulated in the preceding examples will match more of this string than you really intend for it to match. Not only will it match the opening tag, but it will also match the text This is a test. and the closing tag, because at the end it is looking for the largest string of any characters before the last > character. That’s just about everything. However, you can reverse the greediness of the expression by adding question marks after the asterisk quantifiers, like this:
<a.+href[ ]*=[ ]*['\"]?.*?['\"]?.*?>
Notice that the two .* sequences got the addition of a question mark; this will stop the asterisk from going for the biggest string it can find. Rather, each will go only until it finds the text that follows it in the regular expression (>). Now, instead of running to the last > in the string, the match ends at the first > that closes the opening tag.
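The two behaviors can be compared side by side. The sample string below is hypothetical (an anchor tag with a class attribute, some link text, and a closing tag), and the patterns are written in single-quoted strings, so it is the single quote rather than the double quote that needs a backslash.

```php
<?php
// Hypothetical input: a three-component tag followed by text and a closing tag.
$html = '<a class="nav" href="http://www.example.com">This is a test.</a>';

// Greedy version: each .* grabs as much as it can.
$greedy = '/<a.+href[ ]*=[ ]*[\'"]?.*[\'"]?.*>/';
// Non-greedy version: each .*? stops as early as it can.
$lazy   = '/<a.+href[ ]*=[ ]*[\'"]?.*?[\'"]?.*?>/';

preg_match($greedy, $html, $greedyMatch);
preg_match($lazy, $html, $lazyMatch);

echo "Greedy: {$greedyMatch[0]}\n";   // runs on to the final >
echo "Lazy:   {$lazyMatch[0]}\n";     // stops at the > that ends the opening tag
```

The leading .+ is still greedy in both versions, so this sketch says nothing about lines that contain more than one link.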
It’s also possible to let an expression match two (or more) completely different textual occurrences. In the next section of this chapter, for example, the goal is to match both the opening <a> tag and the closing </a> tag. To do this, the expression must be able to say “pick either one of these”. This is done by including an expression for both conditions in the expression and separating the two with a pipe (|). The pipe is read as “or”; abc|def means “match abc or def”.
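A minimal sketch of the “or” operator follows; the subject strings are invented for illustration.

```php
<?php
// Hypothetical subjects used only to show which ones contain abc or def.
$subjects = ['xxabcxx', 'defxx', 'abdef', 'abde'];
$hits = [];
foreach ($subjects as $subject) {
    // The pipe splits the pattern into two alternatives: abc, or def.
    $hits[$subject] = preg_match('/abc|def/', $subject);
}
print_r($hits);

// Parentheses limit how far the alternatives extend: gr, then a or e, then y.
$grey = preg_match('/^gr(a|e)y$/', 'grey');
$groy = preg_match('/^gr(a|e)y$/', 'groy');
```

Without the parentheses, “^gra|ey$” would mean “starts with gra, or ends with ey”, which is rarely what is wanted.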
Basic Pattern Matching
Now that you know the basics of pattern matching, here’s a chance to try them out. The first thing you should do is get acquainted with the preg_match function, which is the basic function for matching strings with regular expressions in PHP. It follows this syntax:
int preg_match($expr, $str [, $result])
Here, $expr is the regular expression, which must have a delimiter added to each end. The easiest thing to do is add a forward slash to each end of the string, like this: “/hello/”. The delimiters are a carryover from Perl that allows certain options to be added after the closing delimiter (such as the i option for case-insensitive matching, which appears in a later example). $str should be the string being compared to the expression, and $result, if specified, becomes an array holding the results of the match. This will be discussed in more detail soon. The function returns 1 if the expression matches and 0 if it does not.
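A brief sketch of the return values and of the delimiters follows. preg_match returns 1 for a match and 0 for no match; the subject strings are placeholders, and the # delimiter in the last call is an alternative PCRE accepts that this chapter does not otherwise use.

```php
<?php
// 1 when the expression is found, 0 when it is not.
$found   = preg_match('/hello/', 'say hello');
$missing = preg_match('/hello/', 'goodbye');

// The return value can be used directly in a condition.
if ($found) {
    echo "The expression matched.\n";
}

// Any matching pair of non-alphanumeric delimiters works; choosing # here
// means the forward slashes in the pattern need no escaping.
$path = preg_match('#^/usr/local#', '/usr/local/bin');
```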
For now, let’s stick with simply testing to see if an expression matches a string. At the beginning of the chapter, the idea of verifying that an email address looks valid was mentioned, so let’s use that example for now.
Before you look at any code, let’s decide what an email address should look like. The following example addresses are all valid email addresses you can use to follow along as the attributes of an email address are described:
example2001@example.com
example-email@example123.com
example.email@this-example.com
example_email@subdomain.example.com
First, you know an e-mail address has two basic parts of interest: that before the @ sign and that after it. (Of course, the @ sign itself must be present, too.) The part before the @ sign may consist of letters, numbers, periods, hyphens, and underscores. The part after the @ sign will be a domain (letters, numbers, hyphens, and periods) with any number of subdomains. For instance, a domain might be simply “example.com”, or it could be “mail.example.com”, or even “in.mail.example.com”.
Now let’s construct the expression you’ll use. The first part of the e-mail address can be expressed as this:
“[a-zA-Z0-9\.\-\_]+”
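Notice that the backslashes keep the special characters from meaning anything other than their literal form. Actually, it isn’t necessary to escape the period (because it is always taken literally within brackets) or the underscore (because it has no special meaning in a regular expression), but doing so can’t hurt anything. The hyphen is different: between two other characters in a class it defines a range, so it must be escaped unless it appears next to a bracket.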
The other part of the address is the domain. The expression for that could be
“([a-zA-Z0-9\-]+\.)+[a-zA-Z0-9\-]+”
The first part of this expression accounts for the domain and possible subdomains, while the latter half accounts for the top-level domain (such as .com, .org, or .net).
Now let’s put this together to verify an e-mail address. To do this, you’ll add the beginning and ending qualifiers. If you don’t, a string such as “ex:ample@example.com” will match even though it’s not a valid address: the portion “ample@example.com” matches, and nothing in the expression says that no other text may be present in the variable. Adding the beginning and ending qualifiers prevents this. You’ll also have to add the slashes for delimiters on either end of the string. Here’s the resulting code:
$email = 'example-email@example-domain.com';
$validateEmail = "/^([a-zA-Z0-9\.\-\_]+)\@(([a-zA-Z0-9\-]+\.)+[a-zA-Z0-9\-]+)$/";
echo (int) preg_match($validateEmail, $email); // echoes 1 for match, 0 for no match
You could insert this code into any program where you wanted to check an e-mail address for typos and it would work with very little modification.
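To see the expression accept and reject addresses, it can be run over a short list. This is a sketch: the candidate addresses are examples only, the domain portion is grouped with parentheses, and the pattern is written in a single-quoted string. Keep in mind that it checks only the shape described above, not whether an address actually exists.

```php
<?php
// The e-mail expression from the text, with parentheses grouping the domain.
$validateEmail = '/^([a-zA-Z0-9\.\-\_]+)\@(([a-zA-Z0-9\-]+\.)+[a-zA-Z0-9\-]+)$/';

// Hypothetical candidates: two well-formed, two malformed.
$candidates = [
    'example2001@example.com',
    'example_email@subdomain.example.com',
    'ex:ample@example.com',   // the colon is not an allowed character
    'example@localhost',      // the domain has no period
];

$verdicts = [];
foreach ($candidates as $candidate) {
    $verdicts[$candidate] = (preg_match($validateEmail, $candidate) === 1);
    echo $candidate, ': ', $verdicts[$candidate] ? 'looks valid' : 'rejected', "\n";
}
```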
There’s also the optional result parameter. If supplied, this parameter becomes an array containing the values of what the regular expression matched. For example, if a third argument were passed in the previous code, it would become an array with element 0 being ‘example-email@example-domain.com’, 1 being ‘example-email’, and 2 being ‘example-domain.com’.
There are rules that dictate which elements of the array contain which matched strings. The first element (0) is always the value of the whole string that was matched. The strings under that (1, 2, 3, and so on) are numbered as the left parenthesis is encountered from left to right. Figure 5.3 illustrates the sequencing of the elements of the array containing the expression’s matches:
Figure 5.3. The elements of the result array will contain the different parts of this regular expression’s match results.
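The numbering can be checked by printing the array. The sketch below assumes the e-mail expression with three sets of parentheses: one around the part before the @ sign, one around the whole domain, and one inside it around the repeated “name plus period” piece. The variable name $result is arbitrary.

```php
<?php
$email = 'example-email@example-domain.com';
$validateEmail = '/^([a-zA-Z0-9\.\-\_]+)\@(([a-zA-Z0-9\-]+\.)+[a-zA-Z0-9\-]+)$/';

// The third argument is filled in by preg_match when the expression matches.
if (preg_match($validateEmail, $email, $result) === 1) {
    foreach ($result as $index => $value) {
        echo "$index: $value\n";
    }
}
// 0 holds the whole match; 1, 2, and 3 follow the order of the left parentheses.
// A repeated group such as the third one keeps only its last repetition.
```

Element 3 exists because the inner parentheses are counted like any others, even though they were added only for grouping.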
Replacements with Regular Expressions
Just as you can check to see if a string matches a pattern, you can perform replacements when strings match particular patterns. Replacements can be the same for all matches of a certain pattern, or they can be based upon what is matched.
The function you’re going to use to perform these replacements is preg_replace. This function uses the following syntax:
string preg_replace(string pattern, string replacement, string str [, int limit])
Where pattern is the pattern to match, replacement is the string to replace the pattern with, str is the string to be replacing in, and the optional parameter limit is the number of times a replacement can be made.
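A minimal sketch of the parameters, including the optional limit, might look like this; the sentence being edited is a placeholder.

```php
<?php
$str = 'one dog, two dogs, three dogs';

// Without a limit, every match is replaced.
$all = preg_replace('/dog/', 'cat', $str);

// With a limit of 1, only the first match is replaced.
$firstOnly = preg_replace('/dog/', 'cat', $str, 1);

echo $all, "\n";
echo $firstOnly, "\n";
```

The original string in $str is left untouched; preg_replace returns the modified copy.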
TIP
preg_replace is case sensitive (which means Jim and jim aren’t considered the same). To work in a case-insensitive fashion, add the i option (the “i” stands for “insensitive”) after the closing delimiter, as in “/jim/i”; Jim, jim, and JIM are then all treated the same.
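Case-insensitive matching with PCRE is requested by placing the i option after the closing delimiter. A short sketch with a made-up string and replacement name:

```php
<?php
$str = 'Jim, jim, and JIM';

// Case sensitive: only the all-lowercase spelling is replaced.
$sensitive = preg_replace('/jim/', 'Bob', $str);

// The i option after the closing delimiter ignores case.
$insensitive = preg_replace('/jim/i', 'Bob', $str);

echo $sensitive, "\n";
echo $insensitive, "\n";
```

The same option works with preg_match, since it belongs to the expression rather than to the function.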
Let’s try a replacement in which all of the links (<a>…</a> tags) in a string are replaced by the text “[Link]”. This requires that you go back to the href pattern you created before. The following code contains that expression:
$match = "<a.+href[ ]*=[ ]*['\"]?.*['\"]?.*>";
Although this matches the tag when the tag is the only thing in a variable, in a longer variable, it’s too greedy. This expression would end up matching everything from the start of the first tag to the last > on the same line. To stop this, turn off the greediness of the asterisk by following it with a question mark.
Another problem with the match string is that it only matches the opening tag and not the closing tag. We need to add a provision for it to match the closing tag, also. This is done with an “or” operator (|).
Here’s the code after those changes:
$match = "<a.+href[ ]*=[ ]*['\"]?.*?['\"]?.*?>|</a>";
From there, all we have to do is add the delimiter slashes and pass it to the function. In adding the delimiter slashes, you have to escape the forward slash in the closing link tag.
The following example completes the process:
<?php
/* ch05ex06.php – replaces all links in a page with [Link] */
$str = <<<END_OF_HTML
<a href="http://www.example.com">This</a> is a link.
If you want a <a href="www.example.com">link</a>, go here.
END_OF_HTML;
$match = "/<a.+href[ ]*=[ ]*['\"]?.*?['\"]?.*?>|<\/a>/i"; // case-insensitive
echo preg_replace($match, '[Link]', $str);
?>
The output for this segment is
[Link]This[Link] is a link.
If you want a [Link]link[Link], go here.
Some replacements with regular expressions are a little more complicated. For example, say you want to make all the e-mail addresses within a string clickable. To do this you need to find the e-mail addresses, then replace those with a string that includes the e-mail address you found both in the link and as the link text.
The first step to referencing text that was matched is to understand how parentheses influence the referencing of text. Every set of parentheses in a regular expression means that it is a segment of the expression that is to be referenced. If you don’t intend to reference the value of a matched expression, it’s generally a good idea not to enclose it in parentheses unless you have to.
Now you need to be able to use the value of a certain set of parentheses. Each value is a variable named after an integer in numeric sequence, starting at one. As a rule, the whole matched string is always $0. So the first set of parentheses encountered from the left would be $1, the second would be $2, and so on.
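Before applying this to e-mail addresses, here is a small sketch of the numbering in action. The strings and patterns are invented for illustration; the replacement strings are single-quoted so that $0, $1, and $2 reach preg_replace exactly as written.

```php
<?php
// Two sets of parentheses: $1 is the last name, $2 is the first name.
$swapped = preg_replace('/^([a-zA-Z]+), ([a-zA-Z]+)$/', '$2 $1', 'Smith, John');

// $0 stands for the whole matched text, with or without parentheses.
$bracketed = preg_replace('/[0-9]+/', '[$0]', 'room 101');

echo $swapped, "\n";
echo $bracketed, "\n";
```

Because a reference can be used more than once in the replacement, the same matched text can appear in several places in the result.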
To match an e-mail address, use the expression
$matchEmail = '/[a-zA-Z0-9\.\-\_]+\@([a-zA-Z0-9\-]+\.)+[a-zA-Z0-9\-]+/';
And to make it clickable, do a preg_replace like this:
$str = 'This is my email address: example@example.com. Try it!';
$matchEmail = '/[a-zA-Z0-9\.\-\_]+\@([a-zA-Z0-9\-]+\.)+[a-zA-Z0-9\-]+/';
echo preg_replace($matchEmail, "<a href=\"mailto:$0\">$0</a>", $str);
The preg_replace goes through the string and finds anything that looks like an e-mail address (as we’ve specified in the regular expression) and replaces it with a link, using the value found with the regular expression both as the link value and the link text. Here’s the HTML output:
This is my email address: <a href="mailto:example@example.com">example@example.com</a>. Try it!
This covers the basic idea behind doing string replacements with references. Using references, you’re able to manipulate text to a virtually unlimited extent.