4. Regular Expressions

Regular expressions are patterns that describe sets of strings. In formal terms, the strings a regular expression describes can be recognized by a regular grammar, a restricted type of context-free grammar. Basically, a pattern is built from three kinds of matching: exact symbol matching, matching of a symbol by a category of symbols, and matching of a specified number of sequential occurrences of a symbol or category. In principle, such strings can be recognized in a single left-to-right pass, although Perl's own matching engine does backtrack.

Perl includes an evaluation component that, given a pattern and a string in which to search for that pattern, determines whether -- and if so, where -- the pattern occurs. These patterns are referred to as regular expressions.

Perl provides a general mechanism for specifying regular expressions. By default, regular expressions are delimited by slashes, e.g., /cat/, and the string that is searched is the default variable, $_. The delimiter can be changed to virtually any nonalphanumeric, non-whitespace character by preceding the first occurrence of the new delimiter with an m, e.g., m#cat#. In this example, the pound sign (#) becomes the delimiter. And, of course, one can apply the expression to strings other than the one contained in $_, as will be explained below.
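A changed delimiter is most useful when the pattern itself contains slashes. The following is a minimal sketch of that situation; the sample path and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample string, stored in the default variable.
$_ = "/usr/local/bin/perl";

# With slash delimiters, each slash inside the pattern must be escaped.
my $with_slashes = (/\/local\/bin/) ? 1 : 0;

# With m and a new delimiter, the same pattern needs no escaping.
my $with_pounds = (m#/local/bin#) ? 1 : 0;

print "slash delimiters: $with_slashes, pound delimiters: $with_pounds\n";
```

Both forms test the same pattern; the second is simply easier to read.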

In addition to providing a general mechanism for evaluating regular expressions, Perl provides several operators that perform various manipulations on strings based upon the results of the evaluation. Several of these, including the substitution and split operators, were introduced in the Perl/CGI Tutorial. They are described in more detail below.

The discussion begins by describing the various mechanisms for specifying patterns and then turns to expression-based operators.


4.1 Patterns

literals

The simplest form of pattern is a literal string. Thus, one can search for /cat/, as discussed in the introduction to this section. Normally, such an expression would appear in some conditional context, such as an if statement.

Example:

if (/cat/) { print "cat found in $_\n"; }

single-character patterns

In addition to including literal characters, expressions can contain categories of characters. The period ( . ) stands for any single character.

Example:

/.at/ # matches "cat" and "bat", but not "at"

An explicit category or class of characters can be specified by placing the characters in square brackets.

Example:

/[0123456789]/

Ranges of characters can also be specified:

Examples:

/[0-9]/ /[a-z]/ /[A-Z]/ /[0-9a-zA-Z]/

Several predefined categories are available. These include:

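\d # a digit
\w # a word character (letter, digit, or underscore)
\s # a whitespace character
\D # anything but a digit
\W # anything but a word character
\S # anything but a whitespace character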

A bracketed class can be turned into a not condition by placing a caret ( ^ ) immediately after the opening bracket. The class then matches any single character that is not listed.

Example:

/[^0-9]/ # not a digit
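The single-character patterns above can be tried against a few strings. This is only a sketch; the sample strings and hash names are made up, and the loop relies on for placing each string in $_ so that the patterns apply to it by default.

```perl
use strict;
use warnings;

my (%any_at, %has_digit, %has_nondigit, %has_space);

# Hypothetical sample strings; the loop places each one in $_.
for ("cat", "at", "route 66", "42") {
    $any_at{$_}       = /.at/    ? 1 : 0;  # some character, then "at"
    $has_digit{$_}    = /[0-9]/  ? 1 : 0;  # at least one digit
    $has_nondigit{$_} = /[^0-9]/ ? 1 : 0;  # at least one non-digit
    $has_space{$_}    = /\s/     ? 1 : 0;  # at least one whitespace character
    print "$_: .at=$any_at{$_} digit=$has_digit{$_} ",
          "nondigit=$has_nondigit{$_} space=$has_space{$_}\n";
}
```

Note that /[^0-9]/ asks whether the string contains at least one non-digit, not whether it is free of digits; "route 66" passes both the digit and the non-digit test.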

sequences

In addition to the literals and single category instances discussed above, patterns can include sequences in which a given symbol or category can occur a variable, but specified, number of times. An asterisk ( * ) indicates zero or more occurrences of the preceding character. A plus sign ( + ) indicates one or more of the preceding character. The question mark ( ? ) indicates zero or one of the preceding character. In each case the "preceding character" may also be a category, such as \d or a bracketed class. The concept of multiplier implied by these facilities can be generalized by placing curly braces around a minimum and a maximum number of occurrences of the preceding character. Specialized forms of the general multiplier exist, as shown in the examples that follow.

Examples:

/a*t/ # any number of a's followed by t
/a+t/ # one or more a's followed by t
/a?t/ # zero or one a followed by t
/a{2,4}t/ # between 2 and 4 a's followed by t
/a{2,}t/ # 2 or more a's followed by t
/a{2}t/ # exactly 2 a's followed by t

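Pattern matching is greedy. The search always begins with the leftmost position in the string at which the pattern can match; from that position, each multiplier takes as many characters as it can while still allowing the rest of the pattern to match. A longer match later in the string does not displace an earlier, shorter one. Greediness affects pattern-based operators such as substitution, discussed below.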
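The effect of greediness is easiest to see with substitution, which is described later in this section. The sketch below uses made-up sample strings and variable names; each substitution is applied to a copy so that the original strings are left alone.

```perl
use strict;
use warnings;

# Hypothetical sample strings.
my $sentence = "cat sat on the mat";
my $letters  = "a aaa";

# Greedy: starting at the first "a", .* runs as far as it can while
# still leaving a final "t" to match, so nearly the whole string goes.
(my $greedy = $sentence) =~ s/a.*t/X/;

# Leftmost: the single "a" at the start wins, even though a longer
# run of a's appears later in the string.
(my $leftmost = $letters) =~ s/a+/X/;

print "$greedy\n";    # prints "cX"
print "$leftmost\n";  # prints "X aaa"
```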

memory

Part of the string that matches a pattern can be saved for use later in the same pattern or in subsequent statements. This is done by placing the portion of the pattern to be remembered in parentheses ( () ). Within the same pattern, the matched segment can be referred to as \1. Multiple segments, specified by multiple pairs of parentheses in the pattern, are available as \1, \2, \3, etc., in the order corresponding to the different parenthesized components. After the match, including in the replacement part of a substitution, these stored segments are available in the variables $1, $2, $3, etc.

Other information available in variables includes $&, the entire sequence that matched; $`, everything in the string before the match; and $', everything in the string after the match.

Examples:

/c(.*)t/
# in "caaat": $1 is "aaa"; $& is "caaat"; $` and $' are both empty
# in "a caaat!": $1 is "aaa"; $& is "caaat"; $` is "a "; $' is "!"
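The memory variables can be inspected directly. In the sketch below, the sample strings and the variable names are made up for illustration. For the string "a caaat!", the parenthesized part holds "aaa", the whole match is "caaat", and the text before and after the match is "a " and "!".

```perl
use strict;
use warnings;

my ($group, $match, $before, $after);

$_ = "a caaat!";   # hypothetical sample string
if (/c(.*)t/) {
    # Copy the special variables right away; the next successful
    # match would overwrite them.
    ($group, $match, $before, $after) = ($1, $&, $`, $');
}

# A backreference inside the pattern: (.)\1 finds a doubled character.
$_ = "balloon";
my $doubled = /(.)\1/ ? $1 : "";

printf "group=<%s> match=<%s> before=<%s> after=<%s> doubled=<%s>\n",
    $group, $match, $before, $after, $doubled;
```

Copying $1 and its relatives into named variables immediately is a common precaution, since a later successful match replaces their contents.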

anchors

The pattern that is searched for in the string can be restricted to several specified locations, such as the beginnings and endings of words or the beginning and end of the string. \b indicates a word boundary. \B indicates any place but a word boundary. Caret ( ^ ) restricts the pattern to the beginning of the string. Dollar sign ( $ ) restricts it to the end of the string (or just before a newline that ends the string). If a literal dollar sign occurs in the pattern, mark it with a backslash.

Example:

/\bat/ # matches "at" and "attention", but not "bat"
/at\b/ # matches "at" and "bat", but not "attention"
/at\B/ # matches "attention" but not "at" and "bat"
/^at/ # matches "at $5.00, it is a bargain" but not "where you are at"
/at$/ # matches "where you are at" but not "at $5.00, it is a bargain"
/\$/ # matches "at $5.00, it is a bargain"
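The anchor examples can be checked with a short script. This is a sketch with made-up hash names; each for loop places one sample string at a time in $_, and the result of each test is recorded as 1 or 0.

```perl
use strict;
use warnings;

my (%starts_word, %ends_word, %inside_word);

# Word-boundary anchors, tried against three words.
for ("at", "bat", "attention") {
    $starts_word{$_} = /\bat/ ? 1 : 0;  # "at" begins a word
    $ends_word{$_}   = /at\b/ ? 1 : 0;  # "at" ends a word
    $inside_word{$_} = /at\B/ ? 1 : 0;  # "at" is followed by more word characters
    print "$_: starts=$starts_word{$_} ends=$ends_word{$_} ",
          "inside=$inside_word{$_}\n";
}

my (%at_start, %at_end);

# String anchors, tried against two phrases.
for ("at noon", "where you are at") {
    $at_start{$_} = /^at/ ? 1 : 0;
    $at_end{$_}   = /at$/ ? 1 : 0;
    print "$_: at_start=$at_start{$_} at_end=$at_end{$_}\n";
}
```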

variable interpolation

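Variables in a pattern are interpolated before the match is attempted. Although the dollar sign also marks the end of the string, as explained above, the two uses do not normally conflict: a dollar sign followed by a variable name is interpolated, while a dollar sign at the end of the pattern is treated as the anchor.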

Example:

$word = "cat";
/$word/ # matches strings that contain "cat"
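Interpolation and the end-of-string anchor can appear in the same pattern. The sketch below uses a made-up word list; grep is used only as a compact way to keep the elements for which the pattern matches.

```perl
use strict;
use warnings;

my $word  = "cat";
my @lines = ("concatenate", "dog", "cat");   # hypothetical sample data

# $word is interpolated, so this is the same as /cat/.
my @contains = grep { /$word/ } @lines;

# Here $word is interpolated and the final $ is the anchor,
# so this is the same as /^cat$/.
my @exact = grep { /^$word$/ } @lines;

print "contains: @contains\n";   # prints "contains: concatenate cat"
print "exact: @exact\n";         # prints "exact: cat"
```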

precedence

Know that it exists. Look it up in a text on Perl, if you like. Use parentheses!
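One case where precedence matters involves the multipliers described earlier: a multiplier applies only to the single character or category before it unless parentheses group a longer sequence. A small sketch, with made-up sample strings and hash names:

```perl
use strict;
use warnings;

my (%char_repeat, %group_repeat);

for ("abbb", "abab") {
    # Without parentheses, + applies to "b" alone: an "a", then b's.
    $char_repeat{$_}  = /^ab+$/   ? 1 : 0;
    # With parentheses, + applies to the whole sequence "ab".
    $group_repeat{$_} = /^(ab)+$/ ? 1 : 0;
    print "$_: ab+ = $char_repeat{$_}, (ab)+ = $group_repeat{$_}\n";
}
```

The first pattern accepts "abbb" but not "abab"; the second does the reverse.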

explicit target string

The ( =~ ) operator takes two arguments: a string on the left and a regular expression pattern on the right. Instead of searching in the string contained in the default variable, $_, the search is performed in the string specified on the left.

Example:

$a =~ /cat/ # does the content of $a contain "cat"?
<STDIN> =~ /cat/ # does the next line of input contain "cat"?

case

Case can be ignored in the search by placing an ( i ) immediately after the last delimiter.

Example:

/cat/i # matches "cat", "CAT", "Cat", etc.
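An explicit target string and the ( i ) option can be used together. The following sketch, with a made-up list of words, collects the words that contain "cat" with and without regard to case.

```perl
use strict;
use warnings;

my @animals = ("Cat", "CATALOG", "dog", "bobcat");   # hypothetical sample data
my (@exact_case, @any_case);

for my $animal (@animals) {
    # =~ applies each pattern to $animal instead of $_.
    push @exact_case, $animal if $animal =~ /cat/;
    push @any_case,   $animal if $animal =~ /cat/i;
}

print "exact case: @exact_case\n";   # prints "exact case: bobcat"
print "any case: @any_case\n";       # prints "any case: Cat CATALOG bobcat"
```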


4.2 Regular expression operators

Regular expression operators include a regular expression as an argument but instead of just looking for the pattern and returning a truth value, as in the examples above, they perform some action on the string, such as replacing the matched portion with a specified substring, like the well-known "search and replace" commands in word processing programs.

substitution

Looks for the specified pattern and replaces it with the specified string. By default, it does this for only the first occurrence found in the string. Appending a ( g ) to the end of the expression tells the operator to make the substitution for all occurrences.

Form:

s/pattern/replacement/
s/pattern/replacement/gi
$var =~ s/pattern/replacement/

In the second version, ( g ) and ( i ) indicate that the replacement should be made for all occurrences and that the match should ignore case. In the third version, the action is performed on the variable indicated -- $var -- instead of on the default variable, $_. Because the substitution changes the contents of $var in place, the ( =~ ) operator behaves here somewhat like the assignment operator, which is one way to remember the "equal" symbol in its form.

Examples:

s/cat/dog/ # replaces "cat" with "dog" in $_
s/cat/dog/gi # same thing, but applies to "CAT", "Cat" wherever they appear
$a =~ s/cat/dog/ # applies the operation to $a
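The three forms can be compared side by side. The sketch below uses made-up strings and variable names. It also shows two details not spelled out above: remembered segments ($1, $2) can be used in the replacement, and the operator returns the number of substitutions it made.

```perl
use strict;
use warnings;

my $original = "cat, Cat, CAT";   # hypothetical sample string

# Substitute in copies so $original stays intact.
(my $first_only = $original) =~ s/cat/dog/;    # first match, case-sensitive
(my $every_one  = $original) =~ s/cat/dog/gi;  # all matches, any case

# Remembered segments are available in the replacement as $1 and $2.
my $swapped = "hello world";
$swapped =~ s/(\w+) (\w+)/$2 $1/;

# The operator returns the number of substitutions made.
my $text  = "cat cat cat";
my $count = ($text =~ s/cat/dog/g);

print "$first_only\n";      # prints "dog, Cat, CAT"
print "$every_one\n";       # prints "dog, dog, dog"
print "$swapped\n";         # prints "world hello"
print "$count: $text\n";    # prints "3: dog dog dog"
```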

split( )

Split searches for a pattern in a specified string and, wherever it finds it, throws away the portion that matched and returns the substrings that lie before, between, and after the matches, as a list.

Form:

@var = split(/pattern/, string);
@var = split(/pattern/);

If no string is specified, the operator is applied to $_.

Examples:

@a = split(/cat/, $aString);
@a = split(/cat/);

In the first example, the contents of $aString are split on "cat" and the resulting parts are assigned to the array, @a. If "cat" occurs once, there are two parts; each additional occurrence adds another. In the second, the operator is applied to the contents of $_.
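A short sketch of split in use follows; the sample strings and array names are made up for illustration. It shows a literal pattern that occurs more than once, a category-based pattern, and the form that works on $_.

```perl
use strict;
use warnings;

# "cat" occurs twice, so three parts come back.
my @parts = split(/cat/, "onecattwocatthree");

# A pattern can use categories and multipliers: split on runs of whitespace.
my @words = split(/\s+/, "the quick   brown fox");

# With no string argument, split works on $_.
$_ = "red,green,blue";
my @colors = split(/,/);

print scalar(@parts), " parts: @parts\n";     # prints "3 parts: one two three"
print scalar(@words), " words: @words\n";     # prints "4 words: the quick brown fox"
print scalar(@colors), " colors: @colors\n";  # prints "3 colors: red green blue"
```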

join( )

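Approximately the opposite of split. Takes a separator string followed by a list of values, concatenates the values with the separator placed between each adjacent pair, and returns the resulting string. Unlike split, join takes an ordinary string, not a pattern, as its first argument.

Form:

$var = join(separator, $item1, $item2, . . .);

Example:

$a = join("", "cat", "dog", "bird"); # returns "catdogbird"
$a = join(", ", "cat", "dog", "bird"); # returns "cat, dog, bird"
$a = join($b, @c); # joins the elements of @c, separated by the content of $b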
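In Perl, the first argument to join is the separator that is placed between the remaining values. The sketch below, with a made-up record and variable names, splits a string on one separator and joins the pieces with another, and shows what happens when the first argument is mistaken for one of the items.

```perl
use strict;
use warnings;

# Hypothetical colon-separated record.
my @fields = split(/:/, "alice:x:1000");

# Rejoin the pieces with a different separator.
my $with_commas = join(",", @fields);

# An empty separator simply runs the items together.
my $run_together = join("", "cat", "dog", "bird");

# The first argument is always the separator, never an item.
my $sep_is_first = join("cat", "dog", "bird");

print "$with_commas\n";    # prints "alice,x,1000"
print "$run_together\n";   # prints "catdogbird"
print "$sep_is_first\n";   # prints "dogcatbird"
```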