Backreferences in Regular Expressions

We have seen basic expressions in which almost everything revolves around single characters of text. Even character class lists are really used to match a single position.

You can use parentheses to divide a long regular expression into smaller portions. Each portion then becomes a regexp on its own. This does not affect the way a string search is done. However, each subpattern can then be used in a replacement operation.

Subpatterns are numbered from 0 to 9. Subpattern 0 is reserved and represents the complete matched string. It is implicit and always available, even if the expression contains no parentheses. Explicit subpatterns are numbered from 1 to 9, counting their opening parentheses from left to right.
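The numbering can be checked with a short sketch. It uses Python's re module purely for illustration (an assumption, since the text above does not name a tool); Python calls subpatterns "groups" and does not stop at nine of them. The sample text and variable names are hypothetical.

```python
import re

# Hypothetical sample: two explicit subpatterns separated by a dash.
match = re.search(r"([0-9]+)-([0-9]+)", "pages 12-34")

whole = match.group(0)   # implicit subpattern 0: the complete matched text
first = match.group(1)   # first explicit subpattern, counted from the left
second = match.group(2)  # second explicit subpattern

# A pattern without any parentheses still provides subpattern 0.
plain = re.search(r"[0-9]+", "room 42").group(0)

print(whole, first, second, plain)
```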

Subpatterns can be referenced in a replacement string by using the escape character, a backslash, followed by the subpattern number. When the replacement is applied, each backreference is substituted with the text that its subpattern actually matched. A backreference can be used as many times as needed; every occurrence expands to the same matched text.
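As a minimal sketch of this substitution, assuming Python's re.sub as the replacement engine (the pattern, the sample text, and the variable names are made up for illustration), the first backreference is used twice in the same replacement string:

```python
import re

text = "width=10"

# \1 is the name before the equals sign, \2 the digits after it.
# Both backreferences appear twice in the replacement string.
result = re.sub(r"([a-z]+)=([0-9]+)", r"\1=\2 (\1 is \2)", text)

print(result)
```

The parentheses in the replacement string are ordinary characters; only the parentheses in the search pattern create subpatterns.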

Let’s say we have a file that contains a series of phone numbers. In North America (and possibly other countries), phone numbers contain a 3-digit area code followed by a 3-digit exchange number and, finally, a 4-digit individual number. Unfortunately, in our case, the phone numbers are just series of 10 numeric digits without separators. For example,

  1234567890
  1112224444
  9087374456

We would like a fast and easy way to format them so that the numbers are easier to read. To find all these strings, we can use the following regexp:

  ([0-9][0-9][0-9])([0-9][0-9][0-9])([0-9][0-9][0-9][0-9])  

We use the [0-9] character class to specify that we are expecting only numeric digits.

In this example there are three subpatterns in the regexp. Each subpattern is enclosed in a set of parentheses.

The first subpattern (\1) repeats the numeric character class 3 times. It represents the digits in the area code. The second subpattern (\2) also has the character class repeated 3 times. It represents the exchange number. The third and last subpattern (\3) repeats the character class 4 times. It represents the individual number.

Subpattern 0 (\0) represents all 10 digits.

To reformat this information, we can now combine backreferences with other characters to arrange the numbers any way we like. Let’s say we want to put the area code in parentheses, insert a space after it and insert a dash (-) between the exchange number and the individual number. We would use the following substitution string:

  (\1) \2-\3  

The list would appear as follows:

  (123) 456-7890
  (111) 222-4444
  (908) 737-4456
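The same search and replacement can be tried out in code. The following is a sketch that assumes Python's re module, which accepts the \1, \2 and \3 syntax shown above; the names PHONE, numbers and formatted are illustrative only.

```python
import re

# Three subpatterns: area code, exchange number, individual number.
PHONE = re.compile(
    r"([0-9][0-9][0-9])([0-9][0-9][0-9])([0-9][0-9][0-9][0-9])"
)

numbers = ["1234567890", "1112224444", "9087374456"]

# Apply the substitution string from the text to every number.
formatted = [PHONE.sub(r"(\1) \2-\3", number) for number in numbers]

for line in formatted:
    print(line)
```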

You can also use backreferences to reorder the text in lines. For example, if we want to reverse the telephone numbers (show the last four digits first, then the first three digits of the telephone number, followed by the area code) and end each line with the complete number, we could use

  Last: \3 Middle: \2 Area: \1 Complete: \0

This substitution string produces the following list (assuming that we started with the same data in this example and used the same regexp):

  Last: 7890 Middle: 456 Area: 123 Complete: 1234567890
  Last: 4444 Middle: 222 Area: 111 Complete: 1112224444
  Last: 4456 Middle: 737 Area: 908 Complete: 9087374456
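The reordering can be sketched the same way, again assuming Python's re module. One difference matters here: Python does not read \0 in a replacement string as a backreference, so the complete match is written as \g<0> instead. The names PHONE, TEMPLATE and reordered are illustrative.

```python
import re

PHONE = re.compile(
    r"([0-9][0-9][0-9])([0-9][0-9][0-9])([0-9][0-9][0-9][0-9])"
)

# \g<0> is Python's spelling for subpattern 0, the complete match.
TEMPLATE = r"Last: \3 Middle: \2 Area: \1 Complete: \g<0>"

numbers = ["1234567890", "1112224444", "9087374456"]
reordered = [PHONE.sub(TEMPLATE, number) for number in numbers]

for line in reordered:
    print(line)
```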
