Regular expressions

Regular expressions are a tool for specifying patterns (phone numbers, e-mail addresses, etc.) in text strings. We can use them to search strings for such patterns, modify strings based on search results, and so on.

Prelude: raw strings

In Python strings the backslash character \ has a special meaning. For example, \n denotes a new line, \t a tab, and \U followed by eight hexadecimal digits denotes the Unicode character with that code point:

[1]:
print("Hello\tthere,\nthis is a cat: \U0001F431.")
Hello   there,
this is a cat: 🐱.

This creates conflicts with the syntax of regular expressions, since in this setting we want backslashes to be interpreted literally, as backslashes. There are two ways to resolve this. The first is to enter each backslash as \\ - which indicates that we really mean a backslash:

[2]:
print("Hello\\tthere,\\nthis is a cat: \\U0001F431.")
Hello\tthere,\nthis is a cat: \U0001F431.

The second method is to precede the string with the r character to create a “raw string”, in which backslashes are treated like any other character:

[3]:
print(r"Hello\tthere,\nthis is a cat: \U0001F431.")
Hello\tthere,\nthis is a cat: \U0001F431.

While working with regular expressions this second method is usually more convenient, and we will use it below.
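As a small check of the two methods, the sketch below writes the same pattern both ways and compares the results. The variable names escaped and raw are chosen here only for illustration:

```python
# Two spellings of the same regular expression for a decimal number.
escaped = "\\d+\\.\\d+"   # ordinary string: every backslash is doubled
raw = r"\d+\.\d+"         # raw string: backslashes are kept as typed

print(escaped == raw)     # the two strings are identical

# "\n" is a single newline character, r"\n" is a backslash followed by "n"
print(len("\n"), len(r"\n"))
```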

Visualizing regular expressions

Regular expressions are a feature of many programming languages. In Python they are implemented by the re module:

[4]:
import re

We will use the following function to illustrate the syntax of regular expressions:

[1]:
import html
import re
from IPython.display import display, HTML

def re_show(regex, text="", flags=0):
    """
    Displays text with the regex match highlighted.
    """
    text_css = '''"border-style: none;
                   border-width: 0px;
                   padding: 0px;
                   font-size: 14px;
                   color: darkslategray;
                   background-color: white;
                   white-space: pre;
                   line-height: 20px;"
                   '''
    match_css = '''"padding: 0px 1px 0px 1px;
                    margin: 0px 0.5px 0px 0.5px;
                    border-style: solid;
                    border-width: 0.5px;
                    border-color: maroon;
                    background-color: cornsilk;
                    color: crimson;"
                    '''


    r = re.compile(f"({regex})", flags=flags)
    t = r.sub(fr'###START###\1###END###', text)
    t = html.escape(t)
    t = t.replace("###START###", f"<span style={match_css}>")
    t = t.replace("###END###", f"</span>")
    display(HTML(f'<code style={text_css}>{t}</code>'))

The first argument of this function is a regular expression. The second is a string in which we search for the pattern specified by the regular expression. The function prints the string with the pattern matches highlighted:

[6]:
text = "This is the course MTH 548 Data Oriented Computing!"
re_show(r"is", text) # search for all occurrences of "is"
This is the course MTH 548 Data Oriented Computing!
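The function re_show relies on IPython, so it is useful only inside a notebook. Outside of a notebook the same matches can be listed with re.finditer from the standard library. This is only a sketch of an alternative, not a replacement for re_show; the name matches is introduced here for illustration:

```python
import re

text = "This is the course MTH 548 Data Oriented Computing!"

# each match object knows where the match starts and ends
# and which part of the text was matched
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(r"is", text)]
print(matches)
```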

Character classes

As the above example shows, a regular expression can simply consist of a string we want to search for. The real power of regular expressions, however, is that they can contain special character sequences with a more general meaning.

Sequence    What it matches
.           Any character except the newline character.
\w          Any word character: a letter A-Z or a-z, a digit 0-9, or the underscore _.
\W          Any character which is not matched by \w.
\d          Any digit 0-9.
\D          Any character which is not matched by \d.
[...]       Any character listed inside the square brackets.
[^...]      Any character not listed inside the square brackets.
...|...     Either of the patterns on the two sides of the vertical bar.

Examples.

[7]:
# match any character followed by a "t"
re_show(r".t", text)
This is the course MTH 548 Data Oriented Computing!
[8]:
# match "i" followed by two arbitrary characters, and a non-word character:
re_show(r"i..\W", text)
This is the course MTH 548 Data Oriented Computing!
[9]:
# match two consecutive digits
re_show(r"\d\d", text)
This is the course MTH 548 Data Oriented Computing!
[10]:
# match either "D" or "d"
re_show(r"[Dd]", text)
This is the course MTH 548 Data Oriented Computing!
[11]:
# match sequences consisting of 4 characters,
# none of which is the space " " or "a":
re_show(r"[^ a][^ a][^ a][^ a]", text)
This is the course MTH 548 Data Oriented Computing!
[12]:
# match either "is" or "in"
re_show(r"is|in", text)
This is the course MTH 548 Data Oriented Computing!
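Since the highlighting produced by re_show is visible only in a notebook, it may help to see some of these matches as plain lists. The sketch below uses re.findall, which returns all non-overlapping matches of a pattern; the variable names are chosen here for illustration:

```python
import re

text = "This is the course MTH 548 Data Oriented Computing!"

two_digits = re.findall(r"\d\d", text)   # two consecutive digits
d_letters = re.findall(r"[Dd]", text)    # either "D" or "d"
is_or_in = re.findall(r"is|in", text)    # either "is" or "in"
before_t = re.findall(r".t", text)       # any character followed by "t"

print(two_digits)
print(d_letters)
print(is_or_in)
print(before_t)
```

Note that "548" contains only one match of \d\d: after "54" is matched, the remaining "8" is a single digit.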

Repetitions

In regular expressions we can specify in various ways how many times some pattern should repeat in a match:

Sequence    What it means
*           Match the preceding pattern 0 or more times, as many times as possible.
+           Match the preceding pattern 1 or more times, as many times as possible.
?           Match the preceding pattern 0 or 1 times, as many times as possible.
{n}         Match exactly n times.
{n,m}       Match as many times as possible, but at least n times, and no more than m times (written without spaces).
{n,}        Match as many times as possible, but at least n times.
{,m}        Match as many times as possible, but no more than m times.

Examples.

[13]:
# match sequences consisting of exactly 6 word characters
re_show(r"\w{6}", text)
This is the course MTH 548 Data Oriented Computing!
[14]:
# match all sequences consisting of 1 or more digits
re_show(r"\d+", text)
This is the course MTH 548 Data Oriented Computing!
[15]:
# match all sequences consisting of 0 or more digits
# notice that every empty sequence between two non-digit characters will match
re_show(r"\d*", text)
This is the course MTH 548 Data Oriented Computing!
[16]:
# match all sequences of at least 3 and no more than 5 word characters:
re_show(r"\w{3,5}", text)
This is the course MTH 548 Data Oriented Computing!
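The repetition examples can also be checked with re.findall. This is a minimal sketch; the variable names are introduced here for illustration:

```python
import re

text = "This is the course MTH 548 Data Oriented Computing!"

digits = re.findall(r"\d+", text)             # 1 or more digits
six_chars = re.findall(r"\w{6}", text)        # exactly 6 word characters
three_to_five = re.findall(r"\w{3,5}", text)  # between 3 and 5 word characters

print(digits)
print(six_chars)
print(three_to_five)
```

Matching is greedy and proceeds from left to right, so a long word such as "Oriented" is split into "Orien" and "ted" by the pattern \w{3,5}.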

Non-greedy matches

By default regular expression matches are greedy: they will match the longest possible part of a given string that fits the specified pattern. For example, the pattern r"D.* " will match the longest possible character sequence starting with the letter “D” and ending with a space:

[17]:
re_show(r"D.* ", text)
This is the course MTH 548 Data Oriented Computing!

The following sequences modify this behavior by specifying non-greedy matches:

Sequence    What it means
*?          Match the preceding pattern 0 or more times, as few times as possible.
+?          Match the preceding pattern 1 or more times, as few times as possible.
??          Match the preceding pattern 0 or 1 times, as few times as possible.
{n,m}?      Match as few times as possible, but at least n times and no more than m times.
{n,}?       Match as few times as possible, but at least n times.
{,m}?       Match as few times as possible, and no more than m times.

Examples.

[18]:
# match shortest possible sequences starting with "D"
# and ending with a space " "
re_show(r"D.*? ", text)
This is the course MTH 548 Data Oriented Computing!
[19]:
# match all sequences consisting of "i" followed by an "e"
# repeated 0 or 1 times, and ending with "n"
re_show(r"ie??n", text)
This is the course MTH 548 Data Oriented Computing!
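To see the difference between greedy and non-greedy matching without highlighting, we can compare the matched substrings directly. The sketch below uses re.search, which returns the first match; the names greedy and lazy are chosen here for illustration:

```python
import re

text = "This is the course MTH 548 Data Oriented Computing!"

# greedy: extends to the last space after "D"
greedy = re.search(r"D.* ", text).group()
# non-greedy: stops at the first space after "D"
lazy = re.search(r"D.*? ", text).group()

print(repr(greedy))
print(repr(lazy))
```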

Match groups

As the last example shows, by default the sequences *, *?, +, +? etc. apply only to the single symbol preceding them. For example, in the regular expression "ie??n" the sequence ?? applies only to the character e. In order to indicate that a repetition applies to the whole group ie we need to enclose this group in parentheses: (ie)?n. A part of a regular expression wrapped in parentheses is called a match group.

Examples.

[20]:
# match all sequences consisting of "ie" repeated 0 or 1 times
# (as many times as possible), followed by an "n"
re_show(r"(ie)?n", text)
This is the course MTH 548 Data Oriented Computing!
[21]:
# match all occurrences of "is " repeated at least once,
# and as many times as possible
re_show(r"(is )+", text)
This is the course MTH 548 Data Oriented Computing!
[22]:
# match a sequence of word characters followed by a space,
# and then by another word character sequence starting with either "c" or "C"
re_show(r"\w* (c|C)\w*", text)
This is the course MTH 548 Data Oriented Computing!

Compare the last example to one without grouping:

[23]:
# match either a word character sequence followed by " c"
# or a sequence starting with "C" followed by word characters
re_show(r"\w* c|C\w*", text)
This is the course MTH 548 Data Oriented Computing!
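Match groups are also useful for extracting parts of a match. The sketch below shows how the content of each group can be retrieved from a match object. The pattern (\w+) (\d+) and the variable names are assumptions made here for illustration and do not appear in the examples above:

```python
import re

text = "This is the course MTH 548 Data Oriented Computing!"

# a word followed by a space and a number, each in its own group
m = re.search(r"(\w+) (\d+)", text)
whole, word, number = m.group(0), m.group(1), m.group(2)
print(whole, word, number)

# when a pattern contains a group, re.findall returns
# the content of the group, not the whole match
groups_only = re.findall(r"\w* (c|C)\w*", text)
print(groups_only)

# re.finditer gives access to the whole match through group(0)
whole_matches = [m.group(0) for m in re.finditer(r"\w* (c|C)\w*", text)]
print(whole_matches)
```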

Anchors

Anchors are sequences which do not match any character, but rather a specific position in a string:

Sequence    What it means
^           Match the beginning of the string.
$           Match the end of the string.
\b          Match a word boundary, i.e. a position between a word character and a non-word character (or the beginning or end of the string next to a word character).
\B          Match a position which is not a word boundary.

[24]:
# match everything from the beginning of the string
# until the first occurrence of the letter "a"
re_show(r"^.*?a", text)
This is the course MTH 548 Data Oriented Computing!
[25]:
# match all word boundaries
re_show(r"\b", text)
This is the course MTH 548 Data Oriented Computing!
[26]:
# match sequences which start with an "h",
# end at a word boundary, and are as short as possible
re_show(r"h.*?\b", text)
This is the course MTH 548 Data Oriented Computing!
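Because anchors match positions rather than characters, they are usually combined with other patterns. A minimal sketch, with variable names chosen here for illustration:

```python
import re

text = "This is the course MTH 548 Data Oriented Computing!"

# a word character located right after a word boundary,
# i.e. the first character of each word
first_chars = re.findall(r"\b\w", text)
# word characters at the very beginning of the string
first_word = re.findall(r"^\w+", text)
# a non-word character at the very end of the string
last_char = re.findall(r"\W$", text)

print(first_chars)
print(first_word)
print(last_char)
```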

Flags

In addition to a regular expression many functions in the re module accept flags, which modify the meaning of the regular expression:

Flag        What it means
re.I        Ignore distinction between lower and upper case characters.
re.M        In a multiline string the symbols ^ and $ match the beginning and the end of a line.
re.S        The symbol . matches everything, including the newline character "\n".
re.X        The regular expression may contain comments (see below for details).

Examples. We will again use the function re_show, which accepts an additional flags argument.

[27]:
# a multiline text sample to experiment with
from textwrap import dedent
text2 = '''
       Twinkle, twinkle, little star,
       How I wonder what you are!
       Up above the world so high,
       Like a diamond in the sky.
       '''
text2 = dedent(text2).strip()
print(text2)
Twinkle, twinkle, little star,
How I wonder what you are!
Up above the world so high,
Like a diamond in the sky.
[28]:
# find the word "twinkle" in either upper or lower case
re_show(r"twinkle", text2, flags=re.I)
Twinkle, twinkle, little star,
How I wonder what you are!
Up above the world so high,
Like a diamond in the sky.
[29]:
# find a sequence starting with "star", ending with "!",
# and possibly including newline characters
re_show(r"star.*!", text2, flags=re.S)
Twinkle, twinkle, little star,
How I wonder what you are!
Up above the world so high,
Like a diamond in the sky.
[30]:
# find shortest possible sequences which start
# at the beginning of a line, contain at least one
# character, and end at a word boundary
re_show(r"^.+?\b", text2, flags=re.M)
Twinkle, twinkle, little star,
How I wonder what you are!
Up above the world so high,
Like a diamond in the sky.

Note. Flags can be combined using the vertical bar | character:

[31]:
# re.I and re.S combined
re_show(r"twinkle.*!", text2, flags=re.I|re.S)
Twinkle, twinkle, little star,
How I wonder what you are!
Up above the world so high,
Like a diamond in the sky.
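The same flags can be passed to functions of the re module such as re.findall and re.search. The sketch below compares results with and without a flag; the variable names are introduced here for illustration:

```python
import re

text2 = ("Twinkle, twinkle, little star,\n"
         "How I wonder what you are!\n"
         "Up above the world so high,\n"
         "Like a diamond in the sky.")

# re.I: both spellings of "twinkle" are found
both_cases = re.findall(r"twinkle", text2, flags=re.I)

# without re.M the symbol ^ matches only the beginning of the string,
# with re.M it matches the beginning of every line
first_word = re.findall(r"^\w+", text2)
line_starts = re.findall(r"^\w+", text2, flags=re.M)

# without re.S the symbol . does not match "\n", so there is no match
no_match = re.search(r"star.*!", text2)
across_lines = re.search(r"star.*!", text2, flags=re.S).group()

print(both_cases)
print(first_word, line_starts)
print(no_match, repr(across_lines))
```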

Inserting comments

More complex regular expressions can look very cryptic and be difficult to understand. To make them more readable, it is possible to insert comments explaining what they match piece by piece. The flag re.X signals that a regular expression may contain comments. If this flag is used it has the following effects:

  • All space and newline characters in the regular expression are ignored.

  • If a line in the regular expression contains the # character then this character and the remainder of the line are ignored.

  • Exception: the space and # characters are matched if they are preceded by a backslash (a backslash followed by a space, or \#), or if they are entered as a part of a character class, i.e. enclosed in square brackets: [ #] or [^ #].

[32]:
text5 = "The solution is x=5.71, y = -13.2, and  z=0."
# match numbers in the decimal form
re_show(r"""-?          # possibly the minus sign
            \d+         # digits before the decimal point
            (\.\d+)?    # possibly the decimal point and more digits
            """, text5, re.X)
The solution is x=5.71, y = -13.2, and z=0.
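A commented regular expression can also be compiled once with re.compile and then reused. The sketch below extracts the numbers from the sample text and converts them to floats. It uses re.finditer and group(0) to obtain whole matches, since the pattern contains a match group; the names number_re, numbers and values are chosen here for illustration:

```python
import re

text5 = "The solution is x=5.71, y = -13.2, and  z=0."

number_re = re.compile(r"""
    -?          # possibly the minus sign
    \d+         # digits before the decimal point
    (\.\d+)?    # possibly the decimal point and more digits
    """, re.X)

# group(0) is the whole match, regardless of the groups in the pattern
numbers = [m.group(0) for m in number_re.finditer(text5)]
values = [float(s) for s in numbers]

print(numbers)
print(values)
```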

Matching special characters

As we have seen above, several characters (., +, * etc.) have special meaning when used in a regular expression. To match such characters literally, we precede them by a backslash \, so they become \., \+, \* and so on. The backslash itself is matched by entering \\.

Example.

[33]:
text3 = r"*** \hello\ ***"
# match the sequence "***"
re_show(r"\*{3}", text3)
*** \hello\ ***
[34]:
# match a sequence which starts and ends
# with a backslash "\"
re_show(r"\\.*\\", text3)
*** \hello\ ***
[35]:
# match a sequence enclosed in parentheses "(...)"
text4 = r"¯\_(ツ)_/¯"
re_show(r"\(.*\)", text4)
¯\_(ツ)_/¯
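When the text to be matched literally is not known in advance, the backslashes can be added automatically with the function re.escape from the re module. This function is not used elsewhere in this section; the sketch below is only meant to show the idea, and the names compiled and pattern are chosen for illustration:

```python
import re

text3 = r"*** \hello\ ***"

# an unescaped "*" at the start of a pattern has nothing to repeat,
# so compiling this pattern fails
try:
    re.compile("***")
    compiled = True
except re.error:
    compiled = False
print(compiled)

# re.escape precedes each special character with a backslash
pattern = re.escape("***")
print(pattern)
print(re.findall(pattern, text3))
```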