Functions in the re module

The Python re module has several functions for searching and modifying strings using regular expressions. Here we describe two of them: re.findall and re.sub. See the re module documentation for the complete list.

[1]:
import re

re.findall

re.findall(pattern, string, flags=0)

This function returns a list of all non-overlapping matches of the pattern in the string, in the order in which they are found. The third argument, flags, can be used to specify flags for the regular expression (for example re.IGNORECASE).

For example, here we find all sequences of digits in a string:

[2]:
text1 = "This costs $57 for a 100 lbs box, so $171 for 3 boxes."
re.findall(r"\d+", text1)
[2]:
['57', '100', '171', '3']
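
The effect of the flags argument can be illustrated with a small sketch. The sample string below is made up for this illustration; re.IGNORECASE is one of the standard flags of the re module.

```python
import re

# hypothetical sample text, not taken from the examples above
text = "Total: 12 LBS, or 5 lbs per box and 2 Lbs of padding."

# without flags the match is case-sensitive
case_sensitive = re.findall(r"\d+ lbs", text)

# re.IGNORECASE (short form: re.I) lets letters match in either case
case_insensitive = re.findall(r"\d+ lbs", text, flags=re.IGNORECASE)

print(case_sensitive)    # ['5 lbs']
print(case_insensitive)  # ['12 LBS', '5 lbs', '2 Lbs']
```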

re.findall with match groups

In some cases we are interested only in a part of a match. For example, we may want to find all dollar amounts in the format “$57”, but we are interested in the numeric value “57” only. Such situations can be handled using match groups (i.e. parts of a regular expression enclosed in parentheses). If we create a match group in the pattern, then the whole pattern will be matched, but only the value of the match group will be returned:

[3]:
# search for sequences of digits starting with "$",
# but return digits only
re.findall(r"\$(\d+)", text1)
[3]:
['57', '171']

If the pattern includes more than one match group, re.findall will return a list of tuples with values of the match groups:

[4]:
text2 = "This class starts at 9:30, and ends at 10:15"
# find all tuples in the form (hours, minutes)
re.findall(r"(\d+):(\d+)", text2)
[4]:
[('9', '30'), ('10', '15')]
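
Since every match is returned as a tuple of strings, the values usually need to be converted before doing arithmetic on them. A minimal sketch, where the conversion to minutes after midnight is just an illustrative choice:

```python
import re

text = "This class starts at 9:30, and ends at 10:15"

# each match is a tuple of strings: (hours, minutes)
times = re.findall(r"(\d+):(\d+)", text)

# unpack every tuple and convert it to minutes after midnight
minutes = [int(h) * 60 + int(m) for h, m in times]

print(times)    # [('9', '30'), ('10', '15')]
print(minutes)  # [570, 615]
```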

Non-capturing match groups

It often happens that we need to create a match group in a regular expression only to specify what should be matched, and not because we want to retrieve its value. Such match groups can be specified using the format (?:...), and then their values will not be returned by re.findall.

[12]:
# string with a flight itinerary
from textwrap import dedent
text3 = """
        BUF 11:30 PM =>=>=>=> EWR 12:45 PM
        EWR 7:45 PM =>=>=>=>=> LHR 6:55 AM
        """
text3 = dedent(text3).strip()
print(text3)
BUF 11:30 PM =>=>=>=> EWR 12:45 PM
EWR 7:45 PM =>=>=>=>=> LHR 6:55 AM
[14]:
# find all flight arrivals and departures
re.findall(r"""(.+?)        # match departure
               \ (?:=>)+ \  # match, but not capture, the =>=> part
               (.+)         # match arrival
            """,
           text3, re.X)
[14]:
[('BUF 11:30 PM', 'EWR 12:45 PM'), ('EWR 7:45 PM', 'LHR 6:55 AM')]
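
The difference between the two kinds of groups is easiest to see by running the same pattern both ways. The sketch below uses a made-up string; note that when a capturing group is repeated, only its last repetition is kept as the group's value.

```python
import re

# hypothetical sample text
text = "ab abab c ababab"

# capturing group: findall returns only the value of the group
# (for a repeated group this is its last repetition)
captured = re.findall(r"\b(ab)+\b", text)

# non-capturing group: the pattern has no groups,
# so findall returns the whole matches
whole = re.findall(r"\b(?:ab)+\b", text)

print(captured)  # ['ab', 'ab', 'ab']
print(whole)     # ['ab', 'abab', 'ababab']
```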

re.sub

re.sub(pattern, repl, string, count=0, flags=0)

This function finds non-overlapping matches of the pattern in the string, replaces them with the repl string, and returns the resulting new string (the original string is not modified). The count argument specifies the maximum number of replacements to be performed. The default value count=0 means that all matches will be replaced. The flags argument can specify regular expression flags.

We will use the following function to illustrate the effects of re.sub:

[11]:
def show_diff(original, modified):
    print(f"\033[1mORIGINAL STRING:\033[0m\n{original}\n")
    print(f"\033[1mMODIFIED STRING:\033[0m\n{modified}")


show_diff("This is the original text.", "This is the modified text.")
ORIGINAL STRING:
This is the original text.

MODIFIED STRING:
This is the modified text.

Here is a basic application of re.sub:

[5]:
text4 = "This costs $57 for a 100 lbs box, so $171 for 3 boxes."

# replace all sequences of digits by the string "(NUMBER)"
new_text = re.sub(r"\d+", r"(NUMBER)", text4)

# print results
show_diff(text4, new_text)
ORIGINAL STRING:
This costs $57 for a 100 lbs box, so $171 for 3 boxes.

MODIFIED STRING:
This costs $(NUMBER) for a (NUMBER) lbs box, so $(NUMBER) for (NUMBER) boxes.
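
The count argument of re.sub limits how many matches, counted from the left, are replaced. A short sketch using the same string as above (the variable names are chosen only for this illustration):

```python
import re

text = "This costs $57 for a 100 lbs box, so $171 for 3 boxes."

# replace only the first match
first_only = re.sub(r"\d+", "(NUMBER)", text, count=1)

# replace the first two matches
first_two = re.sub(r"\d+", "(NUMBER)", text, count=2)

print(first_only)  # This costs $(NUMBER) for a 100 lbs box, so $171 for 3 boxes.
print(first_two)   # This costs $(NUMBER) for a (NUMBER) lbs box, so $171 for 3 boxes.
```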

re.sub with backreferences

The function re.sub is more flexible than the example above suggests, since the replacement string repl can depend on the text being replaced. To make use of this, we need to specify one or more match groups in the pattern. Each capturing match group creates a backreference, i.e. a label \1, \2, \3 etc., numbered from left to right by the position of the group's opening parenthesis (so \1 denotes the leftmost match group). When these labels are used in the replacement string, they are themselves replaced by the values of the corresponding match groups.

Examples.

[8]:
text4 = "This costs $57 for a 100 lbs box, so $171 for 3 boxes."

# add decimal digits ".00" to all prices
new_text = re.sub(r"""(\$\d+)    # match "$" followed by digits
                                 # the whole pattern is a match group
                                 # with backreference \1
                   """,
                  r"\1.00", text4, flags=re.X)

# print results
show_diff(text4, new_text)
ORIGINAL STRING:
This costs $57 for a 100 lbs box, so $171 for 3 boxes.

MODIFIED STRING:
This costs $57.00 for a 100 lbs box, so $171.00 for 3 boxes.
[9]:
text5 = "Flight itinerary:\nBUF => EWR =>=>=> LHR"

# reformat the itinerary
new_text = re.sub(r"""(.+?)           # \1 first airport
                      \ (?:=>)+\      #    the =>=> part
                      (.+)            # \2 second airport
                      \ (?:=>)+\      #    the =>=> part
                      (.+)            # \3 third airport
                      """,
                  r"From: \1 To: \2\nFrom: \2 To: \3",
                  text5, flags=re.X)

# print results
show_diff(text5, new_text)
ORIGINAL STRING:
Flight itinerary:
BUF => EWR =>=>=> LHR

MODIFIED STRING:
Flight itinerary:
From: BUF To: EWR
From: EWR To: LHR
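
As a side note not covered in the examples above, the re module also accepts backreferences written as \g<1>, \g<2> etc., and match groups can be given names with (?P<name>...) and then referenced as \g<name>. The sketch below uses made-up strings; the \g<1> form is useful when the backreference must be followed directly by a digit.

```python
import re

# hypothetical sample text
text = "Due date: 2024-03-15"

# named groups: (?P<name>...) in the pattern, \g<name> in the replacement
us_style = re.sub(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})",
                  r"\g<month>/\g<day>/\g<year>", text)
print(us_style)  # Due date: 03/15/2024

# \g<1> refers to the same group as \1, but it can be followed by a digit
cents = re.sub(r"\$(\d+)", r"\g<1>00 cents", "A coffee costs $3")
print(cents)  # A coffee costs 300 cents
```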

re.sub with a function argument

Even more can be accomplished using re.sub: instead of a string, the argument repl can be a function. For each match of the pattern this function is called with a match object as its argument; the match object contains information about the match (its position in the string, values of match groups etc.). The function must return a string, which will be substituted for the matched text. For more information see the documentation of the re module.

Example.

The code below replaces distances given in miles by distances in kilometers.

[10]:
from textwrap import dedent
text6 = """
        Buffalo is 370 miles away from New York City
        (about 6 hours by car) and 100 miles from Toronto.
        """
text6 = dedent(text6).strip()


def convert(match):

    # match.group(1) is the value of the first match group
    # in our case it will be a number with a distance in miles
    # 1 mile = 1.609344 km; round to the nearest kilometer
    d_km = round(float(match.group(1)) * 1.609344)
    # return a string with the distance in km
    return f"{d_km} km"


# find distances in miles and replace them
# by values of the `convert` function
new_text = re.sub(r"(\d+) miles", convert, text6)

# print results
show_diff(text6, new_text)
ORIGINAL STRING:
Buffalo is 370 miles away from New York City
(about 6 hours by car) and 100 miles from Toronto.

MODIFIED STRING:
Buffalo is 595 km away from New York City
(about 6 hours by car) and 161 km from Toronto.
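
For short transformations the replacement function can also be written inline as a lambda. A minimal sketch with a made-up string; here match.group(0) denotes the whole match, as opposed to match.group(1), which is the first match group.

```python
import re

# hypothetical sample text
text = "3 apples and 12 pears"


def double(match):
    # match.group(0) is the whole matched text, here a sequence of digits
    return str(int(match.group(0)) * 2)


# named function as the replacement
doubled = re.sub(r"\d+", double, text)

# the same replacement written as a lambda
doubled_lambda = re.sub(r"\d+", lambda m: str(int(m.group(0)) * 2), text)

print(doubled)         # 6 apples and 24 pears
print(doubled_lambda)  # 6 apples and 24 pears
```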