Regular Expressions

Conrad Huang

April 11, 2014

Portions Copyright © 2005-06 Python Software Foundation.

Introduction

  • How to count the blank lines in a file?
    • Most people consider a line with just spaces and tabs to be blank
    • But examining characters one by one is tedious
    • More complex patterns (like telephone numbers or email addresses) are hard to describe in code
  • Use regular expressions (REs) instead
    • Represent patterns as strings
    • Just like the * in the shell's *.txt
  • Warning: the notation is ugly
    • Have to use what's on the keyboard, instead of inventing new symbols the way mathematicians do

A Simple Example

  • The simplest kind of RE matches a fixed string of characters
    • Similar to the in operator
  • import re
    
    dragons = [
    ['CTAGGTGTACTGATG',    'Antipodean Opaleye'],
    ['AAGATGCGTCCGTAT',    'Common Welsh Green'],
    ['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
    ['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
    ['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
    ]
    
    for (dna, name) in dragons:
        if re.search('ATGCGT', dna):
            print name
    Common Welsh Green
    Hungarian Horntail

This or That

  • Modify the regular expression a little
    import re
    
    dragons = [
    ['CTAGGTGTACTGATG',    'Antipodean Opaleye'],
    ['AAGATGCGTCCGTAT',    'Common Welsh Green'],
    ['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
    ['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
    ['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
    ]
    
    for (dna, name) in dragons:
        if re.search('ATGCGT|GCT', dna):
            print name
    Common Welsh Green
    Hebridean Black
    Hungarian Horntail
    Norwegian Ridgeback
  • The vertical bar | means “or”
    • So this RE matches any string containing either "ATGCGT" or "GCT"

Precedence

  • What about matching either "ATA" or "ATC" (both of which code for isoleucine)?
    • ATA|C will not work: it matches either "ATA" or "C"
    • ATA|ATC will work, but it's a bit redundant
  • Solution: use parentheses, just as in math
    import re
    
    tests = [
    ['ATA',   True],
    ['xATCx', True],
    ['ATG',   False],
    ['AT',    False],
    ['ATAC',  True]
    ]
    
    for (dna, expected) in tests:
        actual = re.search('AT(A|C)', dna) is not None
        assert actual == expected
    • Note that there's no output: the assert statement will crash the program if any of the tests fail

Escaping Special Characters

[Double Compilation of Regular Expressions]
  • How to match an actual "|", "(", or ")"?
  • Solution is to use \|, \(, or \) in the RE
    • And of course \\ to match a backslash
  • But in order to put a backslash in a Python string, you have to escape it
    • So the written form of the RE is "\\|", "\\(", "\\)", or "\\\\"
  • What you type in is being compiled twice:
    • Once by Python to create a string
    • Once by the regular expression library to create the RE
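A minimal sketch of the double compilation described above, using ordinary (non-raw) string literals. The sample texts and the variable names `pattern` and `backslash` are made up for illustration.

```python
import re

# Step 1: Python turns the literal "\\|" into the two-character string \|
pattern = "\\|"
assert len(pattern) == 2

# Step 2: the RE library reads \| as "a literal vertical bar"
assert re.search(pattern, "a|b")
assert not re.search(pattern, "ab")

# Matching one backslash takes four in the source:
# "\\\\" -> the string \\ -> an RE for one literal backslash
backslash = "\\\\"
assert len(backslash) == 2
assert re.search(backslash, "C:\\temp")      # this text contains one backslash
assert not re.search(backslash, "C:/temp")
```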

Raw Strings

  • To help keep things readable, Python supports raw strings
    • Written as r'abc' or r"this\nand\nthat"
    • Inside a raw string, a backslash is just a backslash
    • So r'\n' is a string containing the two characters "\" and "n", not a newline
  • Raw strings are not automatically converted into REs
    • But that is their most common use
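The points above can be checked directly. This is a small sketch; the sample texts and the name `paren` are invented for illustration.

```python
import re

# A raw string keeps the backslash: two characters, not a newline
assert len(r'\n') == 2
assert len('\n') == 1

# The raw and the escaped spellings produce the same string, hence the same RE
assert r'\(' == '\\('
assert r'\\' == '\\\\'

# A raw string is an ordinary string; it only acts as an RE when handed to re
paren = r'\('
assert re.search(paren, 'f(x)')
assert not re.search(paren, 'fx')
```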

Sequences

[Zero or More]
  • In the shell, "*" matches zero or more characters
  • In an RE, * is an operator that means, “match zero or more occurrences of a pattern”
    • Comes after the pattern, not before

Sequences (cont.)

  • Example: match any strand of DNA in which "TTA" and "CTA" are separated by any number of "G"
    import re
    
    tests = [
    ['TTACTA',    True],  # separated by zero G's
    ['TTAGCTA',   True],  # separated by one G
    ['TTAGGGCTA', True],  # separated by three G's
    ['TTAXCTA',   False], # an X in the way
    ['TTAGCGCTA', False], # an embedded C in the way
    ]
    
    for (dna, expected) in tests:
        actual = re.search('TTAG*CTA', dna) is not None
        assert actual == expected
    • Note that the RE matches "TTACTA" because G* can match zero occurrences of "G"

Sequences (cont.)

  • + matches one or more (i.e., won't match the empty string)
  • assert re.search('TTAG*CTA', 'TTACTA')
    assert not re.search('TTAG+CTA', 'TTACTA')
[One or More]

Making Something Optional

  • The ? operator means “optional”
    • i.e., zero or one occurrences, but no more
  • assert re.search('AC?T', 'AT')
    assert re.search('AC?T', 'ACT')
    assert not re.search('AC?T', 'ACCT')
[Zero or One]

Character Sets

  • Use [] to match sets of characters
    • The expression [abcd] matches exactly one "a", "b", "c", or "d"
    • Can be abbreviated as [a-d]
  • Often combined with *, +, or ?
    • [aeiou]+ matches any non-empty sequence of vowels
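A short sketch of both bullets, with made-up test strings. It uses the group() method of the match object (covered under Match Objects) to show the text that matched.

```python
import re

# [abcd] and [a-d] describe the same set of single characters
for ch in 'abcd':
    assert re.search('[abcd]', ch) and re.search('[a-d]', ch)
assert not re.search('[a-d]', 'xyz')

# [aeiou]+ grabs a whole run of vowels, starting at the first vowel found
run = re.search('[aeiou]+', 'queueing')
assert run.group() == 'ueuei'
```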

Character Sets (cont.)

  • Example: find lines containing numbers
    import re
    
    lines = [
        "Charles Darwin (1809-82)",
        "Darwin's principal works, The Origin of Species (1859)",
        "and The Descent of Man (1871) marked a new epoch in our",
        "understanding of our world and ourselves.  His ideas",
        "were shaped by the Beagle's voyage around the world in",
        "1831-36."
    ]
    
    for line in lines:
        if re.search('[0-9]+', line):
            print line
    Charles Darwin (1809-82)
    Darwin's principal works, The Origin of Species (1859)
    and The Descent of Man (1871) marked a new epoch in our
    1831-36.
    • Try writing this without using regular expressions…

Abbreviations

  • Some character sets occur so often that they have abbreviations
    Sequence  Equivalent      Explanation
    \d        [0-9]           Digits
    \s        [ \t\n\r\f\v]   Whitespace
    \w        [a-zA-Z0-9_]    Word characters (i.e., those allowed in variable names)
    Regular Expression Escapes in Python
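A quick check of the three abbreviations, as a sketch with invented sample strings (one is borrowed from the Darwin example).

```python
import re

# \d+ is shorthand for [0-9]+
assert re.search(r'\d+', 'Origin of Species (1859)')
assert not re.search(r'\d+', 'understanding of our world')

# \s matches spaces, tabs, and newlines alike
for gap in [' ', '\t', '\n']:
    assert re.search(r'a\sb', 'a' + gap + 'b')
assert not re.search(r'a\sb', 'ab')

# \w+ matches a run of letters, digits, and underscores
assert re.search(r'\w+', 'max_temp2 = 0').group() == 'max_temp2'
```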

Special Cases

[Word/Non-Word Breaks]
  • [^abc] means “anything except the characters in this set”
  • . means “any character except the end of line”
    • Equivalent to [^\n]
  • \b matches the break between word and non-word characters
    • Doesn't consume any actual characters
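The three special cases can be tried out as follows. This is a sketch; the test strings are made up.

```python
import re

# [^abc] matches one character that is NOT a, b, or c
assert re.search('[^abc]', 'abcd')          # the d qualifies
assert not re.search('[^abc]', 'cabbab')    # only a, b, and c present

# . matches any single character except a newline
assert re.search('a.c', 'abc')
assert re.search('a.c', 'a-c')
assert not re.search('a.c', 'a\nc')

# \b matches a position, not a character
assert re.search(r'\bcat\b', 'the cat sat')
assert not re.search(r'\bcat\b', 'concatenate')
assert re.search(r'\bcat\b', 'the cat sat').group() == 'cat'   # no spaces consumed
```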

Special Cases (cont.)

  • Example: find words that end in a vowel
    • Use string.split to break on whitespace before applying RE
    • import re
      
      words = '''Born in New York City in 1918, Richard Feynman earned a
      bachelor's degree at MIT in 1939, and a doctorate from Princeton in
      1942. After working on the Manhattan Project in Los Alamos during
      World War II, he became a professor at CalTech in 1951.  Feynman won
      the 1965 Nobel Prize in Physics for his work on quantum
      electrodynamics, and served on the commission investigating the
      Challenger disaster in 1986.'''.split()
      
      end_in_vowel = set()
      for w in words:
          if re.search(r'[aeiou]\b', w):
              end_in_vowel.add(w)
      for w in end_in_vowel:
          print w
      a
      Prize
      degree
      became
      doctorate
      the
      he

Anchoring

[Anchoring Matches]
  • How to find blank lines?
    • re.search(r'\s*', line) will match "start end", because \s* can match zero characters
  • Use anchors
    • ^ matches the beginning of the string
    • $ matches the end
    • Neither consumes any characters
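With anchors, the blank-line problem from the introduction becomes short. The sketch below assumes the lines are already in a list; `count_blank` and the sample data are hypothetical names and values, not part of the original slides.

```python
import re

def count_blank(lines):
    """Count the lines that contain nothing but whitespace."""
    # ^ and $ force \s* to cover the entire line
    return sum(1 for line in lines if re.search(r'^\s*$', line))

sample = ['start end', '', '   ', '\t \t', 'x', ' y ']   # made-up data
assert count_blank(sample) == 3

# Without anchors, \s* matches zero characters at the start of any line
assert re.search(r'\s*', 'start end')
assert not re.search(r'^\s*$', 'start end')
```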

Anchoring (cont.)

  • Examples:
    Pattern  Text     Result
    b+       "abbc"   Matches
    ^b+      "abbc"   Fails (string doesn't start with b)
    c$       "abbc"   Matches (string ends with c)
    ^a*$     "aabaa"  Fails (something other than "a" between start and end of string)
    Regular Expression Anchors in Python
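The rows of the table can be written as assertions in the style used earlier. The last case ("aaaa") is an extra one added here for contrast; it is not in the table.

```python
import re

cases = [
    ('b+',   'abbc',  True),
    ('^b+',  'abbc',  False),   # string doesn't start with b
    ('c$',   'abbc',  True),    # string ends with c
    ('^a*$', 'aabaa', False),   # a "b" sits between start and end
    ('^a*$', 'aaaa',  True),    # extra case: nothing but "a"
]

for (pattern, text, expected) in cases:
    actual = re.search(pattern, text) is not None
    assert actual == expected
```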

Extracting Matches

  • Problem: want to find comments in a data file
    • A comment starts with a "#", and extends to the end of the line
  • First try: If the RE matches, split on the "#"
    import re
    
    lines = '''Date: 2006-03-07
    On duty: HP # 01:30 - 03:00
    Observed: Common Welsh Green
    On duty: RW #03:00-04:30
    Observed: none
    On duty: HG # 04:30-06:00
    Observed: Hebridean Black
    '''.split('\n')
    
    for line in lines:
        if re.search('#', line):
            comment = line.split('#')[1]
            print comment
     01:30 - 03:00
    03:00-04:30
     04:30-06:00
    • Output is inconsistent
    • split followed by strip seems clumsy

Match Objects

  • Result of re.search is actually a match object that records what matched, and where
    • mo.group() returns the whole string that matched the RE
    • mo.start() and mo.end() are the indices of the match's location
    • import re
      
      text = 'abbcb'
      for pattern in ['b+', 'bc*', 'b+c+']:
          mo = re.search(pattern, text)
          print '%s / %s => "%s" (%d, %d)' % (pattern, text, mo.group(), mo.start(), mo.end())
      b+ / abbcb => "bb" (1, 3)
      bc* / abbcb => "b" (1, 2)
      b+c+ / abbcb => "bbc" (1, 4)

Match Groups

  • Every parenthesized subexpression in the RE is a group
    • Group 0 is the entire match
    • Text that matched the Nth parenthesized subexpression (counting opening parentheses from the left) is group N
    • mo.group(3) is the text that matched the third subexpression, mo.start(3) is where it started
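A small sketch of group numbering, using the date line from the comment-extraction example. The three-group pattern is an illustration chosen here, not one from the slides.

```python
import re

text = 'Date: 2006-03-07'
mo = re.search(r'(\d+)-(\d+)-(\d+)', text)

assert mo.group(0) == '2006-03-07'   # group 0: the entire match
assert mo.group(1) == '2006'         # first pair of parentheses
assert mo.group(2) == '03'
assert mo.group(3) == '07'

# start(N) and end(N) give the location of group N in the original text
assert (mo.start(3), mo.end(3)) == (14, 16)
assert text[mo.start(3):mo.end(3)] == '07'
```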

Match Groups (cont.)

  • Extracting comments is now easy:
    import re
    
    lines = '''Date: 2006-03-07
    On duty: HP # 01:30 - 03:00
    Observed: Common Welsh Green
    On duty: RW #03:00-04:30
    Observed: none
    On duty: HG # 04:30-06:00
    Observed: Hebridean Black
    '''.split('\n')
    
    for line in lines:
        match = re.search(r'#\s*(.+)', line)
        if match:
            comment = match.group(1)
            print comment
    01:30 - 03:00
    03:00-04:30
    04:30-06:00

Reversing Columns

  • REs are the power tools of text processing
    • Can do things in one line that would otherwise take many lines of code
  • Example: reverse two-column data
    import re
    
    def reverse_columns(line):
        match = re.search(r'^\s*(\d+)\s+(\d+)\s*$', line)
        if not match:
            return line
        return match.group(2) + ' ' + match.group(1)
    
    tests = [
        ['10 20',    'easy case'],
        [' 30  40 ', 'padding'],
        ['60 70 80', 'too many columns'],
        ['90 end',   'non-numeric']
    ]
    
    for (fixture, title) in tests:
        actual = reverse_columns(fixture)
        print '%s: "%s" => "%s"' % (title, fixture, actual)
    easy case: "10 20" => "20 10"
    padding: " 30  40 " => "40 30"
    too many columns: "60 70 80" => "60 70 80"
    non-numeric: "90 end" => "90 end"
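One possible shorter variant, offered as a sketch rather than as the lecture's method: re.sub (listed in the reference tables) accepts \1 and \2 in the replacement text to refer to groups, a feature these slides do not cover. The name `reverse_columns_sub` is hypothetical.

```python
import re

def reverse_columns_sub(line):
    # \2 and \1 in the replacement stand for the second and first groups;
    # a line that doesn't match the pattern is returned unchanged
    return re.sub(r'^\s*(\d+)\s+(\d+)\s*$', r'\2 \1', line)

assert reverse_columns_sub('10 20') == '20 10'
assert reverse_columns_sub(' 30  40 ') == '40 30'
assert reverse_columns_sub('60 70 80') == '60 70 80'
assert reverse_columns_sub('90 end') == '90 end'
```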

Compiling

[Regular Expressions as Finite State Machines]
  • The RE library compiles patterns into a more concise form for matching
    • Each regular expression becomes a finite state machine
    • Library follows the arcs in the FSM as it reads characters
    • Drawing FSMs is a good way to debug REs
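To make the idea concrete, here is a hand-built state machine for the earlier pattern TTAG*CTA, checked against the re module. It illustrates the concept only and is not a description of how the re module works internally; `transitions`, `ACCEPT`, and `fsm_accepts` are hypothetical names, and the machine tests whether the whole string matches.

```python
import re

# transitions[state][character] -> next state
transitions = {
    0: {'T': 1},
    1: {'T': 2},
    2: {'A': 3},
    3: {'G': 3, 'C': 4},   # the loop on G implements G*
    4: {'T': 5},
    5: {'A': 6},
}
ACCEPT = 6

def fsm_accepts(text):
    """Return True if the whole of text is matched by TTAG*CTA."""
    state = 0
    for ch in text:
        state = transitions.get(state, {}).get(ch)
        if state is None:      # no arc for this character: reject
            return False
    return state == ACCEPT

# The machine agrees with the anchored RE on every sample
for dna in ['TTACTA', 'TTAGCTA', 'TTAGGGCTA', 'TTAXCTA', 'TTAGCGCTA', 'TTA']:
    expected = re.search(r'^TTAG*CTA$', dna) is not None
    assert fsm_accepts(dna) == expected
```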

Compiling (cont.)

  • You can improve a program's performance by compiling the RE once, and re-using the compiled form
    • Use re.compile(pattern) to get the compiled RE
    • Its methods have the same names and behavior as the functions in the re module
    • E.g., matcher.search(text) searches text for matches to the RE that was compiled to create matcher
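A minimal sketch of compile-once, use-many-times, reusing the dragon data from the earlier slides. The names `matcher` and `hits` are chosen here for illustration.

```python
import re

matcher = re.compile('ATGCGT|GCT')   # compiled once, outside the loop

dragons = [
    ['CTAGGTGTACTGATG',    'Antipodean Opaleye'],
    ['AAGATGCGTCCGTAT',    'Common Welsh Green'],
    ['AGTCGTGCTCGTTATATC', 'Hebridean Black'],
    ['ATGCGTCGTCGATTATCT', 'Hungarian Horntail'],
    ['CCGTTAGGGCTAAATGCT', 'Norwegian Ridgeback']
]

# matcher.search(text) behaves like re.search(pattern, text)
hits = [name for (dna, name) in dragons if matcher.search(dna)]
assert hits == ['Common Welsh Green', 'Hebridean Black',
                'Hungarian Horntail', 'Norwegian Ridgeback']
```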

Finding Title Case Words

  • Example: find and print all Title Case words in a document
    import re
    
    # Put pattern outside 'find_all' so that it's only compiled once.
    pattern = re.compile(r'\b([A-Z][a-z]*)\b(.*)')
    
    def find_all(line):
        result = []
        match = pattern.search(line)
        while match:
            result.append(match.group(1))
            match = pattern.search(match.group(2))
        return result
    
    lines = [
        'This has several Title Case words',
        'on Each Line (Some in parentheses).'
    ]
    for line in lines:
        print line
        for word in find_all(line):
            print '\t', word
    This has several Title Case words
    	This
    	Title
    	Case
    on Each Line (Some in parentheses).
    	Each
    	Line
    	Some

Finding All Matches

  • Notice how the function gets all matches:
    • Pattern captures what we want in group 1, and everything else on the line in group 2
    • Each time there's a match, continue the search in the remainder captured in group 2

Using findall

  • Much easier to use the findall method
    import re
    
    lines = [
        'This has several Title Case words',
        'on Each Line (Some in parentheses).'
    ]
    pattern = re.compile(r'\b([A-Z][a-z]*)\b')
    for line in lines:
        print line
        for word in pattern.findall(line):
            print '\t', word
    This has several Title Case words
    	This
    	Title
    	Case
    on Each Line (Some in parentheses).
    	Each
    	Line
    	Some

Reference Material

Pattern  Matches                 Doesn't Match  Explanation
a*       "", "a", "aa", …        "A", "b"       * means “zero or more”; matching is case sensitive
b+       "b", "bb", …            ""             + means “one or more”
ab?c     "ac", "abc"             "a", "abbc"    ? means “optional” (zero or one)
[abc]    "a", "b", or "c"        "ab", "d"      […] means “one character from a set”
[a-c]    "a", "b", or "c"                       Character ranges can be abbreviated
[abc]*   "", "ac", "baabcab", …                 Operators can be combined: zero or more choices from "a", "b", or "c"
Regular Expression Operators

Reference Material (cont.)

Method   Purpose                          Example                                                    Result
split    Split a string on a pattern.     re.split('\\s*,\\s*', 'a, b ,c , d')                       ['a', 'b', 'c', 'd']
findall  Find all matches for a pattern.  re.findall('\\b[A-Z][a-z]*', 'Some words in Title Case.')  ['Some', 'Title', 'Case']
sub      Replace matches with new text.   re.sub('\\d+', 'NUM', 'If 123 is 456')                     "If NUM is NUM"
Regular Expression Object Methods
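The same three operations are also available as methods of a compiled pattern. The sketch below repeats the table's examples in that form, written with raw strings; the variable names are invented.

```python
import re

comma = re.compile(r'\s*,\s*')
assert comma.split('a, b ,c , d') == ['a', 'b', 'c', 'd']

title = re.compile(r'\b[A-Z][a-z]*')
assert title.findall('Some words in Title Case.') == ['Some', 'Title', 'Case']

number = re.compile(r'\d+')
assert number.sub('NUM', 'If 123 is 456') == 'If NUM is NUM'
```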

But Wait, There's More

  • We've only scratched the surface
    • Regular expressions have proved to be too useful to remain clean and elegant
  • For example, use pat{N} to match exactly N occurrences of a pattern
    • More generally, pat{M,N} matches between M and N occurrences
  • Most important thing is to build up complex REs one step at a time
    • Write something that matches part of what you're looking for
    • Test it
    • Add to it
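A sketch of both ideas: counted repetition, then an RE built up in steps and tested at each stage. The NNN-NNNN phone number format and the names `step1` to `step3` are assumptions made for the example.

```python
import re

# pat{N}: exactly N occurrences
assert re.search(r'^G{3}$', 'GGG')
assert not re.search(r'^G{3}$', 'GG')

# pat{M,N}: between M and N occurrences
for (dna, expected) in [('TTACTA', False), ('TTAGCTA', True),
                        ('TTAGGGCTA', True), ('TTAGGGGCTA', False)]:
    assert (re.search(r'TTAG{1,3}CTA', dna) is not None) == expected

# Build up one step at a time, testing as you go
step1 = r'\d{3}'                 # three digits
step2 = step1 + r'-\d{4}'        # ...then a dash and four digits
step3 = r'\b' + step2 + r'\b'    # ...not embedded in a longer number

assert re.search(step1, 'call 555-1234')
assert re.search(step2, 'call 555-1234')
assert re.search(step2, '15555-12345')        # step 2 is still too permissive
assert not re.search(step3, '15555-12345')    # step 3 fixes that
assert re.search(step3, 'call 555-1234').group() == '555-1234'
```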

Summary

  • Regular expressions are available in almost every language
    • As a library: C/C++, Java, …
    • Built into the language: Perl, Ruby, …
    • Syntax varies slightly, but the ideas are the same
  • For a broader tutorial, see [Wilson 2005]
  • Because regular expressions are very powerful, there is a tendency to try to use them for too many things. But remember a famous saying:
    Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. -- Jamie Zawinski, Netscape engineer

Exercises

  • By default, regular expression matches are greedy: the first term in the RE matches as much as it can, then the second part, and so on. As a result, if you apply the RE X(.*)X(.*) to the string "XaX and XbX", the first group will contain "aX and Xb", and the second group will be empty.

    It's also possible to make REs match reluctantly, i.e., to have the parts match as little as possible, rather than as much. Find out how to do this, and then modify the RE in the previous paragraph so that the first group winds up containing "a", and the second group " and XbX".

  • What is the easiest way to write a case-insensitive regular expression? (Hint: read the documentation on compilation options.)

Exercises (cont.)

  • What does the VERBOSE option do when compiling a regular expression? Use it to rewrite some of the REs in this lecture in a more readable way.

  • What does the DOTALL option do when compiling a regular expression? Use it to get rid of the call to string.split in the example that finds words ending in vowels.