Regular Expressions in Python

What are Regular Expressions (aka. Regexes)?

  • An idea on how to match some pattern in some text.
  • A tool/language that is available in many places.
  • Has many different "dialects"
  • Has many different modes of processing.
  • The grand concept is the same.
  • Uses the following symbols:
() [] {} . * + ? ^ $ | - \ \d \s \w \A \Z \1 \2 \3

What are Regular Expressions good for?

  • Decide if a string is part of a larger string.
  • Validate the format of some value (string) (e.g. is it a decimal number?, is it a hex?)
  • Find if there are repetitions in a string.
  • Analyze a string and fetch parts of it given some loose description.
  • Cut up a string into parts.
  • Change parts of a string.

Examples

Is the input given by the user a number?

(BTW which one is a number:  23, 2.3, .3, 2., 2.3.4.7.12, 2.4e3, abc ?)

Is there a substring in the file that is repeated 3 or more times?

Replace all occurrences of Python or python by Java ...
... but avoid replacing Monty Python.


Given a text message fetch all the phone numbers:
Fetch numbers that look like 09-1234567
then also fetch +972-2-1234567
and maybe also 09-123-4567
but this #456 is not a phone number


Check if in a given text passing your network there are credit card numbers....


Given a text find if the word "password" is in it and fetch the surrounding text.


Given a log file like this:

[Tue Jun 12 00:01:00 2019] - (3423) - INFO - ERROR log restarted
[Tue Jun 12 09:08:17 2019] - (3423) - INFO - System starts to work
[Tue Jun 13 08:07:16 2019] - (3423) - ERROR - Something is wrong

provide statistics on how many of the different levels of log messages
were seen. Separate the log messages into files.
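
A minimal sketch of the counting part of this task, assuming every row follows the `[timestamp] - (pid) - LEVEL - message` layout of the three sample rows. The names `log_lines`, `level_regex` and `level_counts` are made up for this illustration. Because the regex is anchored to the beginning of the row, the word ERROR inside the first message is not counted as a level.

```python
import re
from collections import Counter

log_lines = [
    "[Tue Jun 12 00:01:00 2019] - (3423) - INFO - ERROR log restarted",
    "[Tue Jun 12 09:08:17 2019] - (3423) - INFO - System starts to work",
    "[Tue Jun 13 08:07:16 2019] - (3423) - ERROR - Something is wrong",
]

# Assumed layout: [timestamp] - (pid) - LEVEL - message
level_regex = re.compile(r'^\[[^\]]*\] - \(\d+\) - (\w+) - ')

level_counts = Counter()
for line in log_lines:
    match = level_regex.search(line)
    if match:
        level_counts[match.group(1)] += 1

print(dict(level_counts))  # {'INFO': 2, 'ERROR': 1}
```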

Where can I use it ?

  • grep, egrep
  • Unix tools such as sed, awk, procmail
  • vi, emacs, other editors
  • text editors such as Multi-Edit
  • .NET languages: C#, C++, VB.NET
  • Java
  • Perl
  • Python
  • PHP
  • Ruby
  • ...
  • Word, Open Office ...
  • PCRE

grep

grep gets a regex and one or more files. It goes over all the files line by line and displays the lines where the regex matched. A few examples:

grep python file.xml    # lines that have the string python in them in file.xml.
grep '[34]' file.xml    # lines that have either 3 or 4 (or both) in file.xml.
grep '[34]' *.xml       # lines that have either 3 or 4 (or both) in every xml file.
grep '[0-9]' *.xml      # lines with a digit in them.
egrep '\b[0-9]' *.xml   # only highlight digits that are at the beginning of a number.

Regexes first match

import re

text = 'The black cat climbed'
match = re.search(r'lac', text)
if match:
    print("Matching")       # Matching
    print(match.group(0))   # lac

match = re.search(r'dog', text)
if match:
    print("Matching")
else:
    print("Did NOT match")
    print(match)     # None

The search method returns an object or None, if it could not find any match. If there is a match you can call the group() method. Passing 0 to it will return the actual substring that was matched.

  • r
  • re
  • search|re
  • group|re

Match numbers

r|re \d group|re

import re

line = 'There is a phone number 12345 in this row and an age: 23'

match = re.search(r'\d+', line)
if match:
    print(match.group(0))  # 12345

Use raw strings for regular expressions: r'a\d'. Otherwise Python itself would try to interpret the backslash (\) sequences before the regex engine sees them.

  • \d matches a digit.
  • + is a quantifier and it tells \d to match one or more digits.

It matches the first occurrence. Here we can see that the group(0) call is much more interesting than earlier.
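
The raw-string advice can be illustrated with a small sketch. It only compares string literals and reuses the \d+ regex from this page; the variable names are made up for the example.

```python
import re

# Without the r prefix every backslash has to be doubled.
same = '\\d+' == r'\d+'
print(same)   # True

# '\b' in a regular string is the backspace character, a single character;
# r'\b' is two characters: a backslash and the letter b.
lengths = (len('\b'), len(r'\b'))
print(lengths)   # (1, 2)

line = 'There is a phone number 12345 in this row'
doubled = re.search('\\d+', line).group(0)
raw = re.search(r'\d+', line).group(0)
print(doubled, raw)   # 12345 12345
```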

Capture

()|re

import re

line = 'There is a phone number 12345 in this row and an age: 23'

match = re.search(r'age: \d+', line)
if match:
  print(match.group(0))  # age: 23


match = re.search(r'age: (\d+)', line)
if match:
    print(match.group(0))  # age: 23
    print(match.group(1))  # 23      the first group of parentheses

    print(match.groups())  # ('23',)
    print(len(match.groups()))  # 1

Parentheses in the regular expression can enclose any sub-expression. Whatever this sub-expression matches will be saved and can be accessed using the group() method.

Capture more

()|re \w|re

import re

line = 'There is a phone number 12345 in this row and an age: 23'

match = re.search(r'(\w+) (\w+): (\d+)', line)
if match:
    print(match.group(0))  # an age: 23  the full match
    print(match.group(1))  # an          the 1st group of parentheses
    print(match.group(2))  # age         the 2nd group of parentheses
    print(match.group(3))  # 23          the 3rd group of parentheses

    # print(match.group(4))  # IndexError: no such group
    print(match.groups())  # ('an', 'age', '23')
    print(len(match.groups()))  # 3

Some groups might match '' or even not match at all, in which case we get None in the appropriate match.group() call and in the match.groups() call
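
A short sketch of both cases. The regex and the input strings are invented for this illustration; the ? and * quantifiers it relies on are explained later.

```python
import re

# The first group is optional (?), the second one may match the empty string (*).
regex = r'(name: \w+ )?age: (\d*)'

match = re.search(regex, 'age: 23')
groups1 = match.groups()
print(match.group(1))  # None   the optional group did not take part in the match
print(match.group(2))  # 23
print(groups1)         # (None, '23')

match = re.search(regex, 'age: unknown')
groups2 = match.groups()
print(groups2)         # (None, '')   the second group matched the empty string
```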

Capture even more

import re

line = 'There is a phone number 12345 in this row and an age: 23'

match = re.search(r'((\w+) (\w+)): (\d+)', line)
if match:
    print(match.group(0))  # an age: 23
    print(match.group(1))  # an age
    print(match.group(2))  # an
    print(match.group(3))  # age
    print(match.group(4))  # 23

    print(match.groups())  # ('an age', 'an', 'age', '23')
    print(len(match.groups()))  # 4

Named capture

\P P

import re

line = 'There is a phone number 12345 in this row and an age: 23'

regex = r'((?P<word>\w+) (?P<key>\w+)): (?P<value>\d+)'

match = re.search(regex, line)
if match:
    print(match.group(0))  # an age: 23
    print(match.group(1))  # an age
    print(match.group(2))  # an
    print(match.group(3))  # age
    print(match.group(4))  # 23

    print(match.group('word'))   # an
    print(match.group('key'))    # age
    print(match.group('value'))  # 23

    print(match.groups())  # ('an age', 'an', 'age', '23')
    print(len(match.groups()))  # 4

matches = re.findall(regex, line)
print(matches) # [('an age', 'an', 'age', '23')]
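
The match object also has a groupdict() method that returns only the named groups as a dictionary. A brief sketch using the same regex; the variable name `named` is made up for the example.

```python
import re

line = 'There is a phone number 12345 in this row and an age: 23'
regex = r'((?P<word>\w+) (?P<key>\w+)): (?P<value>\d+)'

match = re.search(regex, line)
named = match.groupdict() if match else {}
print(named)  # {'word': 'an', 'key': 'age', 'value': '23'}
```

The unnamed outer group ('an age') is not included in the dictionary; it is only reachable by its number.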

findall

import re

line1 = 'There is a phone number 12345 in this row and another 42 number'
numbers1 = re.findall(r'\d+', line1)
print(numbers1) # ['12345', '42']

line2 = 'There are no numbers in this row. Not even one.'
numbers2 = re.findall(r'\d+', line2)
print(numbers2) # []

re.findall returns the matched substrings.

findall with capture

import re

line = 'There is a phone number 12345 in this row and another 42 number'
match = re.search(r'\w+ \d+', line)
if match:
    print(match.group(0))   # number 12345

match = re.search(r'\w+ (\d+)', line)
if match:
    print(match.group(0))   # number 12345
    print(match.group(1))   # 12345

matches = re.findall(r'\w+ \d+', line)
print(matches)  # ['number 12345', 'another 42']

matches = re.findall(r'\w+ (\d+)', line)
print(matches)  # ['12345', '42']

findall with capture more than one

import re

line = 'There is a phone number 12345 in this row and another 42 number'
match = re.search(r'(\w+) (\d+)', line)
if match:
    print(match.group(1))   # number
    print(match.group(2))   # 12345

matches = re.findall(r'(\w+) (\d+)', line)
print(matches)  # [('number', '12345'), ('another', '42')]

If there are multiple capture groups then the returned list will consist of tuples.

Any Character

. matches any one character except newline.

For example: #.#

import re

strings = [
    'abc',
    'text: #q#',
    'str: #a#',
    'text #b# more text',
    '#a  and this? #c#',
    '#a  and this? # c#',
    '#@#',
    '#.#',
    '# #',
    '##',
    '###'
]

for s in strings:
    print('str:  ', s)
    match = re.search(r'#.#', s)
    if match:
        print('match:', match.group(0))

If re.DOTALL is given, newline will also be matched.

Match dot

. \

import re

cases = [
    "hello!",
    "hello world.",
    "hello. world",
    ".",
]

for case in cases:
    print(case)
    match = re.search(r'.', case)   # Match any character
    if match:
        print(match.group(0))

print("----")

for case in cases:
    print(case)
    match = re.search(r'\.', case)  # Match a dot
    if match:
        print(match.group(0))

print("----")

for case in cases:
    print(case)
    match = re.search(r'[.]', case) # Match a dot
    if match:
        print(match.group(0))

Character classes

[]

We would like to match any string that has any of the #a#, #b#, #c#, #d#, #e#, #f#, #@# or #.#

import re

strings = [
    'abc',
    'text: #q#',
    'str: #a#',
    'text #b# more text',
    '#ab#',
    '#@#',
    '#.#',
    '# #',
    '##',
    '###'
]


for s in strings:
    print('str:  ', s)
    match = re.search(r'#[abcdef@.]#', s)
    if match:
        print('match:', match.group(0))
r'#[abcdef@.]#'
r'#[a-f@.]#'

Common character classes

\d \w \s

  • \d digit: [0-9] or Unicode Characters in the 'Number, Decimal Digit' Category

  • \w word character [a-zA-Z0-9_] (digits, letters, underscore) or see the Unicode set of digits and letters

  • \s white space: [\f\t\n\r\v ] form-feed, tab, newline, carriage return, vertical-tab, and SPACE

  • Use stand alone: \d or as part of a larger character class: [abc\d]

Negated character class

\D \W \S

  • [^abc] matches any one character that is not 'a', not 'b' and not 'c'.
  • \D not digit [^\d]
  • \W not word character [^\w]
  • \S not white space [^\s]
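
A small sketch of the negated classes at work. The sample text is invented for this example.

```python
import re

text = 'Room 42, floor 3'

non_digits = re.findall(r'\D+', text)
non_spaces = re.findall(r'\S+', text)
neither = re.findall(r'[^\d\s]+', text)   # not a digit and not white space
non_word = re.findall(r'\W', text)

print(non_digits)  # ['Room ', ', floor ']
print(non_spaces)  # ['Room', '42,', 'floor', '3']
print(neither)     # ['Room', ',', 'floor']
print(non_word)    # [' ', ',', ' ', ' ']
```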

Character classes summary

a[bc]a      # aba, aca
a[2#=x?.]a  # a2a, a#a, a=a, axa, a?a, a.a
            # inside the character class most of the special characters lose their
            # special meaning  BUT there are some new special characters
a[2-8]a     # is the same as /a[2345678]a/
a[2-]a      # a2a, a-a        - has no special meaning at the ends
a[-8]a      # a8a, a-a
a[6-C]a     # a6a, a7a ... aCa
              #      characters from the ASCII table: 6789:;<=>?@ABC
a[C-6]a     # Error: "bad character range"

a[^xa]a     # "aba", "aca"  but not "aaa", "axa"    what about "aa" ?
              # ^ as the first character in a character class means 
              # a character that is not in the list
a[a^x]a     # aaa, a^a, axa
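
The question about "aa" can be checked with a quick sketch. The list of test strings, including the one with a newline, is an addition made for this example.

```python
import re

regex = r'a[^xa]a'
results = {}
for text in ['aba', 'aca', 'aaa', 'axa', 'aa', 'a\na']:
    results[text] = bool(re.search(regex, text))
    print(repr(text), results[text])
```

A negated character class still has to match exactly one character, so "aa" does not match. It does match a newline, unlike the dot.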

Character classes and Unicode characters

import re

text = "👷👸👹👺👻✍👼👽👾👿💀💁💂"

print(text)
#print(chr(128120))
#print(0x1f000)

match = re.search(r"[\U0001f000-\U00020000]+", text)
if match:
    print(match.group(0))

for emoji in text:
    print(emoji, ord(emoji), "{:x}".format(ord(emoji)))

match = re.search(r"[👷-💂]*", text)
print(match.group(0))

Character classes for Hebrew text

import re

text = "שלום כיתה א"
print(text)            # שלום כיתה א
print(ord(text[-1]))   # 1488
print(text[-1])        # א

match = re.search(r"[א-ת]", text)
print(match.group(0))  # ש

match = re.search(r"[א-ת]+", text)
print(match.group(0))  # שלום

match = re.search(r"[ א-ת]*", text)
print(match.group(0))  # שלום כיתה א

match = re.search(r'[\u05d0-\u05ea]+', text)
print(match.group(0))  # שלום

# Hebrew has 22 letters, 5 of them have a different version at the end of the word
# A total of 27 letters
for ix in range(1488, 1488+27):
    print(f"{ix} {chr(ix)}")

# 1488 א
# 1489 ב
# 1490 ג
# 1491 ד
# 1492 ה
# 1493 ו
# 1494 ז
# 1495 ח
# 1496 ט
# 1497 י
# 1498 ך
# 1499 כ
# 1500 ל
# 1501 ם
# 1502 מ
# 1503 ן
# 1504 נ
# 1505 ס
# 1506 ע
# 1507 ף
# 1508 פ
# 1509 ץ
# 1510 צ
# 1511 ק
# 1512 ר
# 1513 ש
# 1514 ת

Match digits

import re

values = [
    '2',
    '٣', # Arabic 3
    '½', # unicode 1/2
    '②', # unicode circled 2
    '߄', # NKO 4 (a writing system for the Manding languages of West Africa)
    '६', # Devanagari aka. Nagari (Indian)
    '_', # underscore
    '-', # dash
    'a', # Latin a
    'á', # Hungarian
    'א', # Hebrew aleph
]

for val in values:
    print(val)
    match = re.search(r'\d', val)
    if match:
        print('Match ', match.group(0))

    match = re.search(r'\d', val, re.ASCII)
    if match:
        print('Match ASCII ', match.group(0))

Output:

2
Match  2
Match ASCII  2
٣
Match  ٣
½
②
߄
Match  ߄
६
Match  ६
_
-
a
á
א

Word Characters

import re

values = [
    '2',
    '٣', # Arabic 3
    '½', # unicode 1/2
    '②', # unicode circled 2
    '߄', # NKO 4 (a writing system for the Manding languages of West Africa)
    '६', # Devanagari aka. Nagari (Indian)
    '_', # underscore
    '-', # dash
    'a', # Latin a
    'á', # Hungarian
    'א', # Hebrew aleph

]

for val in values:
    print(val)
    match = re.search(r'\w', val)
    if match:
        print('Match ', match.group(0))

    match = re.search(r'\w', val, re.ASCII)
    if match:
        print('Match ASCII ', match.group(0))

Output:

2
Match  2
Match ASCII  2
٣
Match  ٣
½
Match  ½
②
Match  ②
߄
Match  ߄
६
Match  ६
_
Match  _
Match ASCII  _
-
a
Match  a
Match ASCII  a
á
Match  á
א
Match  א

Exercise: add numbers

Given a file like this:

Foo:1
Foo:2
Foo:3
Foo:4
Foo:5
Foo:6
Foo:7
Foo:8
Bar:23
Foo:23
Foo:11
Foo:9
Bar:8
Zorg:7
  • Add up the scores for each name and print the result.
Foo   : 79
Bar   : 31
Zorg  :  7
  • Make it work also on a file that looks like this:
# Let's start with Foo:1

Foo:1
Foo: 2
Foo :3
Foo : 4
  Foo:5
  Foo: 6
  Foo :7
  Foo : 8

# Let's start Bar with : 23
Bar:23

Foo: 23
Foo:   11
   Foo  : 9
  Bar:  8
Zorg: 7

Solution: add numbers

import sys


def add_grades(filename):
    grades = {}
    with open(filename) as fh:
        for line in fh:
            line = line.rstrip("\n")
            name, grade = line.split(":")
            if name not in grades:
                grades[name] = 0
            grades[name] += int(grade)
    for name in sorted(grades.keys(), key=lambda name: grades[name], reverse=True):
        print(f"{name:6}:{grades[name]:-3}")

if __name__ == '__main__':
    if len(sys.argv) != 2:
        exit(f"Usage: {sys.argv[0]} FILENAME")
    filename = sys.argv[1]
    add_grades(filename)

Solution: add numbers

import sys


def add_grades(filename):
    grades = {}
    with open(filename) as fh:
        for line in fh:
            line = line.rstrip("\n")
            line = line.strip()
            if line.startswith("#"):
                continue
            if line == '':
                continue
            name, grade = line.split(":")
            name = name.strip()
            if name not in grades:
                grades[name] = 0
            grades[name] += int(grade)

    for name in sorted(grades.keys(), key=lambda name: grades[name], reverse=True):
        print(f"{name:6}:{grades[name]:-3}")

if __name__ == '__main__':
    if len(sys.argv) != 2:
        exit(f"Usage: {sys.argv[0]} FILENAME")
    filename = sys.argv[1]
    add_grades(filename)

Solution: add numbers

import re
import sys


def add_grades(filename):
    grades = {}
    with open(filename) as fh:
        for line in fh:
            if re.search(r'^\s*(#.*)?$', line):
                continue
            match = re.search(r'^\s*(\w+)\s*:\s*(\d+)\s*$', line)
            if match:
                name = match.group(1)
                value = int(match.group(2))
            else:
                raise Exception(f"Invalid row: '{line}'")
            if name not in grades:
                grades[name] = 0
            grades[name] += value
        for name in sorted(grades.keys(), key=lambda name: grades[name], reverse=True):
            print(f"{name:6}:{grades[name]:-3}")

if __name__ == '__main__':
    if len(sys.argv) != 2:
        exit(f"Usage: {sys.argv[0]} FILENAME")
    filename = sys.argv[1]
    add_grades(filename)

Optional character

  • ?

Match the word color or the word colour

Regex: r'colou?r'
Input: color
Input: colour
Input: colouur
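
A minimal sketch that runs the regex on the three inputs listed above; the dictionary `found` is just a helper for this example.

```python
import re

found = {}
for text in ['color', 'colour', 'colouur']:
    match = re.search(r'colou?r', text)
    found[text] = match.group(0) if match else None
    print(text, found[text])

# color color
# colour colour
# colouur None
```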

Regex match 0 or more (the * quantifier)

Any line with two dashes (-) with anything in between.

Regex: r'-.*-'
Input: "ab"
Input: "ab - cde"
Input: "ab - qqqrq -"
Input: "ab -- cde"
Input: "--"
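
The same inputs can be tried in a short sketch; the dictionary `found` is a helper name made up for this example.

```python
import re

inputs = ["ab", "ab - cde", "ab - qqqrq -", "ab -- cde", "--"]
found = {}
for text in inputs:
    match = re.search(r'-.*-', text)
    found[text] = match.group(0) if match else None
    print(f"{text!r:16} {found[text]!r}")
```

Only the last three inputs match, because * also accepts zero characters between the two dashes.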

Quantifiers

  • ?

Quantifiers apply to the thing immediately to the left of them.

In this case it is the single character x to the left of the quantifier, but later we'll see it can apply to a character-class or to a sub-expression enclosed in parentheses as well. Whatever is located immediately to the left of the quantifier.

r'ax*a'      # aa, axa, axxa, axxxa, ...
r'ax+a'      #     axa, axxa, axxxa, ...
r'ax?a'      # aa, axa
r'ax{2,4}a'  #          axxa, axxxa, axxxxa
r'ax{3,}a'   #                axxxa, axxxxa, ...
r'ax{17}a'   #                                 axxxxxxxxxxxxxxxxxa

Quantifier   Number of repetitions
*            0 or more
+            1 or more
?            0 or 1
{n,m}        between n and m
{n,}         n or more
{n}          exactly n
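
The quantifiers can be compared side by side with a small sketch. The list of test strings is chosen for this example only.

```python
import re

regexes = [r'ax*a', r'ax+a', r'ax?a', r'ax{2,4}a', r'ax{3,}a']
strings = ['aa', 'axa', 'axxa', 'axxxa', 'axxxxxa']

results = {}
for regex in regexes:
    # collect the strings in which the regex matched
    results[regex] = [s for s in strings if re.search(regex, s)]
    print(f'{regex:10} {results[regex]}')
```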

Quantifiers limit

import re

strings = (
    "axxxa",
    "axxxxa",
    "axxxxxa",
)

for text in strings:
    match = re.search(r'ax{4}', text)
    if match:
        print(f"Match {text}")
        print(match.group(0))
    else:
        print("NOT Match")

Quantifiers on character classes

import re

strings = (
    "-a-",
    "-b-",
    "-x-",
    "-aa-",
    "-ab-",
    "--",
)

for line in strings:
    match = re.search(r'-[abc]-', line)
    if match:
        print(line)
print('=========================')

for line in strings:
    match = re.search(r'-[abc]+-', line)
    if match:
        print(line)
print('=========================')

for line in strings:
    match = re.search(r'-[abc]*-', line)
    if match:
        print(line)


Greedy quantifiers

import re

match = re.search(r'xa*', 'xaaab')
print(match.group(0))

match = re.search(r'xa*', 'xabxaab')
print(match.group(0))

match = re.search(r'a*',  'xabxaab')
print(match.group(0))

match = re.search(r'a*',  'aaaxabxaab')
print(match.group(0))

They match 'xaaa', 'xa', '' and 'aaa' respectively.

Minimal quantifiers

import re

match = re.search(r'a.*b', 'axbzb')
print(match.group(0))

match = re.search(r'a.*?b', 'axbzb')
print(match.group(0))


match = re.search(r'a.*b', 'axy121413413bq')
print(match.group(0))

match = re.search(r'a.*?b', 'axyb121413413q')
print(match.group(0))

Anchors

  • \A

  • \Z

  • ^

  • $

  • \A matches the beginning of the string

  • \Z matches the end of the string

  • ^ matches the beginning of the row (see also re.MULTILINE)

  • $ matches the end of the row but will accept a trailing newline (see also re.MULTILINE)

import re

lines = [
    "text with cat in the middle",
    "cat with dog",
    "dog with cat",
]

for line in lines:
    if re.search(r'cat', line):
        print(line)


print("---")
for line in lines:
    if re.search(r'^cat', line):
        print(line)

print("---")
for line in lines:
    if re.search(r'\Acat', line):
        print(line)

print("---")
for line in lines:
    if re.search(r'cat$', line):
        print(line)

print("---")
for line in lines:
    if re.search(r'cat\Z', line):
        print(line)

Output:

text with cat in the middle
cat with dog
dog with cat
---
cat with dog
---
cat with dog
---
dog with cat
---
dog with cat
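
The difference between $ and \Z shows up only when the string ends with a newline. A small sketch, with a trailing "\n" added to one of the sample lines for this example:

```python
import re

line = "dog with cat\n"

dollar = bool(re.search(r'cat$', line))        # $ accepts the trailing newline
z_only = bool(re.search(r'cat\Z', line))       # \Z insists on the very end of the string
z_newline = bool(re.search(r'cat\n\Z', line))

print(dollar)     # True
print(z_only)     # False
print(z_newline)  # True
```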

Anchors with multiline

import re

text = """
text with cat in the middle
cat with dog
dog with cat"""

if re.search(r'dog', text):
    print(text)

print("---")
if re.search(r'^dog', text):
    print('Caret dog')

print("---")
if re.search(r'\Adog', text):
    print('A dog')

print("---")
if re.search(r'dog$', text):
    print('$ dog')

print("---")
if re.search(r'dog\Z', text):
    print('Z dog')

print("-----------------")
if re.search(r'^dog', text, re.MULTILINE):
    print('^ dog')

print("---")
if re.search(r'\Adog', text, re.MULTILINE):
    print('A dog')

print("---")
if re.search(r'dog$', text, re.MULTILINE):
    print('$ dog')

print("---")
if re.search(r'dog\Z', text, re.MULTILINE):
    print('Z dog')

Anchors on both ends

import re

strings = [
    "123",
    "hello 456 world",
    "hello world",
]

for line in strings:
    if re.search(r'\d+', line):
        print(line)

print('---')

for line in strings:
    if re.search(r'^\d+$', line):
        print(line)


print('---')

for line in strings:
    if re.search(r'\A\d+\Z', line):
        print(line)


Output:

123
hello 456 world
---
123
---
123

Match ISBN numbers

import re

strings = [
    '99921-58-10-7',
    '9971-5-0210-0',
    '960-425-059-0',
    '80-902734-1-6',
    '85-359-0277-5',
    '1-84356-028-3',
    '0-684-84328-5',
    '0-8044-2957-X',
    '0-85131-041-9',
    '0-943396-04-2',
    '0-9752298-0-X',

    '0-975229-1-X',
    '0-9752298-10-X',
    '0-9752298-0-Y',
    '910975229-0-X',
    '-------------',
    '0000000000000',
    '3-3-3-X',
]
for isbn in strings:
    print(isbn)

    if (re.search(r'^[0-9X-]{13}$', isbn)):
        print("match 1")

    if (len(isbn) == 13 and re.search(r'^[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,5}-[0-9X]$', isbn)):
        print("match 2")

Matching a section

import re

text = "This is <a string> with some <sections> marks."

m = re.search(r'<.*>', text)
if m:
    print(m.group(0))

Matching a section - minimal

import re

text = "This is <a string> with some <sections> marks."

m = re.search(r'<.*?>', text)
if m:
    print(m.group(0))

Matching a section negated character class

import re

text = "This is <a string> with some <sections> marks."

m = re.search(r'<[^>]*>', text)
if m:
    print(m.group(0))
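
The three approaches (greedy, minimal, negated character class) can be compared with findall. The multi-line string `multi` is an assumption added for this sketch.

```python
import re

text = "This is <a string> with some <sections> marks."

greedy = re.findall(r'<.*>', text)
minimal = re.findall(r'<.*?>', text)
negated = re.findall(r'<[^>]*>', text)

print(greedy)   # ['<a string> with some <sections>']
print(minimal)  # ['<a string>', '<sections>']
print(negated)  # ['<a string>', '<sections>']

# The last two differ when a section spans more than one line,
# because . does not match a newline but [^>] does.
multi = "<a\nstring>"
minimal_multi = re.findall(r'<.*?>', multi)
negated_multi = re.findall(r'<[^>]*>', multi)
print(minimal_multi)  # []
print(negated_multi)  # ['<a\nstring>']
```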

Two regex with logical or

All the rows with either 'apple pie' or 'banana pie' in them.

import re

strings = [
    'apple pie',
    'banana pie',
    'apple'
]

for s in strings:
    #print(s)
    match1 = re.search(r'apple pie', s)
    match2 = re.search(r'banana pie', s)
    if match1 or match2:
        print('Matched in', s)

Output:

Matched in apple pie
Matched in banana pie

Alternatives

  • |

import re

strings = [
    'apple pie',
    'banana pie',
    'apple'
]

for line in strings:
    match = re.search(r'apple pie|banana pie', line)
    if match:
        print('Matched in', line)

Output:

Matched in apple pie
Matched in banana pie

Grouping and Alternatives

  • ()

Move the common part into one place and limit the alternation to the part within the parentheses.

import re

strings = [
    'apple pie',
    'banana pie',
    'apple'
]

for line in strings:
    match = re.search(r'(apple|banana) pie', line)
    if match:
        print('Matched in', line)

Output:

Matched in apple pie
Matched in banana pie

Internal variables

  • \1
  • \2
  • \3
  • \4
import re

strings = [
    'banana',
    'apple',
    'infinite loop',
]

for line in strings:
    match = re.search(r'(.)\1', line)
    if match:
        print(match.group(0), 'matched in', line)
        print(match.group(1))

Output:

pp matched in apple
p
oo matched in infinite loop
o

More internal variables

  • \1
import re


line = "one 123 and two 123 and oxxo 23"

match = re.search(r"(.)(.)\2\1", line)
if match:
    print(match.group(1)) # o
    print(match.group(2)) # x

match = re.search(r"(\d\d).*\1", line)
if match:
    print(match.group(1)) # 12

match = re.search(r"(\d\d).*\1.*\1", line)
if match:
    print(match.group(1)) # 23


match = re.search(r"(\d\d).*\1{2,3}",  "45 afjh 4545 kjdhfkh")
if match:
    print(match.group(1)) # 45


# (.{5}).*\1

Regex DNA

  • DNA is built from G, A, T, C
  • Let's create a random DNA sequence
  • Then find the longest repeated sequence in it
import re
import random

chars = ['G', 'A', 'T', 'C']
dna = 'AT'
for i in range(100):
    dna += random.choice(chars)

print(dna)

# finds the first one, not necessarily the longest one
match = re.search(r"([GATC]+).*\1", dna)
if match:
    print(match.group(1))

'''
Generating regexes:

   ([GATC]{1}).*\1
   ([GATC]{2}).*\1
   ([GATC]{3}).*\1
   ([GATC]{4}).*\1
   ...
'''
length = 1
result = ''
while True:
    regex = r'([GATC]{' + str(length) + r'}).*\1'
    #print(regex)
    m = re.search(regex, dna)
    if m:
        result = m.group(1)
        length += 1
    else:
        break

print(result)
print(len(result))

Output:

ATTTTATTGAGTCCTCTCGTGTGGTGTGATTGGGTGCAATTACCCCAAGGGCTCAAGTAATTCCCACATGATGATCAATGAGAGACCTGAATTAGCCATGCA
ATT
GTGTG
5

Regex IGNORECASE

  • IGNORECASE
  • I
import re

s = 'Python'

if (re.search('python', s)):
    print('python matched')

if (re.search('python', s, re.IGNORECASE)):
    print('python matched with IGNORECASE')

DOTALL S (single line)

  • .
  • DOTALL
  • S

If re.DOTALL is given, . will match any character, including newlines.

import re

line = 'Before <div>content</div> After'

text = '''
Before
<div>
content
</div>
After
'''

match = re.search(r'<div>.*</div>', line)
if match:
    print(f"line '{match.group(0)}'")

match = re.search(r'<div>.*</div>', text)
if match:
    print(f"text '{match.group(0)}'")

print('-' * 10)

match = re.search(r'<div>.*</div>', line, re.DOTALL)
if match:
    print(f"line '{match.group(0)}'")

match = re.search(r'<div>.*</div>', text, re.DOTALL)
if match:
    print(f"text '{match.group(0)}'")

MULTILINE M

  • ^
  • $
  • MULTILINE
  • M

If re.MULTILINE is given, ^ will match the beginning of each line and $ will match the end of each line.

import re

line = 'Start   blabla End'

text = '''
prefix
Start
blabla
End
postfix
'''

regex = r'^Start[\d\D]*End$'
m = re.search(regex, line)
if (m):
    print('line')

m = re.search(regex, text)
if (m):
    print('text')

print('-' * 10)

m = re.search(regex, line, re.MULTILINE)
if (m):
    print('line')

m = re.search(regex, text, re.MULTILINE)
if (m):
    print('text')

Output:

line
----------
line
text

Combine Regex flags

  • |

  • Use the bitwise or for that.

re.MULTILINE | re.DOTALL
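
A sketch of what the combination buys us, assuming a multi-line text like the one used for MULTILINE: ^ and $ work per line and . is allowed to cross the newlines. The variable names are made up for this example.

```python
import re

text = '''
prefix
Start
blabla
End
postfix
'''

regex = r'^Start.*End$'

only_multiline = re.search(regex, text, re.MULTILINE)
only_dotall = re.search(regex, text, re.DOTALL)
both = re.search(regex, text, re.MULTILINE | re.DOTALL)

print(only_multiline)        # None   . cannot cross the newlines
print(only_dotall)           # None   ^ matches only at the beginning of the string
print(repr(both.group(0)))   # 'Start\nblabla\nEnd'
```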

Regex VERBOSE X

  • VERBOSE
  • X
import re

email = "foo@bar.com"

m = re.search(r'\w[\w.-]*\@([\w-]+\.)+(com|net|org|uk|hu|il)', email)
if (m):
    print('match 1')


# To make the regex more readable we can break it into rows and add comments:
m = re.search(r'''
                \w[\w.-]*               # username
                \@
                ([\w-]+\.)+             # domain
                (com|net|org|uk|hu|il)  # top-level domain
                ''', email, re.VERBOSE)
if (m):
    print('match 2')


# Improvement to make the code *after* the regex more readable using named captures
m = re.search(r'(?P<username>\w[\w.-]*)\@(?P<domain>[\w-]+\.)+(?P<gtld>com|net|org|uk|hu|il)', email)


# Both, use named captures and also break it up to rows.
m = re.search(r'''
              (?P<username>\w[\w.-]*)
              \@
              (?P<domain>[\w-]+\.)+
              (?P<gtld>com|net|org|uk|hu|il)   # I only handle a few because this is just an example
              ''', email, re.VERBOSE)


Substitution

import re

line = "abc123def"

print(re.sub(r'\d+', ' ', line)) # "abc def"
print(line)                      # "abc123def"

print(re.sub(r'x', ' y', line))  # "abc123def"
print(line)                      # "abc123def"

print(re.sub(r'([a-z]+)(\d+)([a-z]+)', r'\3\2\1', line))   #  "def123abc"
print(re.sub(r'''
([a-z]+)     # letters
(\d+)        # digits
([a-z]+)     # more letters
''', r'\3\2\1', line, flags=re.VERBOSE))   #  "def123abc"

print(re.sub(r'...', 'x', line))             # "xxx"
print(re.sub(r'...', 'x', line, count=1))    # "x123def"

print(re.sub(r'(.)(.)', r'\2\1', line))            # "ba1c32edf"
print(re.sub(r'(.)(.)', r'\2\1', line, count=2))   # "ba1c23def"

Substitution and MULTILINE - remove leading spaces

import re

text = """  First row
Second row
  Third row
"""

print(text)
print('-----')
print(re.sub(r'^\s+', '', text))
print('-----')
print(re.sub(r'\A\s+', '', text, flags=re.MULTILINE))
print('-----')
print(re.sub(r'^\s+', '', text, flags=re.MULTILINE))

Output:

  First row
Second row
  Third row

-----
First row
Second row
  Third row

-----
First row
Second row
  Third row

-----
First row
Second row
Third row

findall capture

If there is more than one pair of capturing parentheses in the regex, findall returns a list of tuples. With a single pair it returns a list of the captured strings.

import re

line = 'There is a phone number 83795 in this row and another 42 number'
print(line)

search = re.search(r'(\d)(\d)', line)
if search:
  print(search.group(1))   # 8
  print(search.group(2))   # 3

matches = re.findall(r'(\d)(\d)', line)
if matches:
  print(matches)  # [('8', '3'), ('7', '9'), ('4', '2')]

matches = re.findall(r'(\d)\D*', line)
if matches:
  print(matches)  # ['8', '3', '7', '9', '5', '4', '2']

matches = re.findall(r'(\d)\D*(\d?)', line)
print(matches)  # [('8', '3'), ('7', '9'), ('5', '4'), ('2', '')]

matches = re.findall(r'(\d).*?(\d)', line)
print(matches) # [('8', '3'), ('7', '9'), ('5', '4')]

matches = re.findall(r'(\d+)\D+(\d+)', line)
print(matches) # [('83795', '42')]

matches = re.findall(r'(\d+).*?(\d+)', line)
print(matches) # [('83795', '42')]

matches = re.findall(r'\d', line)
print(matches) # ['8', '3', '7', '9', '5', '4', '2']

Fixing dates

In the input we get dates like this 2010-7-5 but we would like to make sure we have two digits for both days and months: 2010-07-05

import re

def test_date(function):
    dates = {
        '2010-7-5'   : '2010-07-05',
        '2010-11-5'  : '2010-11-05',
        '2010-07-5'  : '2010-07-05',
        '2010-07-05' : '2010-07-05',
        '2010-7-15'  : '2010-07-15',
    }

    failures = 0
    for original in sorted(dates.keys()):
        result = function(original)

        if result != dates[original]:
            failures += 1
            print(f"      old: {original}")
            print(f"      new: {result}")
            print(f" expected: {dates[original]}")
            print("")
    if failures == 0:
        print("Everything looks good")
    else:
        exit(failures)

Fixing dates - 1

from date import test_date
import re

def fix_date1(date):
    return re.sub(r'-(\d)', r'-0\1', date)

test_date(fix_date1)

Output:

      old: 2010-07-05
      new: 2010-007-005
 expected: 2010-07-05

      old: 2010-07-5
      new: 2010-007-05
 expected: 2010-07-05

      old: 2010-11-5
      new: 2010-011-05
 expected: 2010-11-05

      old: 2010-7-15
      new: 2010-07-015
 expected: 2010-07-15

Fixing dates - 2

from date import test_date
import re

def fix_date2(date):
    return re.sub(r'-(\d)-', r'-0\1', date)

test_date(fix_date2)

Output:

      old: 2010-07-5
      new: 2010-07-5
 expected: 2010-07-05

      old: 2010-11-5
      new: 2010-11-5
 expected: 2010-11-05

      old: 2010-7-15
      new: 2010-0715
 expected: 2010-07-15

      old: 2010-7-5
      new: 2010-075
 expected: 2010-07-05

Fixing dates - 3

from date import test_date
import re

def fix_date3(date):
    return re.sub(r'-(\d)(-|$)', r'-0\1\2', date)

test_date(fix_date3)

Output:

      old: 2010-7-5
      new: 2010-07-5
 expected: 2010-07-05

Fixing dates - 4

from date import test_date
import re

def fix_date4(date):
    return re.sub(r'-(\d)\b', r'-0\1', date)

test_date(fix_date4)

Output:

Everything looks good

Anchor edge of word

  • \b

  • \b beginning of word or end of word

import re

text = 'x a xyb x-c qwer_  ut!@y'

print(re.findall(r'.\b.', text))

print(re.findall(r'\b.', 'a b '))

print(re.findall(r'.\b', 'a b '))

Output:

['x ', 'a ', 'b ', 'x-', 'c ', '_ ', ' u', 't!', '@y']
['a', ' ', 'b', ' ']
['a', ' ', 'b']
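
A common use of \b is matching a whole word only. A small sketch with a sentence invented for this example:

```python
import re

text = 'A cat, a concatenation and a bobcat.'

anywhere = re.findall(r'cat', text)
whole_word = re.findall(r'\bcat\b', text)
replaced = re.sub(r'\bcat\b', 'dog', text)

print(anywhere)    # ['cat', 'cat', 'cat']
print(whole_word)  # ['cat']
print(replaced)    # A dog, a concatenation and a bobcat.
```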

Double numbers

We have a string that has some numbers in it. We would like to double the numbers.

In the first example we can see a relatively simple way of doubling the numbers. We capture a number using the (\d+) expression that will save the current number in \1 and then we include it twice: \1\1. This will convert a number like 1 to 11. This is nice, but probably not what we wanted. We wanted to convert 1 to 2 and 34 to 68.

We can't do that with plain regular expressions and substitutions as that is all string-based. The plain substitution can only move characters around, but it cannot do any complex operations on them and thus cannot compute anything.

However, if the substitution part is a function then Python will call that function passing in the match object and whatever the function returns will be the replacement string. This function can be a regular function defined with def or a lambda expression.

In the second example we see the solution with lambda-expression.

The third example is the same solution but in a very step-by-step way with lots of temporary variables. This will hopefully help understand the lambda-expression in the second example.

import re

text = "This is 1 string with 3 numbers: 34"

new_text = re.sub(r'(\d+)', r'\1\1', text)
print(new_text)   # This is 11 string with 33 numbers: 3434

double_numbers = re.sub(r'(\d+)', lambda match: str(2 * int(match.group(0))), text)
print(double_numbers)  # This is 2 string with 6 numbers: 68


# The same but in a function

def double(match):
    matched_number_as_str = match.group(0)
    number = int(matched_number_as_str)
    doubled_number = 2 * number
    doubled_number_as_str = str(doubled_number)
    return doubled_number_as_str

double_numbers = re.sub(r'(\d+)', double, text)
print(double_numbers)  # This is 2 string with 6 numbers: 68

Remove spaces

line = "  ab cd  "

res = line.lstrip(" ")
print(f"'{res}'")        # 'ab cd  '

res = line.rstrip(" ")
print(f"'{res}'")        # '  ab cd'

res = line.strip(" ")
print(f"'{res}'")        # 'ab cd'

res = line.replace(" ", "")
print(f"'{res}'")        # 'abcd'

Replace string in Assembly code

At a company we had some Assembly code that looked like the sample file below. In case you don't know, Assembly is a very low-level programming language. Here we have some sample code that uses variables like A and registers like R1, R2, and R3.

We had this code, but because the hardware changed we had to change the code and rename the registers: R1 to R2, R2 to R3, and R3 to R1. We cannot simply do these steps one after the other, because once we have renamed R1 to R2 we won't know which of the R2-s still need to be renamed and which are new. So someone came up with the idea to use R4 as a temporary name: start by renaming R1 to R4, then R3 to R1, R2 to R3, and finally the temporary R4 to R2, as you can see in our solution code.

It all worked very smoothly until we turned on the device, which immediately emitted smoke. It did not pass the "smoke test".

As it turned out, the original text also contained R4 registers that we had not noticed, and they were all renamed to R2.

The first idea to improve our converter program was to use some other temporary string that surely cannot be in the code, such as QQRQ, but then we came to the conclusion that there are better ways to solve this.

mv A, R3
mv R2, B
mv R1, R3
mv B1, R4
add A, R1
add B, R1
add R1, R2
add R3, R3
add R21, X
add R12, Y
mv X, R2

import sys
import re

if len(sys.argv) != 2:
    exit(f"Usage: {sys.argv[0]} FILENAME")

filename = sys.argv[1]

with open(filename) as fh:
    code = fh.read()

code = re.sub(r'R1', 'R4', code)
code = re.sub(r'R3', 'R1', code)
code = re.sub(r'R2', 'R3', code)
code = re.sub(r'R4', 'R2', code)

print(code)
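
To make the failure easier to see, here is a minimal sketch that applies the same four substitutions to a short in-memory string instead of a file. The function name rename_with_temp and the four sample lines are assumptions made for this illustration.

```python
import re

# A few lines in the style of the sample file above
code = "mv R1, R3\nmv B1, R4\nadd R1, R2\nadd R21, X\n"

def rename_with_temp(text):
    # The same four steps as above, with R4 as the temporary name
    text = re.sub(r'R1', 'R4', text)
    text = re.sub(r'R3', 'R1', text)
    text = re.sub(r'R2', 'R3', text)
    text = re.sub(r'R4', 'R2', text)
    return text

result = rename_with_temp(code)
print(result)
# The second line becomes "mv B1, R2": the original R4 was clobbered.
# The last line becomes "add R31, X": without \b the R2 inside R21 matched too.
```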

Replace string in Assembly code - using mapping dict

The first improvement was to create a dictionary with the mapping from old string to new string and then have a regex that matches exactly the 3 possible original strings. In the substitution part we have to use a function, as we need the match object to access the current match.

The function can be either a lambda expression, as in the first solution, or a fully defined function, as in the second solution, which I added only to make the first solution easier to understand.

This is a nice and working solution, but it has two issues.

In the regex we used a character class because we assumed that there would only be one-digit registers. If you look at the original Assembly code you can see there are also R12 and R21.

In addition we now have data duplication. If we change the mapping adding a new original string or removing one, we'll also have to remember to update the regex. It is not DRY.

import sys
import re

if len(sys.argv) != 2:
    exit(f"Usage: {sys.argv[0]} FILENAME")

filename = sys.argv[1]

with open(filename) as fh:
    code = fh.read()


mapping = {
    'R1' : 'R2',
    'R2' : 'R3',
    'R3' : 'R1',
}

# Keep the original in "code" so both variants start from the same text
converted = re.sub(r'\b(R[123])\b', lambda match: mapping[match.group(1)], code)
print(converted)

# The same but now with a named function for clarity
def replace(match):
    original = match.group(1)
    return mapping[original]

converted = re.sub(r'\b(R[123])\b', replace, code)
print(converted)

Replace string in Assembly code - using alternatives

We can solve the first issue by changing the regex. Instead of using a character class, we use alternatives (vertical line, aka. pipe) and fully write down the original strings.

The rest of the code is the same and the second issue is not solved yet, we still have to make sure the keys of the dictionary and the values in the regex are the same.

However this solution makes it easier to solve the second issue as well.

import sys
import re

if len(sys.argv) != 2:
    exit(f"Usage: {sys.argv[0]} FILENAME")

filename = sys.argv[1]

with open(filename) as fh:
    code = fh.read()

mapping = {
    'R1'  : 'R2',
    'R2'  : 'R3',
    'R3'  : 'R1',
    'R12' : 'R21',
    'R21' : 'R12',
}

code = re.sub(r'\b(R1|R2|R3|R12|R21)\b', lambda match: mapping[match.group(1)], code)

print(code)

Replace string in Assembly code - generate regex

In this solution we generate the regex from the keys of the mapping dictionary.

Once we have this, we have also opened other opportunities for improvement. Now that the whole replacement mapping comes from a dictionary, we have separated the "data" from the "code". We can now decide to read in the mapping from an Excel file (for example). That way we can hand over the mapping creation to someone who does not know Python. Our code will take that file, read the mapping from the spreadsheet, create the mapping dictionary, create the regex and do the work.

import sys
import re

if len(sys.argv) != 2:
    exit(f"Usage: {sys.argv[0]} FILENAME")

filename = sys.argv[1]

with open(filename) as fh:
    code = fh.read()

mapping = {
    'R1'  : 'R2',
    'R2'  : 'R3',
    'R3'  : 'R1',
    'R12' : 'R21',
    'R21' : 'R12',
}

regex = r'\b(' + '|'.join(mapping.keys()) + r')\b'
print(regex)

code = re.sub(regex, lambda match: mapping[match.group(1)], code)

print(code)
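
If the mapping comes from someone else, the keys may contain characters that have a special meaning in a regex. The following is a sketch of a slightly more defensive way to generate the regex, assuming we want to escape the keys with re.escape and try longer keys first. The helper name build_regex and the sample text are assumptions for this illustration.

```python
import re

def build_regex(mapping):
    # Longest keys first, so that R12 is tried before R1
    keys = sorted(mapping.keys(), key=len, reverse=True)
    # re.escape protects characters such as "." or "+" in the keys
    return r'\b(' + '|'.join(re.escape(key) for key in keys) + r')\b'

mapping = {
    'R1'  : 'R2',
    'R2'  : 'R3',
    'R3'  : 'R1',
    'R12' : 'R21',
    'R21' : 'R12',
}

regex = build_regex(mapping)
print(regex)    # \b(R12|R21|R1|R2|R3)\b

code = "add R1, R2\nadd R21, X\nadd R12, Y\nmv B1, R4\n"
new_code = re.sub(regex, lambda match: mapping[match.group(1)], code)
print(new_code)
```

With \b on both sides the order of the alternatives does not change the result for these keys. Sorting by length only matters when the anchors are dropped or the keys overlap in other ways.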

Full example of previous

import sys
import os
import time
import re

if len(sys.argv) <= 1:
    exit(f"Usage: {sys.argv[0]} INFILEs")

conversion = {
    'R1'  : 'R2',
    'R2'  : 'R3',
    'R3'  : 'R1',
    'R12' : 'R21',
    'R21' : 'R12',
}
#print(conversion)

def replace(mapping, files):
    regex = r'\b(' + '|'.join(mapping.keys()) + r')\b'
    #print(regex)
    ts = time.time()

    for filename in files:
        with open(filename) as fh:
            data = fh.read()
        data = re.sub(regex, lambda match: mapping[match.group(1)], data)
        os.rename(filename, f"{filename}.{ts}")       # backup with current timestamp
        with open(filename, 'w') as fh:
            fh.write(data)

replace(conversion, sys.argv[1:])

Split with regex

fname    =    Foo
lname    = Bar
email=foo@bar.com

import sys
import re

# data: field_value_pairs.txt
if len(sys.argv) != 2:
    exit(f"Usage: {sys.argv[0]} filename")

filename = sys.argv[1]

with open(filename) as fh:
    for line in fh:
        line = line.rstrip("\n")
        field, value = re.split(r'\s*=\s*', line)
        print(f"{value}={field}")

Output:

Foo=fname
Bar=lname
foo@bar.com=email
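
The unpacking into field and value only works if there is exactly one "=" on the line. A minimal sketch, assuming values may themselves contain "=", is to limit the number of splits with the maxsplit parameter of re.split. The sample line is made up for this illustration.

```python
import re

line = "url = https://example.com/?a=1"

# Without maxsplit the line is cut at every "=" and we get 3 parts
parts = re.split(r'\s*=\s*', line)
print(len(parts))   # 3

# With maxsplit=1 only the first "=" is used as separator
field, value = re.split(r'\s*=\s*', line, maxsplit=1)
print(field)        # url
print(value)        # https://example.com/?a=1
```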

Exercises: Regexes part 1

Pick up a file with some text in it. Write a script (one for each item) that prints out every line from the file that matches the requirement. You can use the script at the end of the page as a starting point but you will have to change it!

  • has a 'q'
  • starts with a 'q'
  • has 'th'
  • has a 'q' or a 'Q'
  • has a '*' in it
  • starts with a 'q' or a 'Q'
  • has both 'a' and 'e' in it
  • has an 'a' and somewhere later an 'e'
  • does not have an 'a'
  • does not have an 'a' nor 'e'
  • has an 'a' but not 'e'
  • has at least 2 consecutive vowels (a,e,i,o,u) like in the word "bear"
  • has at least 3 vowels
  • has at least 6 characters
  • has exactly 6 characters
  • all the words with either 'Bar' or 'Baz' in them
  • all the rows with either 'apple pie' or 'banana pie' in them
  • for each row print if it was apple or banana pie?
  • Bonus: Print if the same word appears twice in the same line
  • Bonus: has a double character (e.g. 'oo')

import sys
import re

if len(sys.argv) != 2:
    print("Usage:", sys.argv[0], "FILE")
    exit()

filename = sys.argv[1]
with open(filename, 'r') as fh:
    for line in fh:
        print(line, end=" ")

        match = re.search(r'REGEX1', line)
        if match:
            print("   Matching 1", match.group(0))

        match = re.search(r'REGEX2', line)
        if match:
            print("   Matching 2", match.group(0))

Exercise: Regexes part 2

Write functions that return True if the given value is a

  • Hexadecimal number
  • Octal number
  • Binary number

Write a function that, given a string, returns True if the string is a number. As there might be several definitions of what a number is, create several solutions, one for each definition:

  • Non-negative integer.
  • Integer. (Will you also allow + in front of the number or only - ?)
  • Real number. (Do you allow .3 ? What about 2. ?)
  • In scientific notation. (something like this: 2.123e4 )

23
2.3
2.3.4
2.4e3
abc

Exercise: Sort SNMP numbers

Given a file with SNMP numbers (one number on every line) print them in sorted order comparing the first number of each SNMP number first. If they are equal then comparing the second number, etc...

input:

1.2.7.6
4.5.7.23
1.2.7
1.12.23
2.3.5.7.10.8.9
1.2.7.5

output:

1.2.7
1.2.7.5
1.2.7.6
1.12.23
2.3.5.7.10.8.9
4.5.7.23

Exercise: parse hours log file and create report

The log file looks like this

(see src/examples/regex/timelog.log)

the report should look something like this:

09:20-11:00 Introduction
11:00-11:15 Exercises
11:15-11:35 Break
11:35-12:30 Numbers and strings
12:30-13:30 Lunch Break
13:30-14:10 Exercises
14:10-14:30 Solutions
14:30-14:40 Break
14:40-15:40 Lists
15:40-17:00 Exercises
17:00-17:30 Solutions

09:30-10:30 Lists and Tuples
10:30-10:50 Break
10:50-12:00 Exercises
12:00-12:30 Solutions
12:30-12:45 Dictionaries
12:45-14:15 Lunch Break
14:15-16:00 Exercises
16:00-16:15 Solutions
16:15-16:30 Break
16:30-17:00 Functions
17:00-17:30 Exercises

Break                      65 minutes    6%
Dictionaries               15 minutes    1%
Exercises                 340 minutes   35%
Functions                  30 minutes    3%
Introduction              100 minutes   10%
Lists                      60 minutes    6%
Lists and Tuples           60 minutes    6%
Lunch Break               150 minutes   15%
Numbers and strings        55 minutes    5%
Solutions                  95 minutes    9%

Exercise: Parse ini file

An ini file has sections, each starting with the name of the section in square brackets, and within each section there are key = value pairs with optional spaces around the "=" sign. The keys can only contain letters, numbers, underscore or dash. In addition there can be empty lines and lines starting with # which are comments.

Given a filename, generate a 2-dimensional dictionary and then print it out. Example ini file:

(see src/examples/regex/inifile.ini)

If you print it, it should look like this (except for the nice formatting).

{
    'alpha': {
        'base': 'moon',
        'ship': 'alpha 3'
     },
     'earth': {
         'base': 'London',
         'ship': 'x-wing'
     }
}

Exercise: Replace Python

Write a script called replace_python.py that given a file will replace all occurrences of "Python" or "python" by Java, but will avoid replacing the word in Monty Python. It prints the resulting rows to the screen.

For example given the input:

Just a line without either of the languages.
Line with both Python and python
A line with Monty Python.
And a  line with Monty Python and Python.

Will print:

Just a line without either of the languages.
Line with both Java and java
A line with Monty Python.
And a  line with Monty Python and Java.

Exercise: Extract phone numbers

Given a text message fetch all the phone numbers:
Fetch numbers that look like 09-1234567
then also fetch +972-2-1234567
and maybe also 09-123-4567
This 123 is not a phone number.

Solution: Sort SNMP numbers

import sys

def process(filename):
   snmps = []
   with open(filename) as fh:
       for row in fh:
           snmps.append({
               'orig': row.rstrip(),
           })
   #print(snmps)

   max_number_of_parts = 0
   max_number_of_digits = 0
   for snmp in snmps:
       snmp['split'] = snmp['orig'].split('.')
       max_number_of_parts  = max(max_number_of_parts, len(snmp['split']))
       for part in snmp['split']:
           max_number_of_digits = max(max_number_of_digits, len(part))

   padding = "{:0" + str(max_number_of_digits)  +  "}"
   #print(padding)
   for snmp in snmps:
       padded = []
       padded_split = snmp['split'] + ['0'] * (max_number_of_parts - len(snmp['split']))

       for part in padded_split:
           padded.append(padding.format( int(part)))
       snmp['padded'] = padded
       snmp['joined'] = '.'.join(padded)


   #print(snmps)
   #print(max_number_of_parts)
   #print(max_number_of_digits)

   snmps.sort(key = lambda e: e['joined'])
   sorted_snmps = []
   for snmp in snmps:
       sorted_snmps.append( snmp['orig'] )
   for snmp in sorted_snmps:
      print(snmp)

# get the max number of all the snmp parts
# make each snmp the same length
# pad each part to that length with leading 0s

if len(sys.argv) < 2:
   exit("Usage: {} FILENAME".format(sys.argv[0]))
process(sys.argv[1])
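
The padding approach above works, but it is not the only way. As an alternative sketch, we can convert each SNMP number into a list of integers and let Python compare the lists element by element. The function name sort_snmp is an assumption, and the rows are the input of the exercise.

```python
def sort_snmp(rows):
    # "1.12.23" becomes [1, 12, 23]; lists are compared element by element,
    # and a shorter list sorts before a longer one with the same prefix
    return sorted(rows, key=lambda row: [int(part) for part in row.split('.')])

rows = ["1.2.7.6", "4.5.7.23", "1.2.7", "1.12.23", "2.3.5.7.10.8.9", "1.2.7.5"]
sorted_rows = sort_snmp(rows)
for row in sorted_rows:
    print(row)
```

One difference from the padded version: here 1.2.7 sorts before 1.2.7.0, while the padded version treats the two as equal.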

Solution: parse hours log file and give report

import sys


if len(sys.argv) < 2:
   exit("Usage: {} FILENAME".format(sys.argv[0]))



data = {}

def read_file(filename):
   entries = []
   with open(filename) as fh:
       for row in fh:
           row = row.rstrip("\n")
           if row == '':
               process_day(entries)
               entries = []
               continue
           #print(row)
           time, title = row.split(" ", 1)
           #print(time)
           #print(title)
           #print('')

           entries.append({
               'start': time,
               'title': title,
           })
       process_day(entries)

def process_day(entries):
   for i in range(len(entries)-1):
       start = entries[i]['start']
       title = entries[i]['title']
       end   = entries[i+1]['start']
       print("{}-{} {}".format(start, end, title))

       # manual way to parse timestamp and calculate elapsed time
       # as we have not learned to use the datetime module yet
       start_hour, start_min = start.split(':')
       end_hour, end_min = end.split(':')
       start_in_min = 60*int(start_hour) + int(start_min)
       end_in_min = 60*int(end_hour) + int(end_min)
       elapsed_time = end_in_min - start_in_min
       #print(elapsed_time)

       if title not in data:
           data[title] = 0
       data[title] += elapsed_time


   print('')

def print_summary():
   total = 0
   for val in data.values():
       total += val

   for key in sorted( data.keys() ):
       print("{:20}     {:4} minutes  {:3}%".format(key, data[key], int(100 * data[key]/total)))


read_file( sys.argv[1] )
print_summary()
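
The solution above calculates the elapsed time manually. For comparison, here is a sketch of the same calculation with the datetime module, which the solution deliberately avoids. The function name elapsed_minutes is an assumption, and it assumes both timestamps are on the same day.

```python
from datetime import datetime

def elapsed_minutes(start, end):
    # Parse "HH:MM" strings; subtracting two datetime objects gives a timedelta
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

print(elapsed_minutes("09:20", "11:00"))   # 100
print(elapsed_minutes("12:30", "13:30"))   # 60
```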


Solution: Processing INI file manually


# comment

      # deep comment

outer = 42

[person]
fname = Foo
lname=Bar
phone =    123

[company]
name = Acme Corp.
phone = 456

import sys
import re

# Sample input data.ini

def parse():
    if len(sys.argv) != 2:
        exit("Usage: {} FILENAME".format(sys.argv[0]))
    filename = sys.argv[1]
    data = {}
    # print("Dealing with " + filename)
    with open(filename) as fh:
        section = '__DEFAULT__'
        for line in fh:
            if re.match(r'^\s*(#.*)?$', line):
                continue
            match = re.match(r'^\[([^\]]+)\]\s*$', line)
            if match:
                # print('Section "{}"'.format(match.group(1)))
                section = match.group(1)
                continue
            match = re.match(r'^\s*(.+?)\s*=\s*(.*?)\s*$', line)
            if match:
                # print('field: "{}"  value: "{}"'.format(match.group(1), match.group(2)))
                if not data.get(section):
                    data[section] = {}
                data[section][ match.group(1) ] = match.group(2)

    return data

if __name__ == '__main__':
    ini = parse()
    print(ini)

Solution: Processing config file

  • configparser

[person]
fname = Foo
lname=Bar
phone =    123

# comment

      # deep comment


[company]
name = Acme Corp.
phone = 456

import configparser
import sys

def parse():
  if len(sys.argv) != 2:
    print("Usage: " + sys.argv[0] + "  FILENAME")
    exit()
  filename = sys.argv[1]

  cp = configparser.RawConfigParser()
  cp.read(filename)
  return cp

ini = parse()

for section in ini.sections():
  print(section)
  for v in ini.items(section):
    print("  {}  = {}".format(v[0], v[1]))

Solution: Extract phone numbers

import re

filename = "phone.txt"
with open(filename) as fh:
    for line in fh:
        match = re.search(r'''\b
            (
                \d\d-\d{7}
                |
                \d\d\d-\d-\d{7}
                |
                \d\d-\d\d\d-\d\d\d\d
            )\b''', line, re.VERBOSE)
        if match:
            print(match.group(1))
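
The solution above uses re.search, so it reports at most one phone number per line. A sketch of a variation that collects every number, assuming the same three formats, is to compile the regex once and use findall. The name PHONE and the sample text are assumptions for this illustration.

```python
import re

PHONE = re.compile(r'''\b
    (
        \d\d-\d{7}
        |
        \d\d\d-\d-\d{7}
        |
        \d\d-\d\d\d-\d\d\d\d
    )\b''', re.VERBOSE)

text = "Call 09-1234567 or +972-2-1234567, maybe 09-123-4567. This 123 is not a phone number."

# With a single capturing group findall returns the captured strings
numbers = PHONE.findall(text)
print(numbers)   # ['09-1234567', '972-2-1234567', '09-123-4567']
```

Note that the leading + of the international number is not part of the match, just as in the solution above.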

Regular Expressions Cheat sheet

| Expression | Meaning |
|------------|---------|
| a | Just an 'a' character |
| . | any character except new-line |
| [bgh.] | one of the chars listed in the character class b,g,h or . |
| [b-h] | The same as [bcdefgh] |
| [a-z] | Lower case letters |
| [b-] | The letter b or - |
| [^bx] | Anything except b or x |
| \w | Word characters: [a-zA-Z0-9_] |
| \d | Digits: [0-9] |
| \s | [\f\t\n\r ] form-feed, tab, newline, carriage return and SPACE |
| \W | [^\w] |
| \D | [^\d] |
| \S | [^\s] |
| a* | 0-infinite 'a' characters |
| a+ | 1-infinite 'a' characters |
| a? | 0-1 'a' characters |
| a{n,m} | n-m 'a' characters |
| ( ) | Grouping and capturing |
| | | Alternation |
| \1, \2 | Capture buffers |
| ^ $ | Beginning and end of string anchors |

re

Fix bad JSON

{
   subscriptions : [
      {
         name : "Foo Bar",
         source_name : "pypi",
         space names : [
            "Foo", "Bar"
         ]
      }
   ]
}

import re, json, os

json_file = os.path.join(
    os.path.dirname(__file__),
    'bad.json'
)
with open(json_file) as fh:
    data = json.load(fh)
    # ValueError: Expecting property name: line 2 column 4 (char 5)

import re, json, os

def fix(s):
    return re.sub(r'(\s)([^:\s][^:]+[^:\s])(\s+:)', r'\1"\2"\3', s)

json_file = os.path.join(
    os.path.dirname(__file__),
    'bad.json'
)
with open(json_file) as fh:
    bad_json_rows = fh.readlines()
    json_str = ''.join(map(fix, bad_json_rows))
    print(json_str)
    data = json.loads(json_str)
    print(data)


Fix very bad JSON

[
{
    TID : "t-0_login_sucess"
    Test :
    [
        {SetValue : { uname : "Zorg", pass : "Rules"} },
        {DoAction : "login"},
        {CheckResult: [0, LOGGED_IN]}
    ]
},
{ TID : "t-1_login_failure", Test : [ {SetValue :
{ uname : "11", pass : "im2happy78"} },
{DoAction : "login"}, {CheckResult: [-1000, LOGGED_OUT]} ] }
]

import re, json, os

json_file = os.path.join(
    os.path.dirname(__file__),
    'very_bad.json'
)
with open(json_file, 'r') as fh:
    bad_json = fh.read()
    #print(bad_json)
    improved_json  = re.sub(r'"\s*$', '",', bad_json, flags=re.MULTILINE)
    #print(improved_json)

    # good_json = re.sub(r'(?<!")(?P<word>[\w-]+)\b(?!")', '"\g<word>"',
    #   improved_json)
    # good_json = re.sub(r'(?<[\{\s])(?P<word>[\w-]+)(?=[:\s])', '"\g<word>"',
    #   improved_json)
    # good_json = re.sub(r'([\{\[\s])(?P<word>[\w-]+)([:,\]\s])', '\1"\g<word>"\3',
    #   improved_json)
    good_json = re.sub(r'(?<=[\{\[\s])(?P<word>[\w-]+)(?=[:,\]\s])', r'"\g<word>"',
      improved_json)
    #print(good_json)

# with open('out.js', 'w') as fh:
#     fh.write(good_json)

data = json.loads(good_json)
print(data)

Raw string or escape

  • \
  • r

Let's try to check if a string contains a back-slash.

import re

txt = 'text with slash \ and more text'
print(txt)         # text with slash \ and more text

# m0 = re.search('\', txt)
    # SyntaxError: unterminated string literal
    # (older Python versions: EOL while scanning string literal)

# m0 = re.search('\\', txt)
    # Exception:  re.error: bad escape (end of pattern) at position 0
    # because the regex engine does not know what to do with a single \

m1 = re.search('\\\\', txt)
if m1:
    print('m1')    # m1

m2 = re.search(r'\\', txt)
if m2:
    print('m2')    # m2
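
To see why m1 and m2 behave the same, it can help to look at what the two string literals contain. This is a small sketch; the variable names and the sample path are made up for this illustration.

```python
import re

escaped = '\\\\'   # regular string: 4 characters typed, 2 back-slashes stored
raw = r'\\'        # raw string: 2 characters typed, 2 back-slashes stored

print(len(escaped), len(raw))   # 2 2
print(escaped == raw)           # True

txt = 'C:\\temp'                # the string holds one real back-slash
match = re.search(raw, txt)
print(match.start())            # 2
```

Both literals hand the same two characters to the regex engine, which then reads them as one escaped back-slash.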

Remove spaces regex

This is not necessary as we can use rstrip, lstrip, and replace.

import re

line = "  ab cd  "

res = re.sub(r'^\s+', '', line)  # leading
print(f"'{res}'")

res = re.sub(r'\s+$', '', line)  # trailing
print(f"'{res}'")


# both ends:

res = re.sub(r'\s*(.*)\s*$', r'\1', line)    #  " abc " =>  "abc "  because of the greediness
print(f"'{res}'")

res = re.sub(r'^\s*(.*?)\s*$', r'\1', line)  #  " abc " =>  "abc"   minimal match
print(f"'{res}'")

Regex Unicode

Python 3.8 required


print("\N{GREEK CAPITAL LETTER DELTA}")

print("\u05E9")
print("\u05DC")
print("\u05D5")
print("\u05DD")
print("\u262E")
print("\U0001F426")   # "bird" (code points above FFFF need \U and 8 hex digits)

print("\u05E9\u05DC\u05D5\u05DD \u262E")

Hello World!
Szia Világ!
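!שלום עולם

import re

filename = "mixed.txt"

with open(filename, encoding="utf-8") as fh:
    lines = fh.readlines()
for line in lines:
    # The re module has no named script class, so we match
    # the Hebrew Unicode block (U+0590-U+05FF) as a character range.
    if re.search(r'[\u0590-\u05FF]', line):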
        print(line)
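
Another way to recognize the script of a character, sketched here without any regex, is to look at its Unicode name with the standard unicodedata module. The function name has_hebrew is an assumption, and the sample lines mirror the mixed file above.

```python
import unicodedata

def has_hebrew(text):
    # unicodedata.name() returns names such as "HEBREW LETTER SHIN";
    # the second argument is the fallback for characters without a name
    return any('HEBREW' in unicodedata.name(char, '') for char in text)

lines = [
    "Hello World!",
    "Szia Vil\u00E1g!",
    "\u05E9\u05DC\u05D5\u05DD \u262E",
]

hebrew_lines = [line for line in lines if has_hebrew(line)]
print(len(hebrew_lines))   # 1
```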

Anchors Other example

  • \A
  • \Z
  • ^
  • $

import re

strings = [
    "123-XYZ-456",
    "a 123-XYZ-456 b",
    "a 123-XYZ-456",
    "123-XYZ-456 b",
    "123-XYZ-456\n",
]

regexes = [
    r'\d{3}-\w+-\d{3}',
    r'^\d{3}-\w+-\d{3}',
    r'\d{3}-\w+-\d{3}$',
    r'^\d{3}-\w+-\d{3}$',
    r'^\d{3}-\w+-\d{3}\Z',
    r'\A\d{3}-\w+-\d{3}\Z',
]

for r in regexes:
    print(r)
    for s in strings:
        #print(r, s)
        if (re.search(r, s)):
            print('   ', s)
    print('-' * 10)
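
The interesting case in the list above is the string with the trailing newline. A minimal sketch that isolates the difference between $ and \Z, using the same pattern:

```python
import re

with_newline = "123-XYZ-456\n"

# $ also matches just before a newline at the very end of the string
dollar = re.search(r'^\d{3}-\w+-\d{3}$', with_newline)

# \Z matches only at the real end of the string
strict = re.search(r'\A\d{3}-\w+-\d{3}\Z', with_newline)

# re.fullmatch requires the whole string to match, so it behaves like \A...\Z
full = re.fullmatch(r'\d{3}-\w+-\d{3}', with_newline)

print(bool(dollar))   # True
print(bool(strict))   # False
print(bool(full))     # False
```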