Regular Expressions in Python
What are Regular Expressions (aka. Regexes)?
- An idea on how to match some pattern in some text.
- A tool/language that is available in many places.
- Has many different "dialects"
- Has many different modes of processing.
- The grand concept is the same.
- Uses the following symbols:
() [] {} . * + ? ^ $ | - \ \d \s \w \A \Z \1 \2 \3
What are Regular Expressions good for?
- Decide if a string is part of a larger string.
- Validate the format of some value (string) (e.g. is it a decimal number?, is it a hex?)
- Find if there are repetitions in a string.
- Analyze a string and fetch parts of it given some loose description.
- Cut up a string into parts.
- Change parts of a string.
Examples
Is the input given by the user a number?
(BTW which one is a number: 23, 2.3, .3, 2., 2.3.4.7.12, 2.4e3, abc ?)
Is there a substring in the file that is repeated 3 or more times?
Replace all occurrences of Python or python by Java ...
... but avoid replacing Monty Python.
Given a text message fetch all the phone numbers:
Fetch numbers that look like 09-1234567
then also fetch +972-2-1234567
and maybe also 09-123-4567
but this #456 is not a phone number
Check if in a given text passing your network there are credit card numbers....
Given a text find if the word "password" is in it and fetch the surrounding text.
Given a log file like this:
[Tue Jun 12 00:01:00 2019] - (3423) - INFO - ERROR log restarted
[Tue Jun 12 09:08:17 2019] - (3423) - INFO - System starts to work
[Tue Jun 13 08:07:16 2019] - (3423) - ERROR - Something is wrong
provide statistics on how many of the different levels of log messages
were seen. Separate the log messages into files.
Where can I use it ?
- grep, egrep
- Unix tools such as sed, awk, procmail
- vi, emacs, other editors
- text editors such as Multi-Edit
- .NET languages: C#, C++, VB.NET
- Java
- Perl
- Python
- PHP
- Ruby
- ...
- Word, Open Office ...
- PCRE
grep
grep gets a regex and one or more files. It goes over line-by-line all the files and displays the lines where the regex matched. A few examples:
grep python file.xml # lines that have the string python in them in file.xml.
grep [34] file.xml # lines that have either 3 or 4 (or both) in file.xml.
grep [34] *.xml # lines that have either 3 or 4 (or both) in every xml file.
grep [0-9] *.xml # lines with a digit in them.
egrep '\b[0-9]' *.xml # only highlight digits that are at the beginning of a number.
Regexes first match
import re
text = 'The black cat climbed'
match = re.search(r'lac', text)
if match:
    print("Matching") # Matching
    print(match.group(0)) # lac
match = re.search(r'dog', text)
if match:
    print("Matching")
else:
    print("Did NOT match")
    print(match) # None
The search method returns an object or None, if it could not find any match. If there is a match you can call the group() method. Passing 0 to it will return the actual substring that was matched.
Match numbers
import re
line = 'There is a phone number 12345 in this row and an age: 23'
match = re.search(r'\d+', line)
if match:
print(match.group(0)) # 12345
Use raw strings for regular expressions: r'a\d'. This matters especially because of the backslash (\).
- \d matches a digit.
- + is a quantifier that tells \d to match one or more digits.
The regex matches the first occurrence.
Here the group(0) call is much more interesting than earlier.
Capture
import re
line = 'There is a phone number 12345 in this row and an age: 23'
match = re.search(r'age: \d+', line)
if match:
print(match.group(0)) # age: 23
match = re.search(r'age: (\d+)', line)
if match:
print(match.group(0)) # age: 23
print(match.group(1)) # 23 the first group of parentheses
print(match.groups()) # ('23',)
print(len(match.groups())) # 1
Parentheses in the regular expression can enclose any sub-expression. Whatever this sub-expression matches will be saved and can be accessed using the group() method.
Capture more
import re
line = 'There is a phone number 12345 in this row and an age: 23'
match = re.search(r'(\w+) (\w+): (\d+)', line)
if match:
print(match.group(0)) # an age: 23 the full match
print(match.group(1)) # an the 1st group of parentheses
print(match.group(2)) # age the 2nd group of parentheses
print(match.group(3)) # 23 the 3rd group of parentheses
# print(match.group(4)) # IndexError: no such group
print(match.groups()) # ('an', 'age', '23')
print(len(match.groups())) # 3
Some groups might match the empty string '' or not take part in the match at all. In the latter case we get None from the corresponding match.group() call and in the tuple returned by match.groups().
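To make this concrete, here is a small sketch in which one group does not take part in the match and another one matches the empty string. The pattern and the input string are made up for this illustration.

```python
import re

# Hypothetical pattern: an optional number, a dash, then zero or more word characters.
match = re.search(r'(\d+)?-(\w*)', 'x-')
if match:
    print(match.group(0))   # -
    print(match.group(1))   # None (the optional group did not take part in the match)
    print(match.group(2))   # prints an empty line (the group matched '')
    print(match.groups())   # (None, '')
```

Because None and '' are both false in a boolean context, compare with `is None` when the difference matters.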
Capture even more
import re
line = 'There is a phone number 12345 in this row and an age: 23'
match = re.search(r'((\w+) (\w+)): (\d+)', line)
if match:
print(match.group(0)) # an age: 23
print(match.group(1)) # an age
print(match.group(2)) # an
print(match.group(3)) # age
print(match.group(4)) # 23
print(match.groups()) # ('an age', 'an', 'age', '23')
print(len(match.groups())) # 4
Named capture
import re
line = 'There is a phone number 12345 in this row and an age: 23'
regex = r'((?P<word>\w+) (?P<key>\w+)): (?P<value>\d+)'
match = re.search(regex, line)
if match:
print(match.group(0)) # an age: 23
print(match.group(1)) # an age
print(match.group(2)) # an
print(match.group(3)) # age
print(match.group(4)) # 23
print(match.group('word')) # an
print(match.group('key')) # age
print(match.group('value')) # 23
print(match.groups()) # ('an age', 'an', 'age', '23')
print(len(match.groups())) # 4
matches = re.findall(regex, line)
print(matches) # [('an age', 'an', 'age', '23')]
findall
import re
line1 = 'There is a phone number 12345 in this row and another 42 number'
numbers1 = re.findall(r'\d+', line1)
print(numbers1) # ['12345', '42']
line2 = 'There are no numbers in this row. Not even one.'
numbers2 = re.findall(r'\d+', line2)
print(numbers2) # []
re.findall returns the matched substrings.
findall with capture
import re
line = 'There is a phone number 12345 in this row and another 42 number'
match = re.search(r'\w+ \d+', line)
if match:
print(match.group(0)) # number 12345
match = re.search(r'\w+ (\d+)', line)
if match:
print(match.group(0)) # number 12345
print(match.group(1)) # 12345
matches = re.findall(r'\w+ \d+', line)
print(matches) # ['number 12345', 'another 42']
matches = re.findall(r'\w+ (\d+)', line)
print(matches) # ['12345', '42']
findall with capture more than one
import re
line = 'There is a phone number 12345 in this row and another 42 number'
match = re.search(r'(\w+) (\d+)', line)
if match:
print(match.group(1)) # number
print(match.group(2)) # 12345
matches = re.findall(r'(\w+) (\d+)', line)
print(matches) # [('number', '12345'), ('another', '42')]
If there are multiple capture groups, the returned list consists of tuples.
Any Character
. matches any one character except newline.
For example: #.#
import re
strings = [
    'abc',
    'text: #q#',
    'str: #a#',
    'text #b# more text',
    '#a and this? #c#',
    '#a and this? # c#',
    '#@#',
    '#.#',
    '# #',
    '##',
    '###'
]
for s in strings:
    print('str: ', s)
    match = re.search(r'#.#', s)
    if match:
        print('match:', match.group(0))
If re.DOTALL is given, . will match a newline as well.
Match dot
. \
import re
cases = [
"hello!",
"hello world.",
"hello. world",
".",
]
for case in cases:
print(case)
match = re.search(r'.', case) # Match any character
if match:
print(match.group(0))
print("----")
for case in cases:
print(case)
match = re.search(r'\.', case) # Match a dot
if match:
print(match.group(0))
print("----")
for case in cases:
print(case)
match = re.search(r'[.]', case) # Match a dot
if match:
print(match.group(0))
Character classes
[]
We would like to match any string that has any of the #a#, #b#, #c#, #d#, #e#, #f#, #@# or #.#
import re
strings = [
    'abc',
    'text: #q#',
    'str: #a#',
    'text #b# more text',
    '#ab#',
    '#@#',
    '#.#',
    '# #',
    '##',
    '###'
]
for s in strings:
    print('str: ', s)
    match = re.search(r'#[abcdef@.]#', s)
    if match:
        print('match:', match.group(0))
r'#[abcdef@.]#'
r'#[a-f@.]#'
Common character classes
\d \w \s
- \d digit: [0-9] or Unicode characters in the 'Number, Decimal Digit' category
- \w word character: [a-zA-Z0-9_] (digits, letters, underscore) or the Unicode set of digits and letters
- \s white space: [\f\t\n\r\v ] (form-feed, tab, newline, carriage return, vertical-tab, and SPACE)
Use them stand-alone: \d or as part of a larger character class: [abc\d]
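A short sketch of the last point, a shorthand class used inside a larger character class. The sample string is made up for illustration.

```python
import re

# [abc\d] accepts the letters a, b, c and any digit.
found = re.findall(r'[abc\d]+', 'cab 42 xyz b2b')
print(found)  # ['cab', '42', 'b2b']
```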
Negated character class
\D \W \S
- [^abc] matches any one character that is not 'a', not 'b' and not 'c'.
- \D not digit [^\d]
- \W not word character [^\w]
- \S not white space [^\s]
Character classes summary
a[bc]a # aba, aca
a[2#=x?.]a # a2a, a#a, a=a, axa, a?a, a.a
# inside the character class most of the special characters lose their
# special meaning BUT there are some new special characters
a[2-8]a # is the same as /a[2345678]a/
a[2-]a # a2a, a-a - has no special meaning at the ends
a[-8]a # a8a, a-a
a[6-C]a # a6a, a7a ... aCa
# characters from the ASCII table: 6789:;<=>?@ABC
a[C-6]a # Error: "bad character range"
a[^xa]a # "aba", "aca" but not "aaa", "axa" what about "aa" ?
# ^ as the first character in a character class means
# a character that is not in the list
a[a^x]a # aaa, a^a, axa
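Some of the claims in the summary above can be checked with a few lines of Python. This is only a sketch: the helper name `matches` is made up for this example, and it uses re.fullmatch so that the whole string has to match the pattern.

```python
import re

def matches(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'a[2-]a', 'a-a'))   # True:  a dash at the end of the class is literal
print(matches(r'a[6-C]a', 'a:a'))  # True:  ':' is between '6' and 'C' in the ASCII table
print(matches(r'a[^xa]a', 'aa'))   # False: the class must still consume one character
print(matches(r'a[a^x]a', 'a^a'))  # True:  ^ is literal when it is not the first character

# A reversed range is rejected when the regex is compiled.
try:
    re.compile(r'a[C-6]a')
    range_error = None
except re.error as err:
    range_error = str(err)
print(range_error)
```

This also answers the question about "aa": a negated character class still has to match exactly one character, so a[^xa]a cannot match a two-character string.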
Character classes and Unicode characters
import re
text = "👷👸👹👺👻✍👼👽👾👿💀💁💂"
print(text)
#print(chr(128120))
#print(0x1f000)
match = re.search(r"[\U0001f000-\U00020000]+", text)
if match:
print(match.group(0))
for emoji in text:
print(emoji, ord(emoji), "{:x}".format(ord(emoji)))
match = re.search(r"[👷-💂]*", text)
print(match.group(0))
Character classes for Hebrew text
import re
text = "שלום כיתה א"
print(text) # שלום כיתה א
print(ord(text[-1])) # 1488
print(text[-1]) # א
match = re.search(r"[א-ת]", text)
print(match.group(0)) # ש
match = re.search(r"[א-ת]+", text)
print(match.group(0)) # שלום
match = re.search(r"[ א-ת]*", text)
print(match.group(0)) # שלום כיתה א
match = re.search(r'[\u05d0-\u05ea]+', text) # \u05d0 is 1488 (א), \u05ea is 1514 (ת)
print(match.group(0)) # שלום
# Hebrew has 22 letters, 5 of them have a different version at the end of the word
# A total of 27 letters
for ix in range(1488, 1488+27):
print(f"{ix} {chr(ix)}")
# 1488 א
# 1489 ב
# 1490 ג
# 1491 ד
# 1492 ה
# 1493 ו
# 1494 ז
# 1495 ח
# 1496 ט
# 1497 י
# 1498 ך
# 1499 כ
# 1500 ל
# 1501 ם
# 1502 מ
# 1503 ן
# 1504 נ
# 1505 ס
# 1506 ע
# 1507 ף
# 1508 פ
# 1509 ץ
# 1510 צ
# 1511 ק
# 1512 ר
# 1513 ש
# 1514 ת
Match digits
import re
values = [
'2',
'٣', # Arabic 3
'½', # unicode 1/2
'②', # unicode circled 2
'߄', # NKO 4 (a writing system for the Manding languages of West Africa)
'६', # Devanagari aka. Nagari (Indian)
'_', # underscore
'-', # dash
'a', # Latin a
'á', # Hungarian
'א', # Hebrew aleph
]
for val in values:
print(val)
match = re.search(r'\d', val)
if match:
print('Match ', match.group(0))
match = re.search(r'\d', val, re.ASCII)
if match:
print('Match ASCII ', match.group(0))
Output:
2
Match 2
Match ASCII 2
٣
Match ٣
½
②
߄
Match ߄
६
Match ६
_
-
a
á
א
Word Characters
import re
values = [
'2',
'٣', # Arabic 3
'½', # unicode 1/2
'②', # unicode circled 2
'߄', # NKO 4 (a writing system for the Manding languages of West Africa)
'६', # Devanagari aka. Nagari (Indian)
'_', # underscore
'-', # dash
'a', # Latin a
'á', # Hungarian
'א', # Hebrew aleph
]
for val in values:
print(val)
match = re.search(r'\w', val)
if match:
print('Match ', match.group(0))
match = re.search(r'\w', val, re.ASCII)
if match:
print('Match ASCII ', match.group(0))
Output:
2
Match 2
Match ASCII 2
٣
Match ٣
½
Match ½
②
Match ②
߄
Match ߄
६
Match ६
_
Match _
Match ASCII _
-
a
Match a
Match ASCII a
á
Match á
א
Match א
Exercise: add numbers
Given a file like this:
Foo:1
Foo:2
Foo:3
Foo:4
Foo:5
Foo:6
Foo:7
Foo:8
Bar:23
Foo:23
Foo:11
Foo:9
Bar:8
Zorg:7
- Add up the scores for each name and print the result.
Foo : 79
Bar : 31
Zorg : 7
- Make it work also on a file that looks like this:
# Let's start with Foo:1
Foo:1
Foo: 2
Foo :3
Foo : 4
Foo:5
Foo: 6
Foo :7
Foo : 8
# Let's start Bar with : 23
Bar:23
Foo: 23
Foo: 11
Foo : 9
Bar: 8
Zorg: 7
Solution: add numbers
import sys
def add_grades(filename):
    grades = {}
    with open(filename) as fh:
        for line in fh:
            line = line.rstrip("\n")
            name, grade = line.split(":")
            if name not in grades:
                grades[name] = 0
            grades[name] += int(grade)
    for name in sorted(grades.keys(), key=lambda name: grades[name], reverse=True):
        print(f"{name:6}:{grades[name]:-3}")
if __name__ == '__main__':
    if len(sys.argv) != 2:
        exit(f"Usage: {sys.argv[0]} FILENAME")
    filename = sys.argv[1]
    add_grades(filename)
Solution: add numbers
import sys
def add_grades(filename):
    grades = {}
    with open(filename) as fh:
        for line in fh:
            line = line.rstrip("\n")
            line = line.strip()
            if line.startswith("#"):
                continue
            if line == '':
                continue
            name, grade = line.split(":")
            name = name.strip()
            if name not in grades:
                grades[name] = 0
            grades[name] += int(grade)
    for name in sorted(grades.keys(), key=lambda name: grades[name], reverse=True):
        print(f"{name:6}:{grades[name]:-3}")
if __name__ == '__main__':
    if len(sys.argv) != 2:
        exit(f"Usage: {sys.argv[0]} FILENAME")
    filename = sys.argv[1]
    add_grades(filename)
Solution: add numbers
import re
import sys
def add_grades(filename):
    grades = {}
    with open(filename) as fh:
        for line in fh:
            # skip empty rows and comment rows
            if re.search(r'^\s*(#.*)?$', line):
                continue
            match = re.search(r'^\s*(\w+)\s*:\s*(\d+)\s*$', line)
            if match:
                name = match.group(1)
                value = int(match.group(2))
            else:
                raise Exception(f"Invalid row: '{line}'")
            if name not in grades:
                grades[name] = 0
            grades[name] += value
    for name in sorted(grades.keys(), key=lambda name: grades[name], reverse=True):
        print(f"{name:6}:{grades[name]:-3}")
if __name__ == '__main__':
    if len(sys.argv) != 2:
        exit(f"Usage: {sys.argv[0]} FILENAME")
    filename = sys.argv[1]
    add_grades(filename)
Optional character
- ?
Match the word color or the word colour
Regex: r'colou?r'
Input: color
Input: colour
Input: colouur
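The three inputs above can be tried with a few lines of code. This is a minimal sketch; the variable names are made up for the example.

```python
import re

results = {}
for word in ['color', 'colour', 'colouur']:
    # u? allows zero or one "u" between "colo" and "r"
    results[word] = bool(re.search(r'colou?r', word))
    print(word, results[word])
# color True
# colour True
# colouur False
```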
Regex match 0 or more (the * quantifier)
Any line with two dashes with anything (or nothing) in between.
Regex: r'-.*-'
Input: "ab"
Input: "ab - cde"
Input: "ab - qqqrq -"
Input: "ab -- cde"
Input: "--"
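A minimal sketch that runs the regex on the inputs listed above and shows what was matched. The dictionary `matched` is only there to collect the results.

```python
import re

inputs = ["ab", "ab - cde", "ab - qqqrq -", "ab -- cde", "--"]
matched = {}
for text in inputs:
    match = re.search(r'-.*-', text)
    matched[text] = match.group(0) if match else None
    print(repr(text), '->', repr(matched[text]))
# 'ab' -> None
# 'ab - cde' -> None
# 'ab - qqqrq -' -> '- qqqrq -'
# 'ab -- cde' -> '--'
# '--' -> '--'
```

The last two inputs match because .* is allowed to match nothing at all.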
Quantifiers
- *
- +
- ?
Quantifiers apply to the thing immediately to the left of them.
In the examples below that is the single character x, but later we'll see that a quantifier can also apply to a character class or to a sub-expression enclosed in parentheses.
r'ax*a' # aa, axa, axxa, axxxa, ...
r'ax+a' # axa, axxa, axxxa, ...
r'ax?a' # aa, axa
r'ax{2,4}a' # axxa, axxxa, axxxxa
r'ax{3,}a' # axxxa, axxxxa, ...
r'ax{17}a' # axxxxxxxxxxxxxxxxxa
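| Quantifier | Number of repetitions |
|------------|-----------------------|
| * | 0 or more |
| + | 1 or more |
| ? | 0 or 1 |
| {n,m} | between n and m |
| {n,} | n or more |
| {n} | exactly n |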
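The following sketch checks these repetition counts. The helper `matching_counts` is a made-up name for this example; it builds the strings aa, axa, axxa, ... and reports for how many x characters the whole string matches.

```python
import re

def matching_counts(pattern, max_count=6):
    """Return the numbers n (0..max_count) for which 'a' + n * 'x' + 'a' fully matches."""
    return [n for n in range(max_count + 1)
            if re.fullmatch(pattern, 'a' + 'x' * n + 'a')]

print(matching_counts(r'ax*a'))      # [0, 1, 2, 3, 4, 5, 6]
print(matching_counts(r'ax+a'))      # [1, 2, 3, 4, 5, 6]
print(matching_counts(r'ax?a'))      # [0, 1]
print(matching_counts(r'ax{2,4}a'))  # [2, 3, 4]
print(matching_counts(r'ax{3,}a'))   # [3, 4, 5, 6]
print(matching_counts(r'ax{2}a'))    # [2]
```

re.fullmatch is used instead of re.search so that a longer run of x characters is not accepted by accident.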
Quantifiers limit
import re
strings = (
"axxxa",
"axxxxa",
"axxxxxa",
)
for text in strings:
match = re.search(r'ax{4}', text)
if match:
print(f"Match {text}")
print(match.group(0))
else:
print("NOT Match")
Quantifiers on character classes
import re
strings = (
"-a-",
"-b-",
"-x-",
"-aa-",
"-ab-",
"--",
)
for line in strings:
match = re.search(r'-[abc]-', line)
if match:
print(line)
print('=========================')
for line in strings:
match = re.search(r'-[abc]+-', line)
if match:
print(line)
print('=========================')
for line in strings:
match = re.search(r'-[abc]*-', line)
if match:
print(line)
Greedy quantifiers
import re
match = re.search(r'xa*', 'xaaab')
print(match.group(0))
match = re.search(r'xa*', 'xabxaab')
print(match.group(0))
match = re.search(r'a*', 'xabxaab')
print(match.group(0))
match = re.search(r'a*', 'aaaxabxaab')
print(match.group(0))
They match 'xaaa', 'xa', '' and 'aaa' respectively.
Minimal quantifiers
import re
match = re.search(r'a.*b', 'axbzb')
print(match.group(0))
match = re.search(r'a.*?b', 'axbzb')
print(match.group(0))
match = re.search(r'a.*b', 'axy121413413bq')
print(match.group(0))
match = re.search(r'a.*?b', 'axyb121413413q')
print(match.group(0))
Anchors
- \A
- \Z
- ^
- $
- \A matches the beginning of the string
- \Z matches the end of the string
- ^ matches the beginning of the row (see also re.MULTILINE)
- $ matches the end of the row but will accept a trailing newline (see also re.MULTILINE)
import re
lines = [
"text with cat in the middle",
"cat with dog",
"dog with cat",
]
for line in lines:
if re.search(r'cat', line):
print(line)
print("---")
for line in lines:
if re.search(r'^cat', line):
print(line)
print("---")
for line in lines:
if re.search(r'\Acat', line):
print(line)
print("---")
for line in lines:
if re.search(r'cat$', line):
print(line)
print("---")
for line in lines:
if re.search(r'cat\Z', line):
print(line)
Output:
text with cat in the middle
cat with dog
dog with cat
---
cat with dog
---
cat with dog
---
dog with cat
---
dog with cat
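In the example above $ and \Z behave the same because none of the strings ends with a newline. The difference shows up on a line that still has its trailing newline, for example one read from a file. A minimal sketch, with variable names made up for the example:

```python
import re

line = "dog with cat\n"  # a line as read from a file, newline included

dollar = bool(re.search(r'cat$', line))
big_z = bool(re.search(r'cat\Z', line))
big_z_stripped = bool(re.search(r'cat\Z', line.rstrip("\n")))

print(dollar)          # True:  $ accepts the trailing newline
print(big_z)           # False: \Z insists on the very end of the string
print(big_z_stripped)  # True:  after removing the newline \Z matches too
```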
Anchors with multiline
import re
text = """
text with cat in the middle
cat with dog
dog with cat"""
if re.search(r'dog', text):
    print(text)
print("---")
if re.search(r'^dog', text):
    print('Caret dog')
print("---")
if re.search(r'\Adog', text):
    print('A dog')
print("---")
if re.search(r'dog$', text):
    print('$ dog')
print("---")
if re.search(r'dog\Z', text):
    print('Z dog')
print("-----------------")
if re.search(r'^dog', text, re.MULTILINE):
    print('^ dog')
print("---")
if re.search(r'\Adog', text, re.MULTILINE):
    print('A dog')
print("---")
if re.search(r'dog$', text, re.MULTILINE):
    print('$ dog')
print("---")
if re.search(r'dog\Z', text, re.MULTILINE):
    print('Z dog')
Anchors on both ends
import re
strings = [
"123",
"hello 456 world",
"hello world",
]
for line in strings:
if re.search(r'\d+', line):
print(line)
print('---')
for line in strings:
if re.search(r'^\d+$', line):
print(line)
print('---')
for line in strings:
if re.search(r'\A\d+\Z', line):
print(line)
Output:
123
hello 456 world
---
123
---
123
Match ISBN numbers
import re
strings = [
'99921-58-10-7',
'9971-5-0210-0',
'960-425-059-0',
'80-902734-1-6',
'85-359-0277-5',
'1-84356-028-3',
'0-684-84328-5',
'0-8044-2957-X',
'0-85131-041-9',
'0-943396-04-2',
'0-9752298-0-X',
'0-975229-1-X',
'0-9752298-10-X',
'0-9752298-0-Y',
'910975229-0-X',
'-------------',
'0000000000000',
'3-3-3-X',
]
for isbn in strings:
print(isbn)
if (re.search(r'^[0-9X-]{13}$', isbn)):
print("match 1")
if (len(isbn) == 13 and re.search(r'^[0-9]{1,5}-[0-9]{1,7}-[0-9]{1,5}-[0-9X]$', isbn)):
print("match 2")
Matching a section
import re
text = "This is <a string> with some <sections> marks."
m = re.search(r'<.*>', text)
if m:
print(m.group(0))
Matching a section - minimal
import re
text = "This is <a string> with some <sections> marks."
m = re.search(r'<.*?>', text)
if m:
print(m.group(0))
Matching a section negated character class
import re
text = "This is <a string> with some <sections> marks."
m = re.search(r'<[^>]*>', text)
if m:
print(m.group(0))
Two regex with logical or
All the rows with either 'apple pie' or 'banana pie' in them.
import re
strings = [
'apple pie',
'banana pie',
'apple'
]
for s in strings:
#print(s)
match1 = re.search(r'apple pie', s)
match2 = re.search(r'banana pie', s)
if match1 or match2:
print('Matched in', s)
Output:
Matched in apple pie
Matched in banana pie
Alternatives
- |
import re
strings = [
'apple pie',
'banana pie',
'apple'
]
for line in strings:
match = re.search(r'apple pie|banana pie', line)
if match:
print('Matched in', line)
Output:
Matched in apple pie
Matched in banana pie
Grouping and Alternatives
- ()
Move the common part in one place and limit the alternation to the part within the parentheses.
import re
strings = [
'apple pie',
'banana pie',
'apple'
]
for line in strings:
match = re.search(r'(apple|banana) pie', line)
if match:
print('Matched in', line)
Output:
Matched in apple pie
Matched in banana pie
Internal variables
- \1
- \2
- \3
- \4
import re
strings = [
'banana',
'apple',
'infinite loop',
]
for line in strings:
match = re.search(r'(.)\1', line)
if match:
print(match.group(0), 'matched in', line)
print(match.group(1))
Output:
pp matched in apple
p
oo matched in infinite loop
o
More internal variables
- \1
import re
line = "one 123 and two 123 and oxxo 23"
match = re.search(r"(.)(.)\2\1", line)
if match:
print(match.group(1)) # o
print(match.group(2)) # x
match = re.search(r"(\d\d).*\1", line)
if match:
print(match.group(1)) # 12
match = re.search(r"(\d\d).*\1.*\1", line)
if match:
print(match.group(1)) # 23
match = re.search(r"(\d\d).*\1{2,3}", "45 afjh 4545 kjdhfkh")
if match:
print(match.group(1)) # 45
# (.{5}).*\1
Regex DNA
- DNA is built from G, A, T, C
- Let's create a random DNA sequence
- Then find the longest repeated sequence in it
import re
import random
chars = ['G', 'A', 'T', 'C']
dna = 'AT'
for i in range(100):
dna += random.choice(chars)
print(dna)
# finds the first one, not necessarily the longest one
match = re.search(r"([GATC]+).*\1", dna)
if match:
print(match.group(1))
'''
Generating regexes:
([GATC]{1}).*\1
([GATC]{2}).*\1
([GATC]{3}).*\1
([GATC]{4}).*\1
...
'''
length = 1
result = ''
# try longer and longer repeated sequences until none is found
while True:
    regex = r'([GATC]{' + str(length) + r'}).*\1'
    #print(regex)
    m = re.search(regex, dna)
    if m:
        result = m.group(1)
        length += 1
    else:
        break
print(result)
print(len(result))
Output:
ATTTTATTGAGTCCTCTCGTGTGGTGTGATTGGGTGCAATTACCCCAAGGGCTCAAGTAATTCCCACATGATGATCAATGAGAGACCTGAATTAGCCATGCA
ATT
GTGTG
5
Regex IGNORECASE
- IGNORECASE
- I
import re
s = 'Python'
if (re.search('python', s)):
print('python matched')
if (re.search('python', s, re.IGNORECASE)):
print('python matched with IGNORECASE')
DOTALL S (single line)
- .
- DOTALL
- S
If re.DOTALL is given, . will match any character, including newlines.
import re
line = 'Before <div>content</div> After'
text = '''
Before
<div>
content
</div>
After
'''
match = re.search(r'<div>.*</div>', line)
if match:
    print(f"line '{match.group(0)}'")
match = re.search(r'<div>.*</div>', text)
if match:
    print(f"text '{match.group(0)}'")
print('-' * 10)
match = re.search(r'<div>.*</div>', line, re.DOTALL)
if match:
    print(f"line '{match.group(0)}'")
match = re.search(r'<div>.*</div>', text, re.DOTALL)
if match:
    print(f"text '{match.group(0)}'")
MULTILINE M
- ^
- $
- MULTILINE
- M
If re.MULTILINE is given, ^ will match at the beginning of every line and $ will match at the end of every line.
import re
line = 'Start blabla End'
text = '''
prefix
Start
blabla
End
postfix
'''
regex = r'^Start[\d\D]*End$'
m = re.search(regex, line)
if (m):
print('line')
m = re.search(regex, text)
if (m):
print('text')
print('-' * 10)
m = re.search(regex, line, re.MULTILINE)
if (m):
print('line')
m = re.search(regex, text, re.MULTILINE)
if (m):
print('text')
Output:
line
----------
line
text
Combine Regex flags
- |
Use the bitwise or operator (|) to combine flags:
re.MULTILINE | re.DOTALL
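A minimal sketch of why you might need both flags at once. The text and the regex are made up for this example: ^ and $ need re.MULTILINE to match inside the text, and .* needs re.DOTALL to cross the line breaks.

```python
import re

text = "prefix\nStart\nblabla\nEnd\npostfix\n"
regex = r'^Start.*End$'

only_multiline = re.search(regex, text, re.MULTILINE)
only_dotall = re.search(regex, text, re.DOTALL)
both = re.search(regex, text, re.MULTILINE | re.DOTALL)

print(only_multiline)        # None: .* cannot cross the newline
print(only_dotall)           # None: ^ only matches at the very beginning of the text
print(repr(both.group(0)))   # 'Start\nblabla\nEnd'
```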
Regex VERBOSE X
- VERBOSE
- X
import re
email = "foo@bar.com"
m = re.search(r'\w[\w.-]*\@([\w-]+\.)+(com|net|org|uk|hu|il)', email)
if (m):
print('match 1')
# To make the regex more readable we can break it into rows and add comments:
m = re.search(r'''
\w[\w.-]* # username
\@
([\w-]+\.)+ # domain
(com|net|org|uk|hu|il) # gTLD
''', email, re.VERBOSE)
if (m):
print('match 2')
# Improvement to make the code *after* the regex more readable using named captures
m = re.search(r'(?P<username>\w[\w.-]*)\@(?P<domain>[\w-]+\.)+(?P<gtld>com|net|org|uk|hu|il)', email)
# Both, use named captures and also break it up to rows.
m = re.search(r'''
(?P<username>\w[\w.-]*)
\@
(?P<domain>[\w-]+\.)+
(?P<gtld>com|net|org|uk|hu|il) # I only handle a few because this is just an example
''', email, re.VERBOSE)
Substitution
import re
line = "abc123def"
print(re.sub(r'\d+', ' ', line)) # "abc def"
print(line) # "abc123def"
print(re.sub(r'x', ' y', line)) # "abc123def"
print(line) # "abc123def"
print(re.sub(r'([a-z]+)(\d+)([a-z]+)', r'\3\2\1', line)) # "def123abc"
print(re.sub(r'''
([a-z]+) # letters
(\d+) # digits
([a-z]+) # more letters
''', r'\3\2\1', line, flags=re.VERBOSE)) # "def123abc"
print(re.sub(r'...', 'x', line)) # "xxx"
print(re.sub(r'...', 'x', line, count=1)) # "x123def"
print(re.sub(r'(.)(.)', r'\2\1', line)) # "ba1c32edf"
print(re.sub(r'(.)(.)', r'\2\1', line, count=2)) # "ba1c23def"
Substitution and MULTILINE - remove leading spaces
import re
text = """ First row
Second row
Third row
"""
print(text)
print('-----')
print(re.sub(r'^\s+', '', text))
print('-----')
print(re.sub(r'\A\s+', '', text, flags=re.MULTILINE))
print('-----')
print(re.sub(r'^\s+', '', text, flags=re.MULTILINE))
Output:
First row
Second row
Third row
-----
First row
Second row
Third row
-----
First row
Second row
Third row
-----
First row
Second row
Third row
findall capture
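If there are two or more capture groups in the regex, findall returns a list of tuples. With exactly one capture group it returns a list of the captured strings, and without any group it returns the full matches.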
import re
line = 'There is a phone number 83795 in this row and another 42 number'
print(line)
search = re.search(r'(\d)(\d)', line)
if search:
print(search.group(1)) # 8
print(search.group(2)) # 3
matches = re.findall(r'(\d)(\d)', line)
if matches:
print(matches) # [('8', '3'), ('7', '9'), ('4', '2')]
matches = re.findall(r'(\d)\D*', line)
if matches:
    print(matches) # ['8', '3', '7', '9', '5', '4', '2']
matches = re.findall(r'(\d)\D*(\d?)', line)
print(matches) # [('8', '3'), ('7', '9'), ('5', '4'), ('2', '')]
matches = re.findall(r'(\d).*?(\d)', line)
print(matches) # [('8', '3'), ('7', '9'), ('5', '4')]
matches = re.findall(r'(\d+)\D+(\d+)', line)
print(matches) # [('83795', '42')]
matches = re.findall(r'(\d+).*?(\d+)', line)
print(matches) # [('83795', '42')]
matches = re.findall(r'\d', line)
print(matches) # ['8', '3', '7', '9', '5', '4', '2']
Fixing dates
In the input we get dates like this 2010-7-5 but we would like to make sure we have two digits for both days and months: 2010-07-05
import re
def test_date(function):
    dates = {
        '2010-7-5' : '2010-07-05',
        '2010-11-5' : '2010-11-05',
        '2010-07-5' : '2010-07-05',
        '2010-07-05' : '2010-07-05',
        '2010-7-15' : '2010-07-15',
    }
    failures = 0
    for original in sorted(dates.keys()):
        result = function(original)
        if result != dates[original]:
            failures += 1
            print(f" old: {original}")
            print(f" new: {result}")
            print(f" expected: {dates[original]}")
            print("")
    if failures == 0:
        print("Everything looks good")
    else:
        exit(failures)
Fixing dates - 1
from date import test_date
import re
def fix_date1(date):
return re.sub(r'-(\d)', r'-0\1', date)
test_date(fix_date1)
Output:
old: 2010-07-05
new: 2010-007-005
expected: 2010-07-05
old: 2010-07-5
new: 2010-007-05
expected: 2010-07-05
old: 2010-11-5
new: 2010-011-05
expected: 2010-11-05
old: 2010-7-15
new: 2010-07-015
expected: 2010-07-15
Fixing dates - 2
from date import test_date
import re
def fix_date2(date):
return re.sub(r'-(\d)-', r'-0\1', date)
test_date(fix_date2)
Output:
old: 2010-07-5
new: 2010-07-5
expected: 2010-07-05
old: 2010-11-5
new: 2010-11-5
expected: 2010-11-05
old: 2010-7-15
new: 2010-0715
expected: 2010-07-15
old: 2010-7-5
new: 2010-075
expected: 2010-07-05
Fixing dates - 3
from date import test_date
import re
def fix_date3(date):
return re.sub(r'-(\d)(-|$)', r'-0\1\2', date)
test_date(fix_date3)
Output:
old: 2010-7-5
new: 2010-07-5
expected: 2010-07-05
Fixing dates - 4
from date import test_date
import re
def fix_date4(date):
return re.sub(r'-(\d)\b', r'-0\1', date)
test_date(fix_date4)
Output:
Everything looks good
Anchor edge of word
- \b
- \b matches at the beginning of a word or at the end of a word
import re
text = 'x a xyb x-c qwer_  ut!@y' # two spaces between qwer_ and ut
print(re.findall(r'.\b.', text))
print(re.findall(r'\b.', 'a b '))
print(re.findall(r'.\b', 'a b '))
Output:
['x ', 'a ', 'b ', 'x-', 'c ', '_ ', ' u', 't!', '@y']
['a', ' ', 'b', ' ']
['a', ' ', 'b']
Double numbers
We have a string that has some numbers in it. We would like to double the numbers.
In the first example we can see a relatively simple way of doubling the numbers. We capture a number using the (\d+) expression that will save the current number in \1 and then we include it twice: \1\1. This will convert a number like 1 to 11.
This is nice, but probably not what we wanted. We wanted to convert 1 to 2 and 34 to 68.
We can't do that with plain regular expressions and substitutions as that is all string-based. The plain substitution can only move characters around; it cannot perform any complex operations on them and thus cannot compute anything.
However, if the substitution part is a function then Python will call that function, passing in the match object, and whatever the function returns will be the replacement string. This function can be a regular function defined with def or a lambda expression.
In the second example we see the solution with a lambda expression.
The third example is the same solution but written step by step with lots of temporary variables. This will hopefully help you understand the lambda expression in the second example.
import re
text = "This is 1 string with 3 numbers: 34"
new_text = re.sub(r'(\d+)', r'\1\1', text)
print(new_text) # This is 11 string with 33 numbers: 3434
double_numbers = re.sub(r'(\d+)', lambda match: str(2 * int(match.group(0))), text)
print(double_numbers) # This is 2 string with 6 numbers: 68
# The same but in a function
def double(match):
matched_number_as_str = match.group(0)
number = int(matched_number_as_str)
doubled_number = 2 * number
doubled_number_as_str = str(doubled_number)
return doubled_number_as_str
double_numbers = re.sub(r'(\d+)', double, text)
print(double_numbers) # This is 2 string with 6 numbers: 68
Remove spaces
line = " ab cd "
res = line.lstrip(" ")
print(f"'{res}'") # 'ab cd '
res = line.rstrip(" ")
print(f"'{res}'") # ' ab cd'
res = line.strip(" ")
print(f"'{res}'") # 'ab cd'
res = line.replace(" ", "")
print(f"'{res}'") # 'abcd'
Replace string in Assembly code
At a company we had some Assembly code that looked like the text file below. In case you don't know, Assembly is a very low-level programming language. Here we have some sample code that uses variables like A and registers like R1, R2, and R3.
We had this code, but because the hardware changed we had to change the code and rename the registers: R1 to R2, R2 to R3, and R3 to R1. We cannot simply do these steps one after the other because once we have renamed R1 to R2 we won't know which of the R2-s need to be renamed and which are new. So someone came up with the idea to use R4 as a temporary name: start by renaming R1 to R4, then R3 to R1, R2 to R3, and finally the temporary R4 to R2.
You can see this in our solution code.
It all worked very smoothly until we turned on the device, which immediately emitted smoke. It did not pass the "smoke test".
As it turned out, the original text also contained R4 registers that we had not noticed, and they were all renamed to R2.
The first idea to improve our converter program was to use some other temporary string that for sure cannot be in the code, such as QQRQ, but then we came to the conclusion that there are better ways to solve this.
mv A, R3
mv R2, B
mv R1, R3
mv B1, R4
add A, R1
add B, R1
add R1, R2
add R3, R3
add R21, X
add R12, Y
mv X, R2
import sys
import re
if len(sys.argv) != 2:
exit(f"Usage: {sys.argv[0]} FILENAME")
filename = sys.argv[1]
with open(filename) as fh:
code = fh.read()
code = re.sub(r'R1', 'R4', code)
code = re.sub(r'R3', 'R1', code)
code = re.sub(r'R2', 'R3', code)
code = re.sub(r'R4', 'R2', code)
print(code)
Replace string in Assembly code - using mapping dict
The first improvement was to create a dictionary with the mapping from old string to new string and then have a regex that matches exactly the 3 possible original strings. In the substitution part we have to use a function as we need the current match object to access the current match.
The function can be either a lambda expression as in the first solution or a fully defined function as in the second solution, which I added only to make the first solution easier to understand.
This is a nice and working solution, but it has two issues.
In the regex we used a character class because we assumed that there are only going to be one-digit registers. If you look at the original Assembly code you can see there are also R12 and R21.
In addition we now have data duplication. If we change the mapping by adding a new original string or removing one, we'll also have to remember to update the regex. It is not DRY.
import sys
import re
if len(sys.argv) != 2:
    exit(f"Usage: {sys.argv[0]} FILENAME")
filename = sys.argv[1]
with open(filename) as fh:
    code = fh.read()
mapping = {
    'R1' : 'R2',
    'R2' : 'R3',
    'R3' : 'R1',
}
code = re.sub(r'\b(R[123])\b', lambda match: mapping[match.group(1)], code)
print(code)
# The same but now with a named function for clarity
# (use only one of the two calls: running both converts the already converted code again)
def replace(match):
    original = match.group(1)
    return mapping[original]
code = re.sub(r'\b(R[123])\b', replace, code)
print(code)
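A small sketch of the first issue, assuming a three-line sample in the style of the Assembly code. The character class together with the word boundaries handles R1, R2 and R3, but leaves R12 and R21 untouched.

```python
import re

mapping = {'R1': 'R2', 'R2': 'R3', 'R3': 'R1'}
sample = "add R1, R2\nadd R21, X\nadd R12, Y"
result = re.sub(r'\b(R[123])\b', lambda match: mapping[match.group(1)], sample)
print(result)
```

This is safer than the chained renaming (R21 is no longer damaged), but two-digit registers cannot be added to the mapping without changing the regex too.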
Replace string in Assembly code - using alternatives
We can solve the first issue by changing the regex. Instead of using a character class, we use alternatives (vertical line, aka. pipe) and fully write down the original strings.
The rest of the code is the same and the second issue is not solved yet, we still have to make sure the keys of the dictionary and the values in the regex are the same.
However this solution makes it easier to solve the second issue as well.
import sys
import re
if len(sys.argv) != 2:
    exit(f"Usage: {sys.argv[0]} FILENAME")
filename = sys.argv[1]
with open(filename) as fh:
    code = fh.read()
mapping = {
    'R1' : 'R2',
    'R2' : 'R3',
    'R3' : 'R1',
    'R12' : 'R21',
    'R21' : 'R12',
}
code = re.sub(r'\b(R1|R2|R3|R12|R21)\b', lambda match: mapping[match.group(1)], code)
print(code)
Replace string in Assembly code - generate regex
In this solution we generate the regex from the keys of the mapping dictionary.
Once we have this, other opportunities for improvement open up. Now that all the replacement mapping comes from a dictionary we have separated the "data" from the "code". We can now decide to read in the mapping from an Excel file (for example). That way we can hand over the mapping creation to someone who does not know Python. Our code will take that file, read the mapping from the spreadsheet, create the mapping dictionary, create the regex and do the work.
import sys
import re
if len(sys.argv) != 2:
    exit(f"Usage: {sys.argv[0]} FILENAME")
filename = sys.argv[1]
with open(filename) as fh:
    code = fh.read()
mapping = {
    'R1' : 'R2',
    'R2' : 'R3',
    'R3' : 'R1',
    'R12' : 'R21',
    'R21' : 'R12',
}
regex = r'\b(' + '|'.join(mapping.keys()) + r')\b'
print(regex)
code = re.sub(regex, lambda match: mapping[match.group(1)], code)
print(code)
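Keeping the mapping outside of the code could look something like the following sketch. It assumes a CSV file with two columns (old name, new name) instead of an Excel file, because the csv module is part of the standard library. The helper names load_mapping and build_regex and the file content are hypothetical, and io.StringIO stands in for the real file. re.escape is an addition so that keys containing regex special characters would not break the generated regex.

```python
import csv
import io
import re

def load_mapping(fh):
    # Each row is expected to hold exactly two cells: old name, new name
    return {old: new for old, new in csv.reader(fh)}

def build_regex(mapping):
    # Escape every key so it is matched literally
    return r'\b(' + '|'.join(re.escape(key) for key in mapping) + r')\b'

# Stand-in for a real file such as a hypothetical mapping.csv
fh = io.StringIO("R1,R2\nR2,R3\nR3,R1\nR12,R21\nR21,R12\n")
mapping = load_mapping(fh)
regex = build_regex(mapping)

code = "add R1, R2\nadd R21, X\nadd R12, Y"
converted = re.sub(regex, lambda match: mapping[match.group(1)], code)
print(converted)
```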
Full example of previous
import sys
import os
import time
import re
if len(sys.argv) <= 1:
    exit(f"Usage: {sys.argv[0]} INFILEs")
conversion = {
    'R1' : 'R2',
    'R2' : 'R3',
    'R3' : 'R1',
    'R12' : 'R21',
    'R21' : 'R12',
}
#print(conversion)
def replace(mapping, files):
    regex = r'\b(' + '|'.join(mapping.keys()) + r')\b'
    #print(regex)
    ts = time.time()
    for filename in files:
        with open(filename) as fh:
            data = fh.read()
        data = re.sub(regex, lambda match: mapping[match.group(1)], data)
        os.rename(filename, f"{filename}.{ts}") # backup with current timestamp
        with open(filename, 'w') as fh:
            fh.write(data)
replace(conversion, sys.argv[1:])
Split with regex
fname = Foo
lname = Bar
email=foo@bar.com
import sys
import re
# data: field_value_pairs.txt
if len(sys.argv) != 2:
    exit(f"Usage: {sys.argv[0]} filename")
filename = sys.argv[1]
with open(filename) as fh:
    for line in fh:
        line = line.rstrip("\n")
        field, value = re.split(r'\s*=\s*', line)
        print(f"{value}={field}")
Foo=fname
Bar=lname
foo@bar.com=email
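One thing to watch for: if a value itself contains an = sign, the split returns more than two parts and the unpacking into field and value fails. A possible way around this, shown here on a made-up line, is the maxsplit parameter of re.split.

```python
import re

line = "url = https://example.com/?a=1"

# Without maxsplit the '=' inside the value is also used as a separator
parts = re.split(r'\s*=\s*', line)
print(parts)

# With maxsplit=1 only the first '=' separates field and value
field, value = re.split(r'\s*=\s*', line, maxsplit=1)
print(f"{field} -> {value}")
```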
Exercises: Regexes part 1
Pick a file with some text in it. Write a script (one for each item) that prints out every line from the file that matches the requirement. You can use the script at the end of the page as a starting point but you will have to change it!
- has a 'q'
- starts with a 'q'
- has 'th'
- has a 'q' or a 'Q'
- has a '*' in it
- starts with a 'q' or a 'Q'
- has both 'a' and 'e' in it
- has an 'a' and somewhere later an 'e'
- does not have an 'a'
- has neither an 'a' nor an 'e'
- has an 'a' but not 'e'
- has at least 2 consecutive vowels (a,e,i,o,u) like in the word "bear"
- has at least 3 vowels
- has at least 6 characters
- has exactly 6 characters
- all the words with either 'Bar' or 'Baz' in them
- all the rows with either 'apple pie' or 'banana pie' in them
- for each row print if it was apple or banana pie?
- Bonus: Print if the same word appears twice in the same line
- Bonus: has a double character (e.g. 'oo')
import sys
import re
if len(sys.argv) != 2:
    print("Usage:", sys.argv[0], "FILE")
    exit()
filename = sys.argv[1]
with open(filename, 'r') as fh:
    for line in fh:
        print(line, end=" ")
        match = re.search(r'REGEX1', line)
        if match:
            print(" Matching 1", match.group(0))
        match = re.search(r'REGEX2', line)
        if match:
            print(" Matching 2", match.group(0))
Exercise: Regexes part 2
Write functions that return True if the given value is a
- Hexadecimal number
- Octal number
- Binary number
Write a function that, given a string, returns True if the string is a number. As there might be several definitions of what a number is, create several solutions, one for each definition:
- Non negative integer.
- Integer. (Will you also allow + in front of the number or only - ?)
- Real number. (Do you allow .3 ? What about 2. ?)
- In scientific notation. (something like this: 2.123e4 )
23
2.3
2.3.4
2.4e3
abc
Exercise: Sort SNMP numbers
Given a file with SNMP numbers (one number on every line) print them in sorted order comparing the first number of each SNMP number first. If they are equal then comparing the second number, etc...
input:
1.2.7.6
4.5.7.23
1.2.7
1.12.23
2.3.5.7.10.8.9
1.2.7.5
output:
1.2.7
1.2.7.5
1.2.7.6
1.12.23
2.3.5.7.10.8.9
4.5.7.23
Exercise: parse hours log file and create report
The log file looks like this
{% embed include file="src/examples/regex/timelog.log)
The report should look something like this:
09:20-11:00 Introduction
11:00-11:15 Exercises
11:15-11:35 Break
11:35-12:30 Numbers and strings
12:30-13:30 Lunch Break
13:30-14:10 Exercises
14:10-14:30 Solutions
14:30-14:40 Break
14:40-15:40 Lists
15:40-17:00 Exercises
17:00-17:30 Solutions
09:30-10:30 Lists and Tuples
10:30-10:50 Break
10:50-12:00 Exercises
12:00-12:30 Solutions
12:30-12:45 Dictionaries
12:45-14:15 Lunch Break
14:15-16:00 Exercises
16:00-16:15 Solutions
16:15-16:30 Break
16:30-17:00 Functions
17:00-17:30 Exercises
Break 65 minutes 6%
Dictionaries 15 minutes 1%
Exercises 340 minutes 35%
Functions 30 minutes 3%
Introduction 100 minutes 10%
Lists 60 minutes 6%
Lists and Tuples 60 minutes 6%
Lunch Break 150 minutes 15%
Numbers and strings 55 minutes 5%
Solutions 95 minutes 9%
Exercise: Parse ini file
An ini file has sections starting by the name of the section in square brackets and within each section there are key = value pairs with optional spaces around the "=" sign. The keys can only contain letters, numbers, underscore or dash. In addition there can be empty lines and lines starting with # which are comments.
Given a filename, generate a 2-dimensional dictionary and then print it out. Example ini file:
{% embed include file="src/examples/regex/inifile.ini)
If you print it, it should look like this (except for the nice formatting).
{
'alpha': {
'base': 'moon',
'ship': 'alpha 3'
},
'earth': {
'base': 'London',
'ship': 'x-wing'
}
}
Exercise: Replace Python
Write a script called replace_python.py that given a file will replace all occurrences of "Python" or "python" by Java, but will avoid replacing the word in Monty Python. It prints the resulting rows to the screen.
For example given the input:
Just a line without either of the languages.
Line with both Python and python
A line with Monty Python.
And a line with Monty Python and Python.
Will print:
Just a line without either of the languages.
Line with both Java and java
A line with Monty Python.
And a line with Monty Python and Java.
Exercise: Extract phone numbers
Given a text message fetch all the phone numbers:
Fetch numbers that look like 09-1234567
then also fetch +972-2-1234567
and maybe also 09-123-4567
This 123 is not a phone number.
Solution: Sort SNMP numbers
import sys
def process(filename):
    snmps = []
    with open(filename) as fh:
        for row in fh:
            snmps.append({
                'orig': row.rstrip(),
            })
    #print(snmps)
    max_number_of_parts = 0
    max_number_of_digits = 0
    for snmp in snmps:
        snmp['split'] = snmp['orig'].split('.')
        max_number_of_parts = max(max_number_of_parts, len(snmp['split']))
        for part in snmp['split']:
            max_number_of_digits = max(max_number_of_digits, len(part))
    padding = "{:0" + str(max_number_of_digits) + "}"
    #print(padding)
    for snmp in snmps:
        padded = []
        padded_split = snmp['split'] + ['0'] * (max_number_of_parts - len(snmp['split']))
        for part in padded_split:
            padded.append(padding.format( int(part)))
        snmp['padded'] = padded
        snmp['joined'] = '.'.join(padded)
    #print(snmps)
    #print(max_number_of_parts)
    #print(max_number_of_digits)
    snmps.sort(key = lambda e: e['joined'])
    sorted_snmps = []
    for snmp in snmps:
        sorted_snmps.append( snmp['orig'] )
    for snmp in sorted_snmps:
        print(snmp)
# get the max number of all the snmp parts
# make each snmp the same length
# pad each part to that length with leading 0s
if len(sys.argv) < 2:
    exit("Usage: {} FILENAME".format(sys.argv[0]))
process(sys.argv[1])
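A shorter alternative to the padding approach above, offered only as a sketch: Python compares tuples element by element, so converting each SNMP number to a tuple of integers gives the required ordering directly. The function name snmp_key is made up for this example and the input is the list from the exercise.

```python
def snmp_key(snmp):
    # "1.12.23" -> (1, 12, 23); tuples are compared part by part
    return tuple(int(part) for part in snmp.split('.'))

snmps = ["1.2.7.6", "4.5.7.23", "1.2.7", "1.12.23", "2.3.5.7.10.8.9", "1.2.7.5"]
sorted_snmps = sorted(snmps, key=snmp_key)
for snmp in sorted_snmps:
    print(snmp)
```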
Solution: parse hours log file and give report
import sys
if len(sys.argv) < 2:
    exit("Usage: {} FILENAME".format(sys.argv[0]))
data = {}
def read_file(filename):
    entries = []
    with open(filename) as fh:
        for row in fh:
            row = row.rstrip("\n")
            if row == '':
                process_day(entries)
                entries = []
                continue
            #print(row)
            time, title = row.split(" ", 1)
            #print(time)
            #print(title)
            #print('')
            entries.append({
                'start': time,
                'title': title,
            })
        process_day(entries)
def process_day(entries):
    for i in range(len(entries)-1):
        start = entries[i]['start']
        title = entries[i]['title']
        end = entries[i+1]['start']
        print("{}-{} {}".format(start, end, title))
        # manual way to parse timestamp and calculate elapsed time
        # as we have not learned to use the datetime module yet
        start_hour, start_min = start.split(':')
        end_hour, end_min = end.split(':')
        start_in_min = 60*int(start_hour) + int(start_min)
        end_in_min = 60*int(end_hour) + int(end_min)
        elapsed_time = end_in_min - start_in_min
        #print(elapsed_time)
        if title not in data:
            data[title] = 0
        data[title] += elapsed_time
    print('')
def print_summary():
    total = 0
    for val in data.values():
        total += val
    for key in sorted( data.keys() ):
        print("{:20} {:4} minutes {:3}%".format(key, data[key], int(100 * data[key]/total)))
read_file( sys.argv[1] )
print_summary()
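For comparison, the elapsed time could also be calculated with the datetime module instead of the manual arithmetic. This is only a sketch. The function name elapsed_minutes is made up, and it assumes both timestamps are on the same day in HH:MM format.

```python
from datetime import datetime

def elapsed_minutes(start, end):
    fmt = "%H:%M"
    # Subtracting two datetime objects gives a timedelta
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

print(elapsed_minutes("09:20", "11:00"))
print(elapsed_minutes("12:30", "13:30"))
```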
Solution: Processing INI file manually
# comment
# deep comment
outer = 42
[person]
fname = Foo
lname=Bar
phone = 123
[company]
name = Acme Corp.
phone = 456
import sys
import re
# Sample input data.ini
def parse():
    if len(sys.argv) != 2:
        exit("Usage: {} FILENAME".format(sys.argv[0]))
    filename = sys.argv[1]
    data = {}
    # print("Dealing with " + filename)
    with open(filename) as fh:
        section = '__DEFAULT__'
        for line in fh:
            if re.match(r'^\s*(#.*)?$', line):
                continue
            match = re.match(r'^\[([^\]]+)\]\s*$', line)
            if match:
                # print('Section "{}"'.format(match.group(1)))
                section = match.group(1)
                continue
            match = re.match(r'^\s*(.+?)\s*=\s*(.*?)\s*$', line)
            if match:
                # print('field: "{}" value: "{}"'.format(match.group(1), match.group(2)))
                if not data.get(section):
                    data[section] = {}
                data[section][ match.group(1) ] = match.group(2)
    return data
if __name__ == '__main__':
    ini = parse()
    print(ini)
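The exercise says that keys can only contain letters, numbers, underscore or dash, while the regex in the solution above accepts anything in front of the = sign. A possible stricter variant is sketched below. The function name parse_ini_lines and the sample text are made up for this example, and re.ASCII is used so that \w means [a-zA-Z0-9_] only.

```python
import re

# Keys: letters, digits, underscore or dash only
KEY_VALUE = re.compile(r'^\s*([\w-]+)\s*=\s*(.*?)\s*$', re.ASCII)

def parse_ini_lines(lines):
    data = {}
    section = '__DEFAULT__'
    for line in lines:
        if re.match(r'^\s*(#.*)?$', line):      # empty line or comment
            continue
        match = re.match(r'^\[([^\]]+)\]\s*$', line)
        if match:
            section = match.group(1)
            continue
        match = KEY_VALUE.match(line)
        if match:
            data.setdefault(section, {})[match.group(1)] = match.group(2)
    return data

sample = """# comment
outer = 42
[person]
fname = Foo
lname=Bar
bad key = ignored
"""
ini = parse_ini_lines(sample.splitlines())
print(ini)
```

The line with "bad key" is silently skipped here. Reporting it as an error might be a better choice in a real parser.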
Solution: Processing config file
- ConfigParser
[person]
fname = Foo
lname=Bar
phone = 123
# comment
# deep comment
[company]
name = Acme Corp.
phone = 456
import configparser
import sys
def parse():
    if len(sys.argv) != 2:
        print("Usage: " + sys.argv[0] + " FILENAME")
        exit()
    filename = sys.argv[1]
    cp = configparser.RawConfigParser()
    cp.read(filename)
    return cp
ini = parse()
for section in ini.sections():
    print(section)
    for v in ini.items(section):
        print(" {} = {}".format(v[0], v[1]))
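To get the same kind of 2-dimensional dictionary as in the manual solution, the parsed configuration can be converted with a dictionary comprehension. The sketch below reads the configuration from a string with read_string so that it runs without a file. The sample content is a shortened version of the file above.

```python
import configparser

sample = """[person]
fname = Foo
lname=Bar
phone = 123
# comment

[company]
name = Acme Corp.
phone = 456
"""

cp = configparser.RawConfigParser()
cp.read_string(sample)
# items(section) returns (key, value) pairs that dict() can consume
data = {section: dict(cp.items(section)) for section in cp.sections()}
print(data)
```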
Solution: Extract phone numbers
import re
filename = "phone.txt"
with open(filename) as fh:
    for line in fh:
        match = re.search(r'''\b
            (
            \d\d-\d{7}
            |
            \d\d\d-\d-\d{7}
            |
            \d\d-\d\d\d-\d\d\d\d
            )\b''', line, re.VERBOSE)
        if match:
            print(match.group(1))
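The solution above reports at most one number per line and drops the leading + sign. A possible variation, shown only as a sketch on the text of the exercise, uses re.findall to collect every match and allows an optional + in front of the international format. The name PHONE is made up for this example.

```python
import re

PHONE = re.compile(r'''
    (?<![\w-])               # not glued to a preceding word character or dash
    (
        \+?\d{3}-\d-\d{7}    # +972-2-1234567
        |
        \d\d-\d{3}-\d{4}     # 09-123-4567
        |
        \d\d-\d{7}           # 09-1234567
    )
    \b
''', re.VERBOSE)

text = """Fetch numbers that look like 09-1234567
then also fetch +972-2-1234567
and maybe also 09-123-4567
This 123 is not a phone number."""

numbers = PHONE.findall(text)
print(numbers)
```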
Regular Expressions Cheat sheet
| Expression | Meaning |
| a | Just an 'a' character |
| . | any character except new-line |
| [bgh.] | one of the chars listed in the character class b,g,h or . |
| [b-h] | The same as [bcdefgh] |
| [a-z] | Lower case letters |
| [b-] | The letter b or - |
| [^bx] | Anything except b or x |
| \w | Word characters: [a-zA-Z0-9_] |
| \d | Digits: [0-9] |
| \s | [\f\t\n\r ] form-feed, tab, newline, carriage return and SPACE |
| \W | [^\w] |
| \D | [^\d] |
| \S | [^\s] |
| a* | 0-infinite 'a' characters |
| a+ | 1-infinite 'a' characters |
| a? | 0-1 'a' characters |
| a{n,m} | n-m 'a' characters |
| ( ) | Grouping and capturing |
| \| | Alternation |
| \1, \2 | Capture buffers |
| ^ $ | Beginning and end of string anchors |
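A few entries of the cheat sheet tried out in Python. The regexes and sample strings below are made up for this illustration. Each tuple holds a regex, a text, and whether re.search is expected to find a match.

```python
import re

checks = [
    (r'[b-h]',       'dog',       True),   # 'd' and 'g' are in the range b-h
    (r'[^bx]',       'bx',        False),  # there is nothing besides b and x
    (r'^a{2,3}$',    'aaaa',      False),  # 4 is outside of the 2-3 range
    (r'^a?b$',       'b',         True),   # a? allows zero 'a' characters
    (r'(\w)\1',      'book',      True),   # \1 repeats the captured character
    (r'^\d+\s\w+$',  '42 apples', True),   # digits, one white-space, word characters
]

results = [bool(re.search(regex, text)) for regex, text, expected in checks]
for (regex, text, expected), found in zip(checks, results):
    print(f"{regex:15} {text:12} {found}")
```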
Fix bad JSON
{
   subscriptions : [
      {
         name : "Foo Bar",
         source_name : "pypi",
         space names : [
            "Foo", "Bar"
         ]
      }
   ]
}
import re, json, os
json_file = os.path.join(
    os.path.dirname(__file__),
    'bad.json'
)
with open(json_file) as fh:
    data = json.load(fh)
    # ValueError: Expecting property name: line 2 column 4 (char 5)
import re, json, os
def fix(s):
    return re.sub(r'(\s)([^:\s][^:]+[^:\s])(\s+:)', r'\1"\2"\3', s)
json_file = os.path.join(
    os.path.dirname(__file__),
    'bad.json'
)
with open(json_file) as fh:
    bad_json_rows = fh.readlines()
json_str = ''.join(map(fix, bad_json_rows))
print(json_str)
data = json.loads(json_str)
print(data)
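To see what the regex in fix does, here is the same function applied to a short made-up string instead of a file. Note that the regex expects a white-space character in front of the key and at least 3 characters in the key itself, so it relies on the keys being indented.

```python
import json
import re

def fix(s):
    # white-space, then the bare key, then white-space and a colon
    return re.sub(r'(\s)([^:\s][^:]+[^:\s])(\s+:)', r'\1"\2"\3', s)

bad = '{\n  space names : [ "Foo", "Bar" ],\n  source_name : "pypi"\n}\n'
rows = bad.splitlines(keepends=True)
good = ''.join(map(fix, rows))
print(good)
data = json.loads(good)
print(data)
```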
Fix very bad JSON
[
{
TID : "t-0_login_sucess"
Test :
[
{SetValue : { uname : "Zorg", pass : "Rules"} },
{DoAction : "login"},
{CheckResult: [0, LOGGED_IN]}
]
},
{ TID : "t-1_login_failure", Test : [ {SetValue :
{ uname : "11", pass : "im2happy78"} },
{DoAction : "login"}, {CheckResult: [-1000, LOGGED_OUT]} ] }
]
import re, json, os
json_file = os.path.join(
    os.path.dirname(__file__),
    'very_bad.json'
)
with open(json_file, 'r') as fh:
    bad_json = fh.read()
#print(bad_json)
improved_json = re.sub(r'"\s*$', '",', bad_json, flags=re.MULTILINE)
#print(improved_json)
# good_json = re.sub(r'(?<!")(?P<word>[\w-]+)\b(?!")', r'"\g<word>"',
#     improved_json)
# good_json = re.sub(r'(?<[\{\s])(?P<word>[\w-]+)(?=[:\s])', r'"\g<word>"',
#     improved_json)
# good_json = re.sub(r'([\{\[\s])(?P<word>[\w-]+)([:,\]\s])', r'\1"\g<word>"\3',
#     improved_json)
# The replacement must be a raw string, otherwise \g is an invalid escape sequence
good_json = re.sub(r'(?<=[\{\[\s])(?P<word>[\w-]+)(?=[:,\]\s])', r'"\g<word>"',
    improved_json)
#print(good_json)
# with open('out.js', 'w') as fh:
#     fh.write(good_json)
data = json.loads(good_json)
print(data)
Raw string or escape
- \
- r
Let's try to check if a string contains a back-slash.
import re
txt = 'text with slash \\ and more text'
print(txt) # text with slash \ and more text
# m0 = re.search('\', txt)
# SyntaxError: unterminated string literal
# (older versions of Python: EOL while scanning string literal)
# m0 = re.search('\\', txt)
# Exception: re.error: bad escape (end of pattern)
# because the regex engine does not know what to do with a single \
m1 = re.search('\\\\', txt)
if m1:
    print('m1') # m1
m2 = re.search(r'\\', txt)
if m2:
    print('m2') # m2
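The two working versions hand exactly the same pattern to the regex engine. A short sketch to convince ourselves, where the sample string is made up:

```python
import re

escaped = '\\\\'   # regular string: 4 typed back-slashes become 2 characters
raw = r'\\'        # raw string: 2 typed back-slashes stay 2 characters
print(escaped == raw, len(raw))

txt = 'C:\\temp'   # contains one real back-slash
match = re.search(raw, txt)
print(match.start())
```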
Remove spaces regex
This is not necessary as we can use the rstrip, lstrip, strip, and replace string methods.
import re
line = " ab cd "
res = re.sub(r'^\s+', '', line) # leading
print(f"'{res}'")
res = re.sub(r'\s+$', '', line) # trailing
print(f"'{res}'")
# both ends:
res = re.sub(r'\s*(.*)\s*$', r'\1', line) # " ab cd " => "ab cd " because of the greediness
print(f"'{res}'")
res = re.sub(r'^\s*(.*?)\s*$', r'\1', line) # " ab cd " => "ab cd" minimal match
print(f"'{res}'")
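For comparison, a sketch of the same clean-ups with string methods, plus a single regex with alternation that handles both ends without capturing anything:

```python
import re

line = " ab cd "

leading_removed = line.lstrip()     # same result as re.sub(r'^\s+', '', line)
trailing_removed = line.rstrip()    # same result as re.sub(r'\s+$', '', line)
both_removed = line.strip()

# One regex for both ends: white-space at the start OR white-space at the end
with_regex = re.sub(r'^\s+|\s+$', '', line)
print(f"'{both_removed}' '{with_regex}'")
```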
Regex Unicode
Python 3.8 required
print("\N{GREEK CAPITAL LETTER DELTA}")
print("\u05E9")
print("\u05DC")
print("\u05D5")
print("\u05DD")
print("\u262E")
print("\U0001F426") # "bird" (code points above FFFF need \U and 8 hex digits)
print("\u05E9\u05DC\u05D5\u05DD \u262E")
Hello World!
Szia Világ!
!שלום עולם
import re
filename = "mixed.txt"
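with open(filename, encoding="utf-8") as fh:
    lines = fh.readlines()
for line in lines:
    # The re module has no "in Hebrew" character class,
    # so we match any character from the Hebrew block (U+0590-U+05FF).
    if re.search(r'[\u0590-\u05FF]', line):
        print(line)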
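A self-contained sketch of matching Hebrew text, using the three sample lines above as a list instead of a file. It shows two options: a character range covering the Hebrew block, and the \N{NAME} escape inside the regex, which names one specific character and is what needs Python 3.8. The variable names are made up for this example.

```python
import re

lines = [
    "Hello World!",
    "Szia Vil\u00E1g!",
    "!\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD",
]

# Any character from the Hebrew block
hebrew = re.compile(r'[\u0590-\u05FF]')
matching = [line for line in lines if hebrew.search(line)]

# One specific character by its Unicode name (Python 3.8 or later)
shin = re.compile(r'\N{HEBREW LETTER SHIN}')
with_shin = [line for line in lines if shin.search(line)]
print(len(matching), len(with_shin))
```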
Anchors Other example
- \A
- \Z
- ^
- $
import re
strings = [
    "123-XYZ-456",
    "a 123-XYZ-456 b",
    "a 123-XYZ-456",
    "123-XYZ-456 b",
    "123-XYZ-456\n",
]
regexes = [
    r'\d{3}-\w+-\d{3}',
    r'^\d{3}-\w+-\d{3}',
    r'\d{3}-\w+-\d{3}$',
    r'^\d{3}-\w+-\d{3}$',
    r'^\d{3}-\w+-\d{3}\Z',
    r'\A\d{3}-\w+-\d{3}\Z',
]
for r in regexes:
    print(r)
    for s in strings:
        #print(r, s)
        if (re.search(r, s)):
            print(' ', s)
    print('-' * 10)
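The interesting case in the list above is the string with the trailing newline. A short sketch of the difference between $ and \Z on that string, and of what re.MULTILINE changes for ^ and $. The multi-line sample text is made up.

```python
import re

s = "123-XYZ-456\n"

# $ also matches just before a newline at the end of the string
dollar = bool(re.search(r'^\d{3}-\w+-\d{3}$', s))
# \Z matches only at the very end of the string
z = bool(re.search(r'\A\d{3}-\w+-\d{3}\Z', s))
print(dollar, z)

# With re.MULTILINE ^ and $ match at the beginning and end of every line
numbers = re.findall(r'^\d+$', "1\n22\nx3\n", re.MULTILINE)
print(numbers)
```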