[python] re regular expressions

re Regular expressions

This is a personal note on Python notation that I find useful. It does not cover the basics.

There is a lot that could be written about regular expressions, so I will update this page whenever I remember or come across new ways of using them.

When I was younger, I tried my best to learn various regular expression patterns, but now I look them up and use them only when I need them. It’s hard to learn them all and become a master of regular expressions.

github

The Jupyter notebook file is available on GitHub here.

google colaboratory

To run it in Google Colaboratory, use this notebook: 004/004_nb.ipynb

environment

!sw_vers

ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G2022

!python -V

Python 3.7.3

Importing re

A regular expression operation usually needs two strings: the string you want to parse and the regular expression pattern used to parse it.

Import the module as follows.

import re

re.compile()

There are two ways to search and replace with regular expressions. One is to pre-compile the pattern with re.compile() and reuse the resulting pattern object whenever it is needed. The other is to pass the pattern string directly to a function such as re.findall(), which compiles it at the time of the call. If you use the same pattern many times, it is better to use re.compile().

Here is an example of the two usage methods.

Without compile

target = r'asdfghjkl'
pat = r'gh'

ret = re.findall(pat, target)

if ret:
  print('result of findall : ', ret)

result of findall : ['gh']

Using compile

target = r'asdfghjkl'
pat = r'gh'

pat_target = re.compile(pat)

ret = pat_target.findall(target)

if ret:
  print('result of findall : ', ret)

result of findall : ['gh']
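As a minimal sketch of why pre-compiling is convenient, the same pattern object can be reused for several strings. The sample strings here are made up for illustration and are not from the original notebook.

```python
import re

# Compile the pattern once and keep the pattern object.
pattern = re.compile(r'gh')

# Hypothetical sample inputs, chosen only for illustration.
samples = ['asdfghjkl', 'qwerty', 'ghghgh']

# Reuse the same compiled object for every string.
results = {s: pattern.findall(s) for s in samples}

for s, found in results.items():
    print(s, '->', found)
```

A string with no match simply gives an empty list, so the result can be tested with a plain if.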

It is best to write regular expression patterns as raw strings (with the r prefix) by default, because backslashes are then passed to the regular expression engine unchanged.
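The following sketch shows what can go wrong without the r prefix, using \b as the example. The sample text is hypothetical.

```python
import re

# In a normal string literal '\b' is a single backspace character,
# while r'\b' keeps the backslash so the regex engine sees a word boundary.
normal = '\bcat\b'
raw = r'\bcat\b'

text = 'cat concat cat.'  # hypothetical sample text

found_normal = re.findall(normal, text)  # looks for backspace + 'cat' + backspace
found_raw = re.findall(raw, text)        # looks for the whole word 'cat'

print(len(normal), len(raw))
print(found_normal)
print(found_raw)
```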

Searching for strings

There are four functions for searching. Personally, findall is the one I use most often.

re.match : Checks whether the beginning of the target string matches the regular expression.
re.search : Checks whether any part of the target string matches the regular expression (not only the beginning).
re.findall : Returns a list of the parts of the target string that match the regular expression.
re.finditer : Returns the parts of the target string that match the regular expression as an iterator of match objects.
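To compare the four functions side by side, here is a small sketch that applies the same pattern to one string. The target string is a made-up variation of the one used in this note.

```python
import re

pat = r'[0-9]+'
target = 'this_is_1_apple_and_22_oranges'  # hypothetical sample string

# match only succeeds at the very beginning of the string.
m = re.match(pat, target)

# search returns the first match anywhere in the string.
s = re.search(pat, target)

# findall returns every matching substring as a list of strings.
found = re.findall(pat, target)

# finditer yields match objects, so positions are available too.
spans = [mo.span() for mo in re.finditer(pat, target)]

print('match    :', m)
print('search   :', s.group(), s.span())
print('findall  :', found)
print('finditer :', spans)
```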

re.match

import re

pat = r'[a-z_]+'
target = 'this_is_1_apple.'

result_match = re.match(pat, target)

print("### string ###")
print('pat : ', pat)
print('target : ', target)
print()

print('### group ###')
if result_match:
  print('group :', result_match.group())
  print('span :', result_match.span())
  print('start :', result_match.start())
  print('end :', result_match.end())
else:
  print('match None')

### string ###
pat : [a-z_]+
target : this_is_1_apple.

### group ###
group : this_is_
span : (0, 8)
start : 0
end : 8

re.search

pat = r'[0-9]+'
target = 'this_is_1_apple.'

result_search = re.search(pat, target)

print("### string ###")
print('pat : ', pat)
print('target : ', target)
print()

print('### search ###')
if result_search:
  print('group :', result_search.group())
  print('span :', result_search.span())
  print('start :', result_search.start())
  print('end :', result_search.end())
else:
  print('search None')

### string ###
pat : [0-9]+
target : this_is_1_apple.

### search ###
group : 1
span : (8, 9)
start : 8
end : 9

Parentheses capture groups, and groups() returns them as a tuple. A repeated group such as (...)* keeps only its last repetition.

pat = r'(abc(...)*def)'
target = 'sssabcsabcssdefsssdefssssssssssss'

result_search = re.search(pat, target)

print("### string ###")
print('pat : ', pat)
print('target : ', target)
print()

print('### search ###')
if result_search:
  print('group :', result_search.group())
  print('span :', result_search.span())
  print('start :', result_search.start())
  print('end :', result_search.end())
  print('groups :', result_search.groups())
else:
  print('search None')

### string ###
pat : (abc(...)*def)
target : sssabcsabcssdefsssdefssssssssssss

### search ###
group : abcsabcssdefsssdef
span : (3, 21)
start : 3
end : 21
groups : ('abcsabcssdefsssdef', 'sss')
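To illustrate how nested and repeated groups are numbered, here is a minimal sketch. The pattern and the sample string are made up for this note. Groups are numbered by the position of their opening parenthesis, and a group inside a repetition holds only the last text it matched.

```python
import re

# Group 1 is the outer parenthesis, group 2 the inner, repeated one.
pat = r'(id(\d\d)*end)'
target = 'xxid112233endxx'  # hypothetical sample string

m = re.search(pat, target)

whole = m.group(0)   # the entire match
outer = m.group(1)   # same text here, because group 1 wraps the whole pattern
last = m.group(2)    # only the last repetition of (\d\d)

print(whole, outer, last, m.span())
```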

re.findall

pat = r'aaat(.*)tb([a-z]*)b'
target = 'aaatestbbbcccbbbbb'

result_findall = re.findall(pat, target)

print("### string ###")
print('pat : ', pat)
print('target : ', target)
print()

print('### findall ###')
print(result_findall)

### string ###
pat : aaat(.*)tb([a-z]*)b
target : aaatestbbbcccbbbbb

### findall ###
[('es', 'bbcccbbbb')]
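The shape of the list returned by findall depends on how many groups the pattern has, which is easy to forget. The following sketch shows the three cases on the same target string; the patterns are chosen only for illustration.

```python
import re

target = 'aaatestbbbcccbbbbb'

# No group: a list of the whole matches.
no_group = re.findall(r'b+', target)

# One group: a list of the text captured by that group.
one_group = re.findall(r'(c+)b', target)

# Two or more groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(t)(b+)', target)

print(no_group)
print(one_group)
print(two_groups)
```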

re.finditer

Returns an iterator that yields a match object for each match.

pat = r'aaat(.*)tb([a-z]*)b'
target = 'aaatestbbbcccbbbbb'

result_finditer = re.finditer(pat, target)

print("### string ###")
print('pat : ', pat)
print('target : ', target)
print()

print('### finditer ###')
print(result_finditer)

### string ###
pat : aaat(.*)tb([a-z]*)b
target : aaatestbbbcccbbbbb

### finditer ###
<callable_iterator object at 0x107d76198>
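Printing the iterator itself only shows the object, so in practice it is looped over. A minimal sketch, using a simpler pattern than the one above so that there are several matches:

```python
import re

pat = r'[a-z]+'
target = 'this_is_1_apple.'

# Each item produced by finditer is a match object,
# so both the matched text and its position are available.
matches = [(m.group(), m.span()) for m in re.finditer(pat, target)]

for text, span in matches:
    print(text, span)
```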

Replacing strings

re.sub()

Replaces the parts of a string that match a regular expression. Personally, I use this most often.

re.sub(regular expression pattern, replacement string, string to process)

pat = r'0(8|9)0-[0-9]{4}-[0-9]{4}'
repl = '0X0-YYYY-ZZZZ'
obj = '080-1234-5678'

re.sub(pat, repl, obj)

'0X0-YYYY-ZZZZ'

pat = r'0(8|9)0-[0-9]{4}-[0-9]{4}'
obj = """
080-1234-5678
090-8765-4321
"""

print(re.sub(pat,r'0X0-YYYY-ZZZZ', obj))

0X0-YYYY-ZZZZ
0X0-YYYY-ZZZZ
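By default re.sub replaces every match. As a small sketch, the count argument limits the number of replacements, and re.subn also reports how many were made. The input string is a one-line variation of the example above.

```python
import re

pat = r'0(8|9)0-[0-9]{4}-[0-9]{4}'
repl = '0X0-YYYY-ZZZZ'
obj = '080-1234-5678 / 090-8765-4321'  # hypothetical one-line input

# Replace only the first match.
first_only = re.sub(pat, repl, obj, count=1)

# subn returns the new string together with the number of replacements.
replaced, n = re.subn(pat, repl, obj)

print(first_only)
print(replaced, n)
```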

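Backreferences

Groups captured by the pattern can be referenced in the replacement string as \1, \2, ... or as \g<1>, \g<2>, ...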

pat = r'0(8|9)0-([0-9]{4})-([0-9]{4})'
obj = '080-1234-5678'

re.sub(pat,r'0\g<1>0-\3-\2', obj)

'080-5678-1234'

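When a group reference is immediately followed by a digit, write it as \g<1> instead of \1. Otherwise \10 would be read as a reference to group 10.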

pat = r'0(8|9)0-([0-9]{4})-([0-9]{4})'
obj = '080-1234-5678'

re.sub(pat,r'0\g<1>0-\3-\2', obj)

'080-5678-1234'
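To see why \g<1> is needed here, the following sketch compares it with the ambiguous form \10. With only three groups in the pattern, \10 is taken as a reference to a tenth group and re.sub raises re.error.

```python
import re

pat = r'0(8|9)0-([0-9]{4})-([0-9]{4})'
obj = '080-1234-5678'

# \g<1>0 means "group 1, then a literal 0".
ok = re.sub(pat, r'0\g<1>0-\3-\2', obj)

# \10 is parsed as group 10, which does not exist in this pattern.
try:
    re.sub(pat, r'0\10-\3-\2', obj)
    ambiguous_failed = False
except re.error as exc:
    ambiguous_failed = True
    print('error :', exc)

print(ok)
```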

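Multiple lines

With flags=re.MULTILINE, ^ matches at the start of each line rather than only at the start of the string. Note that the ^ in the replacement string below is just a literal character.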

pat = r'^0(8|9)0-[0-9]{4}-[0-9]{4}'
obj = """\
080-1234-5678
090-1234-4567
"""

re.sub(pat,'^0X0-YYYY-ZZZ', obj, flags=re.MULTILINE)

'^0X0-YYYY-ZZZ\n^0X0-YYYY-ZZZ\n'
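For comparison, here is a minimal sketch of the same substitution with and without the flag. The replacement string is written without the literal ^ so that the difference is easier to see.

```python
import re

pat = r'^0(8|9)0-[0-9]{4}-[0-9]{4}'
repl = '0X0-YYYY-ZZZZ'
obj = '080-1234-5678\n090-1234-4567\n'

# Without the flag, ^ only matches at the start of the whole string,
# so only the first line is replaced.
without_flag = re.sub(pat, repl, obj)

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.sub(pat, repl, obj, flags=re.MULTILINE)

print(repr(without_flag))
print(repr(with_flag))
```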