re Regular expressions
This is a personal note on Python notation that I find useful. It skips the basics and covers only what I actually use.
There seems to be a lot to write about regular expressions, but I will update this page as needed when I remember or encounter new ways of using them.
When I was younger, I tried my best to learn various regular expression patterns, but now I look them up and use them only when I need them. It’s hard to learn them all and become a master of regular expressions.
github
- The jupyter notebook format file on github is here .
google colaboratory
- To run it in google colaboratory here 004/004_nb.ipynb)
environment
! sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.6
BuildVersion: 18G2022
!python -V
Python 3.7.3
Importing re
Working with regular expressions usually involves two strings: the target string you want to search and the regular expression pattern to search for.
The module is imported as follows:
import re
re.compile()
There are two ways to search and replace with regular expressions. One is to pre-compile the pattern with re.compile() and reuse the resulting pattern object whenever it is needed. The other is to call module-level functions such as re.findall() directly, which compile the pattern at the time of the call. If you use the same pattern many times, it is better to use re.compile().
Here is an example of the two usage methods.
Without compile
target = r'asdfghjkl'
pat = r'gh'
ret = re.findall(pat, target)
if ret:
    print('result of findall : ', ret)
result of findall : ['gh']
Using compile
target = r'asdfghjkl'
pat = r'gh'
pat_target = re.compile(pat)
# Search the target string, not the pattern string.
ret = pat_target.findall(target)
if ret:
    print('result of findall : ', ret)
result of findall : ['gh']
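As a rough sketch of the case where pre-compiling pays off, the snippet below compiles a pattern once and reuses it for several strings. The sample data `lines` and the name `pat_digits` are made up for illustration and are not from the original notebook.

```python
import re

# Hypothetical sample data, invented for this example.
lines = ['id=10', 'no digits here', 'id=25 and id=7']

# Compile once, then reuse the same pattern object for every string.
pat_digits = re.compile(r'[0-9]+')

results = [pat_digits.findall(line) for line in lines]
print(results)  # [['10'], [], ['25', '7']]
```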
It is best to write regular expression patterns as raw strings (with the r prefix) by default, so that backslashes reach the re module unchanged instead of being interpreted as Python string escapes.
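A small sketch of what the r prefix changes, using the word-boundary escape \b as the example. The strings and variable names here are my own illustration.

```python
import re

# In a normal string, '\b' is a single backspace character.
# In a raw string, r'\b' is a backslash followed by 'b'.
normal = '\bcat\b'
raw = r'\bcat\b'
print(len(normal), len(raw))  # 5 7

text = 'a cat sat'
found_normal = re.findall(normal, text)  # looks for literal backspace characters
found_raw = re.findall(raw, text)        # \b works as a word boundary
print(found_normal, found_raw)  # [] ['cat']
```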
Searching for strings
There are four main search functions. Personally, findall is the one I use most often.
- re.match : Checks whether the beginning of the target string matches the regular expression.
- re.search : Checks whether any part of the target string matches the regular expression (not only the beginning).
- re.findall : Returns a list of the parts of the target string that match the regular expression.
- re.finditer : Returns the parts of the target string that match the regular expression as an iterator of match objects.
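To put the four functions side by side, here is a minimal sketch that runs each of them with the same pattern on the same string. The pattern and the target string are chosen only for illustration.

```python
import re

pat = r'[0-9]+'
target = 'a1b22c333'

m = re.match(pat, target)     # None: the string does not start with a digit
s = re.search(pat, target)    # first match anywhere in the string
fa = re.findall(pat, target)  # list of all matching substrings
fi = [x.group() for x in re.finditer(pat, target)]  # iterate over match objects

print(m)                    # None
print(s.group(), s.span())  # 1 (1, 2)
print(fa)                   # ['1', '22', '333']
print(fi)                   # ['1', '22', '333']
```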
re.match
import re
pat = r'[a-z_]+'
target = 'this_is_1_apple.'
result_match = re.match(pat, target)
print("### string ###")
print('pat : ', pat)
print('target : ', target)
print()
print('### group ###')
if result_match:
    print('group :', result_match.group())
    print('span :', result_match.span())
    print('start :', result_match.start())
    print('end :', result_match.end())
else:
    print('match None')
### string ###
pat : [a-z_]+
target : this_is_1_apple.
### group ###
group : this_is_
span : (0, 8)
start : 0
end : 8
re.search
pat = r'[0-9]+'
target = 'this_is_1_apple.'
result_search = re.search(pat, target)
print("### string ###")
print('pat : ', pat)
print('target : ', target)
print()
print('### search ###')
if result_search:
    print('group :', result_search.group())
    print('span :', result_search.span())
    print('start :', result_search.start())
    print('end :', result_search.end())
else:
    print('search None')
### string ###
pat : [0-9]+
target : this_is_1_apple.
### search ###
group : 1
span : (8, 9)
start : 8
end : 9
pat = r'(abc(...)*def)'
target = 'sssabcsabcssdefsssdefssssssssssssssssssssssssssssssssssssssss'
result_search = re.search(pat, target)
print("### string ###")
print('pat : ', pat)
print('target : ', target)
print()
print('### search ###')
if result_search:
    print('group :', result_search.group())
    print('span :', result_search.span())
    print('start :', result_search.start())
    print('end :', result_search.end())
    print('groups :', result_search.groups())
else:
    print('search None')
### string ###
pat : (abc(...)*def)
target : sssabcsabcssdefsssdefssssssssssssssssssssssssssssssssssssssss
### search ###
group : abcsabcssdefsssdef
span : (3, 21)
start : 3
end : 21
groups : ('abcsabcssdefsssdef', 'sss')
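The groups() output above can be read more easily with a smaller example. The sketch below, with a pattern and strings of my own choosing, shows how group(0), group(1) and groups() relate, and that a repeated group keeps only its last repetition.

```python
import re

m = re.search(r'([a-z]+)_([0-9]+)', 'item_42 and more')
whole = m.group(0)   # the whole match
first = m.group(1)   # first pair of parentheses
second = m.group(2)  # second pair of parentheses
print(whole, first, second)  # item_42 item 42
print(m.groups())            # ('item', '42')

# A group under * is overwritten on each repetition, so only the last one remains.
repeated = re.search(r'(ab|cd)*', 'abcd')
print(repeated.group(0), repeated.groups())  # abcd ('cd',)
```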
re.findall
pat = r'aaat(.*)tb([a-z]*)b'
target = 'aaatestbbbcccbbbbb'
result_findall = re.findall(pat, target)
print("### string ###")
print('pat : ', pat)
print('target : ', target)
print()
print('### findall ###')
print(result_findall)
### string ###
pat : aaat(.*)tb([a-z]*)b
target : aaatestbbbcccbbbbb
### findall ###
[('es', 'bbcccbbbb')]
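The result above is a list of tuples because the pattern contains two groups. As a small illustration with made-up inputs, the shape of what findall returns depends on the number of groups in the pattern:

```python
import re

target = 'a1 b22'

no_group = re.findall(r'[a-z][0-9]+', target)        # whole matches
one_group = re.findall(r'[a-z]([0-9]+)', target)     # only the group's text
two_groups = re.findall(r'([a-z])([0-9]+)', target)  # one tuple per match

print(no_group)    # ['a1', 'b22']
print(one_group)   # ['1', '22']
print(two_groups)  # [('a', '1'), ('b', '22')]
```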
re.finditer
Returns an iterator that yields a match object for each match.
pat = r'aaat(.*)tb([a-z]*)b'
target = 'aaatestbbbcccbbbbb'
result_finditer = re.finditer(pat, target)
print("### string ###")
print('pat : ', pat)
print('target : ', target)
print()
print('### finditer ###')
print(result_finditer)
### string ###
pat : aaat(.*)tb([a-z]*)b
target : aaatestbbbcccbbbbb
### finditer ###
<callable_iterator object at 0x107d76198>
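Printing the iterator itself only shows the object, so in practice it is looped over. A minimal sketch, using a simpler pattern and target string that I made up for this purpose:

```python
import re

pat = r'[0-9]+'
target = 'tel 080 and 1234'

spans = []
for m in re.finditer(pat, target):
    # Each item is a match object, so group() and span() are available.
    spans.append((m.group(), m.span()))
    print(m.group(), m.span())

# prints:
# 080 (4, 7)
# 1234 (12, 16)
```

Compared with findall, this keeps the position of each match, which is handy when the location in the string matters.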
Replacing strings
re.sub()
Replaces the parts of a string that match a regular expression. Personally, I use this most often.
re.sub(regular expression pattern, replacement string, target string)
pat = r'0(8|9)0-[0-9]{4}-[0-9]{4}'
repl = '0X0-YYYY-ZZZZ'
obj = '080-1234-5678'
re.sub(pat, repl, obj)
'0X0-YYYY-ZZZZ'
pat = r'0(8|9)0-[0-9]{4}-[0-9]{4}'
obj = """
080-1234-5678
090-8765-4321
"""
print(re.sub(pat,r'0X0-YYYY-ZZZZ', obj))
0X0-YYYY-ZZZZ
0X0-YYYY-ZZZZ
backreferences
pat = r'0(8|9)0-([0-9]{4})-([0-9]{4})'
obj = '080-1234-5678'
re.sub(pat,r'0\g<1>0-\3-\2', obj)
'080-5678-1234'
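When a group reference is followed directly by a digit, use \g<1> to keep the two apart: \10 would be read as a reference to group 10, whereas \g<1>0 means group 1 followed by a literal 0.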
pat = r'0(8|9)0-([0-9]{4})-([0-9]{4})'
obj = '080-1234-5678'
re.sub(pat,r'0\g<1>0-\3-\2', obj)
'080-5678-1234'
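To see the difference between \g<1>0 and \10 in practice, the sketch below tries both forms with the same pattern. The try/except wrapper and the variable names are my additions for illustration.

```python
import re

pat = r'0(8|9)0-([0-9]{4})-([0-9]{4})'
obj = '080-1234-5678'

# \g<1>0 means "group 1, then a literal 0".
ok = re.sub(pat, r'0\g<1>0-\3-\2', obj)
print(ok)  # 080-5678-1234

# \10 is read as a reference to group 10, which this pattern does not have.
try:
    re.sub(pat, r'0\10-\3-\2', obj)
    raised = False
except re.error as e:
    raised = True
    print('re.error:', e)
```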
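multiple lines
With flags=re.MULTILINE, ^ matches at the start of every line, not only at the start of the string.
pat = r'^0(8|9)0-[0-9]{4}-[0-9]{4}'
obj = """\
080-1234-5678
090-1234-4567
"""
# The ^ in the replacement string is a literal character, not an anchor.
re.sub(pat,'^0X0-YYYY-ZZZ', obj, flags=re.MULTILINE)
'^0X0-YYYY-ZZZ\n^0X0-YYYY-ZZZ\n'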
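For comparison, here is a minimal sketch of the same substitution with and without re.MULTILINE. The replacement string here has no leading ^, and the variable names are my own.

```python
import re

pat = r'^0(8|9)0-[0-9]{4}-[0-9]{4}'
obj = '080-1234-5678\n090-1234-4567\n'

# Without the flag, ^ matches only at the very start of the string.
single = re.sub(pat, '0X0-YYYY-ZZZZ', obj)
# With re.MULTILINE, ^ also matches right after each newline.
multi = re.sub(pat, '0X0-YYYY-ZZZZ', obj, flags=re.MULTILINE)

print(repr(single))  # '0X0-YYYY-ZZZZ\n090-1234-4567\n'
print(repr(multi))   # '0X0-YYYY-ZZZZ\n0X0-YYYY-ZZZZ\n'
```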