# Python Regular Expressions
## Python2.x Regular Expressions
A regular expression is a special character sequence that helps you conveniently check if a string matches a certain pattern.
Python has included the `re` module since version 1.5, which provides Perl-style regular expression patterns.
The `re` module gives Python the full functionality of regular expressions.
The `compile` function generates a regular expression object from a pattern string and optional flag parameters. This object has a series of methods for regular expression matching and replacement.
The `re` module also provides functions that are functionally identical to these methods, using a pattern string as their first parameter.
This chapter mainly introduces the commonly used regular expression processing functions in Python.
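As a minimal sketch of the equivalence described above (the sample string is made up for illustration), a method on a compiled pattern and the module-level function of the same name should give the same result:

```python
import re

# Hypothetical sample text, used only for illustration.
text = 'order 66, aisle 7'

# Compile once, then call the method on the pattern object.
pattern = re.compile(r'\d+')
via_method = pattern.findall(text)

# Module-level function: the pattern string is the first argument.
via_function = re.findall(r'\d+', text)

print(via_method)
print(via_function)
```

Compiling is mostly worthwhile when the same pattern is reused many times; for a one-off match the module-level function is usually enough.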
* * *
## re.match Function
`re.match` attempts to match a pattern from the beginning of a string. If the match is not successful at the starting position, `match()` returns `None`.
**Function Syntax:**
```python
re.match(pattern, string, flags=0)
```
**Function Parameter Description:**
| Parameter | Description |
| --- | --- |
| pattern | The regular expression pattern to match. |
| string | The string to be matched. |
| flags | Flag bits that control how the regular expression is matched, such as case sensitivity or multiline matching. See the section "Regular Expression Modifiers - Optional Flags". |
If the match is successful, the `re.match` method returns a match object; otherwise, it returns `None`.
We can use the `group(num)` or `groups()` match object functions to get the matched expression.
| Match Object Method | Description |
| --- | --- |
| group(num=0) | The string of the entire matched expression. `group()` can take multiple group numbers at once, in which case it will return a tuple containing the values corresponding to those groups. |
| groups() | Returns a tuple containing all group strings, from 1 to the number of groups contained. |
## Example
```python
import re

print(re.match('www', 'www.example.com').span())  # 'www' is at the start: match
print(re.match('com', 'www.example.com'))         # 'com' is not at the start: None
```
The output of the above example is:
```
(0, 3)
None
```
## Example
```python
import re

line = "Cats are smarter than dogs"
# Group 1 is greedy; group 2 is non-greedy and stops at the next space.
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)
if matchObj:
    print("matchObj.group() : " + matchObj.group())
    print("matchObj.group(1) : " + matchObj.group(1))
    print("matchObj.group(2) : " + matchObj.group(2))
else:
    print("No match!!")
```
The result of executing the above example is as follows:
```
matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter
```
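The table above says that `group()` accepts several group numbers at once and that `groups()` returns every group. A small sketch of both calls, reusing the sample sentence with a slightly different pattern chosen here for illustration:

```python
import re

line = "Cats are smarter than dogs"
# Illustrative pattern with three capturing groups.
m = re.match(r'(\w+) are (\w+) than (\w+)', line)

whole = m.group()        # the entire match
first_last = m.group(1, 3)  # several group numbers -> a tuple
every = m.groups()       # all groups, starting from group 1

print(whole)
print(first_last)
print(every)
```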
* * *
## re.search Method
`re.search` scans the entire string and returns the first successful match.
**Function Syntax:**
```python
re.search(pattern, string, flags=0)
```
**Function Parameter Description:**
| Parameter | Description |
| --- | --- |
| pattern | The regular expression pattern to match. |
| string | The string to be matched. |
| flags | Flag bits used to control the matching mode of the regular expression, such as: whether to distinguish case, multiline matching, etc. |
If the match is successful, the `re.search` method returns a match object; otherwise, it returns `None`.
We can use the `group(num)` or `groups()` match object functions to get the matched expression.
| Match Object Method | Description |
| --- | --- |
| group(num=0) | The string of the entire matched expression. `group()` can take multiple group numbers at once, in which case it will return a tuple containing the values corresponding to those groups. |
| groups() | Returns a tuple containing all group strings, from 1 to the number of groups contained. |
## Example
```python
import re

print(re.search('www', 'www.example.com').span())  # found at the start
print(re.search('com', 'www.example.com').span())  # found at the end
```
The output of the above example is:
```
(0, 3)
(12, 15)
```
## Example
```python
import re

line = "Cats are smarter than dogs"
# Same pattern as in the re.match example, this time with re.search.
searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)
if searchObj:
    print("searchObj.group() : " + searchObj.group())
    print("searchObj.group(1) : " + searchObj.group(1))
    print("searchObj.group(2) : " + searchObj.group(2))
else:
    print("Nothing found!!")
```
The result of executing the above example is as follows:
```
searchObj.group() : Cats are smarter than dogs
searchObj.group(1) : Cats
searchObj.group(2) : smarter
```
* * *
## Difference Between re.match and re.search
`re.match` only matches at the beginning of the string: if the pattern does not match there, the match fails and the function returns `None`. `re.search` scans the whole string and returns the first match it finds.
## Example
```python
import re

line = "Cats are smarter than dogs"
# re.match only tries the start of the string, so 'dogs' is not found.
matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() : " + matchObj.group())
else:
    print("No match!!")
# re.search scans the whole string and finds 'dogs' at the end.
searchObj = re.search(r'dogs', line, re.M | re.I)
if searchObj:
    print("search --> searchObj.group() : " + searchObj.group())
else:
    print("No match!!")
```
The output of the above example is:
```
No match!!
search --> searchObj.group() : dogs
```
* * *
## Search and Replace
Python's `re` module provides `re.sub` for replacing matched items in a string.
**Syntax:**
```python
re.sub(pattern, repl, string, count=0, flags=0)
```
**Parameters:**
* `pattern`: The pattern string in the regular expression.
* `repl`: The replacement string, which can also be a function.
* `string`: The original string to be searched and replaced.
* `count`: The maximum number of replacements after pattern matching. The default is 0, which means replacing all matches.
## Example
```python
import re

phone = "2004-959-559 # This is an overseas phone number"
# Remove the comment that starts with '#'.
num = re.sub(r'#.*$', "", phone)
print("Phone number is: " + num)
# Remove every character that is not a digit.
num = re.sub(r'\D', "", phone)
print("Phone number is : " + num)
```
The result of executing the above example is as follows:
```
Phone number is: 2004-959-559
Phone number is : 2004959559
```
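The `count` parameter described above can be sketched as follows. The phone-like string is reused only as sample data, and `count` is passed by keyword to keep the call unambiguous:

```python
import re

text = '2004-959-559'

# count=1 replaces only the first match.
first_only = re.sub(r'-', ' ', text, count=1)

# The default count=0 replaces every match.
all_matches = re.sub(r'-', ' ', text)

print(first_only)
print(all_matches)
```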
### repl Parameter is a Function
In the following example, the matched numbers in the string are multiplied by 2:
## Example
```python
import re

def double(matched):
    # Multiply the number captured by the named group 'value' by 2.
    value = int(matched.group('value'))
    return str(value * 2)

s = 'A23G4HFD567'
print(re.sub(r'(?P<value>\d+)', double, s))
```
The execution output is:
```
A46G8HFD1134
```
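When `repl` is a string rather than a function, it can also refer back to captured groups, by number (`\1`) or by name (`\g<name>`). A small sketch, with a made-up date string as input:

```python
import re

# Hypothetical input, used only for illustration.
date = '2024-05-17'

# Numbered group references in the replacement string.
swapped = re.sub(r'(\d+)-(\d+)-(\d+)', r'\3/\2/\1', date)

# Named group references in the replacement string.
named = re.sub(
    r'(?P<year>\d+)-(?P<month>\d+)-(?P<day>\d+)',
    r'\g<day>.\g<month>.\g<year>',
    date,
)

print(swapped)
print(named)
```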
### re.compile Function
The `compile` function is used to compile a regular expression, generating a regular expression (Pattern) object, which can be used by functions such as `match()`, `search()`, and `findall`.
**Syntax Format:**
```python
re.compile(pattern[, flags])
```
**Parameters:**
* `pattern`: A regular expression in string form.
* `flags`: Optional, indicating the matching mode, such as ignoring case, multiline mode, etc. The specific parameters are:
1. **re.I**: Ignore case.
2. **re.L**: Makes the special character sets `\w`, `\W`, `\b`, `\B`, `\s`, `\S` locale-dependent.
3. **re.M**: Multiline mode.
4. **re.S**: Makes `.` match any character including newline (`.` does not match newline by default).
5. **re.U**: Makes the special character sets `\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s`, `\S` dependent on the Unicode character property database.
6. **re.X**: For readability, ignores whitespace in the pattern and comments after `#`.
## Example
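```python
>>> import re
>>> pattern = re.compile(r'\d+')                      # one or more digits
>>> m = pattern.match('one12twothree34four')          # start of string: no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 2, 10)   # start at 'e': no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 3, 10)   # start at '1': match
>>> print(m)   # prints a Match object; its repr depends on the Python version
>>> m.group(0)
'12'
>>> m.start(0)
3
>>> m.end(0)
5
>>> m.span(0)
(3, 5)
```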
In the above, when a match is successful, a Match object is returned, where:
* The `group([group1, ...])` method is used to obtain one or more group-matched strings. When you want to get the entire matched substring, you can directly use `group()` or `group(0)`.
* The `start()` method is used to get the starting position of the group-matched substring in the entire string (the index of the first character of the substring). The default value of the parameter is 0.
* The `end()` method is used to get the ending position of the group-matched substring in the entire string (the index of the last character of the substring + 1). The default value of the parameter is 0.
* The `span()` method returns `(start(group), end(group))`.
Let's look at another example:
## Example
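```python
>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I ignores case
>>> m = pattern.match('Hello World Wide Web')
>>> print(m)   # prints a Match object; its repr depends on the Python version
>>> m.group(0)   # the entire match
'Hello World'
>>> m.span(0)
(0, 11)
>>> m.group(1)   # first group
'Hello'
>>> m.span(1)
(0, 5)
>>> m.group(2)   # second group
'World'
>>> m.span(2)
(6, 11)
>>> m.groups()   # equivalent to (m.group(1), m.group(2))
('Hello', 'World')
>>> m.group(3)   # there is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
```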
### findall
Finds all substrings in the string that match the regular expression and returns them as a list. If the pattern contains more than one group, it returns a list of tuples. If no match is found, it returns an empty list.
**Note:** `match` and `search` match once, `findall` matches all.
**Syntax Format:**
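```python
re.findall(pattern, string, flags=0)       # module-level function
pattern.findall(string[, pos[, endpos]])   # method of a compiled pattern
```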
**Parameters:**
* `string`: The string to be matched.
* `pos`: Optional parameter, specifying the starting position of the string, default is 0.
* `endpos`: Optional parameter, specifying the ending position of the string, default is the length of the string.
Find all numbers in the string:
## Example
```python
import re

pattern = re.compile(r'\d+')   # one or more digits
result1 = pattern.findall('tutorial 123 google 456')
result2 = pattern.findall('run88oob123google456', 0, 10)   # only characters 0 to 9
print(result1)
print(result2)
```
Output:
```
['123', '456']
['88', '12']
```
When the pattern contains several groups, `findall` returns a list of tuples:
## Example
```python
import re

result = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
print(result)
```
Output:
```
[('width', '20'), ('height', '10')]
```
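How the number of groups changes the shape of the `findall` result can be sketched on the same sample string. The two extra patterns are variations added here for comparison:

```python
import re

text = 'set width=20 and height=10'

# No group: each item is the whole match.
no_group = re.findall(r'\w+=\d+', text)

# One group: each item is that group's text.
one_group = re.findall(r'\w+=(\d+)', text)

# Two groups: each item is a tuple of the groups.
two_groups = re.findall(r'(\w+)=(\d+)', text)

print(no_group)
print(one_group)
print(two_groups)
```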
### re.finditer
Similar to `findall`, it finds all substrings in the string that match the regular expression and returns them as an iterator.
```python
re.finditer(pattern, string, flags=0)
```
**Parameters:**
| Parameter | Description |
| --- | --- |
| pattern | The regular expression pattern to match. |
| string | The string to be matched. |
| flags | Flag bits that control how the regular expression is matched, such as case sensitivity or multiline matching. See the section "Regular Expression Modifiers - Optional Flags". |
## Example
```python
import re

it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())   # each item is a match object
```
Output:
```
12
32
43
3
```
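Because `finditer` yields match objects rather than plain strings, each result also carries its position. A short sketch on the same input, collecting the text and span of every match:

```python
import re

# Pair each matched substring with its (start, end) span.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', '12a32bc43jf3')]

for text, span in found:
    print(text, span)
```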
### re.split
The `split` method splits the string according to the matched substrings and returns a list. Its usage is as follows:
```python
re.split(pattern, string[, maxsplit=0, flags=0])
```
**Parameters:**
| Parameter | Description |
| --- | --- |
| pattern | The regular expression pattern to match. |
| string | The string to be matched. |
| maxsplit | The maximum number of splits. `maxsplit=1` splits once. The default is 0, which means no limit on the number of splits. |
| flags | Flag bits that control how the regular expression is matched, such as case sensitivity or multiline matching. See the section "Regular Expression Modifiers - Optional Flags". |
## Example
```python
>>> import re
>>> re.split(r'\W+', 'tutorial, tutorial, tutorial.')
['tutorial', 'tutorial', 'tutorial', '']
>>> re.split(r'(\W+)', ' tutorial, tutorial, tutorial.')    # captured separators are kept
['', ' ', 'tutorial', ', ', 'tutorial', ', ', 'tutorial', '.', '']
>>> re.split(r'\W+', ' tutorial, tutorial, tutorial.', 1)   # split at most once
['', 'tutorial, tutorial, tutorial.']
>>> re.split('a*', 'hello world')   # Python 2.x result; Python 3.7+ also splits on empty matches
['hello world']
```
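A common use of `re.split` is splitting on several delimiters at once. The sketch below uses a made-up record and a delimiter pattern chosen for illustration, and also shows the effect of `maxsplit`:

```python
import re

# Hypothetical record with mixed delimiters and uneven spacing.
record = 'apple, banana;cherry,  date'

# Split on a comma or semicolon followed by optional whitespace.
fields = re.split(r'[,;]\s*', record)

# Stop after two splits; the rest of the string is left untouched.
limited = re.split(r'[,;]\s*', record, maxsplit=2)

print(fields)
print(limited)
```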
* * *
## Regular Expression Objects
### re.RegexObject
`re.compile()` returns a `RegexObject` object.
### re.MatchObject
* `group()` returns the string matched by the RE.
* `start()` returns the start position of the match.
* `end()` returns the end position of the match.
* `span()` returns a tuple containing the (start, end) positions of the match.
* * *
## Regular Expression Modifiers - Optional Flags
Regular expressions can contain some optional flag modifiers to control the matching mode. Modifiers are specified as optional flags. Multiple flags can be specified by bitwise ORing (`|`) them together, e.g., `re.I | re.M` sets both the I and M flags:
| Modifier | Description |
| --- | --- |
| re.I | Makes the match case-insensitive. |
| re.L | Performs locale-aware matching. |
| re.M | Multiline matching, affecting `^` and `$`. |
| re.S | Makes `.` match any character including newline. |
| re.U | Parses characters based on the Unicode character set. This flag affects `\w`, `\W`, `\b`, `\B`. |
| re.X | Verbose mode: whitespace in the pattern is ignored and `#` starts a comment, so the pattern can be laid out more readably. |
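The effect of a few of these flags, alone and combined with `|`, can be sketched as follows. The two-line sample text and the date pattern are invented for illustration:

```python
import re

# Hypothetical two-line sample text.
text = 'first line\nSecond LINE'

# re.I: ignore case.
case_insensitive = re.findall(r'line', text, re.I)

# Without re.M, ^ matches only at the start of the string.
without_m = re.findall(r'^\w+', text)

# With re.M, ^ also matches at the start of each line.
with_m = re.findall(r'^\w+', text, re.M)

# Flags combined with bitwise OR.
combined = re.findall(r'^second line$', text, re.I | re.M)

# re.X: whitespace and # comments inside the pattern are ignored.
verbose = re.compile(r"""
    (\d{4})   # year
    -         # separator
    (\d{2})   # month
""", re.X)
year_month = verbose.match('2024-05').groups()

print(case_insensitive, without_m, with_m, combined, year_month)
```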
* * *
## Regular Expression Patterns
Pattern strings use special syntax to represent a regular expression:
Letters and numbers represent themselves. A letter or number in a regular expression pattern matches the same string.
Most letters and numbers have different meanings when preceded by a backslash.
Punctuation symbols only match themselves when escaped; otherwise, they have special meanings.
The backslash itself needs to be escaped with a backslash.
Since regular expressions often contain backslashes, it's best to write them as raw strings. A pattern element such as `r'\t'` (equivalent to `'\\t'`) matches the corresponding special character.
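A brief sketch of why raw strings matter: in an ordinary string literal, `'\b'` is the backspace character, so a word-boundary pattern written without the `r` prefix silently stops working. The sample sentence is made up for illustration:

```python
import re

# A raw string keeps the backslash; both spellings are the same two characters.
same_pattern = (r'\t' == '\\t')

sentence = 'cat concat cat'

# Raw string: \b reaches the regex engine as a word boundary.
raw = re.findall(r'\bcat\b', sentence)

# Non-raw string: '\b' becomes a backspace character, so nothing matches.
not_raw = re.findall('\bcat\b', sentence)

print(same_pattern, raw, not_raw)
```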
The following table lists the special elements in regular expression pattern syntax. If you provide optional flag parameters along with the pattern, the meaning of some pattern elements will change.
| Pattern | Description |
| --- | --- |
| ^ | Matches the start of the string. |
| $ | Matches the end of the string. |
| . | Matches any character except newline. When the `re.DOTALL` flag is specified, it can match any character including newline. |
| [...] | Represents a set of characters, listed individually: `[amk]` matches 'a', 'm', or 'k'. |
| [^...] | Characters not in the set: `[^abc]` matches any character except a, b, c. |
| re* | Matches 0 or more occurrences of the preceding expression. |
| re+ | Matches 1 or more occurrences of the preceding expression. |
| re? | Matches 0 or 1 occurrence of the preceding expression (greedy). Appended to another quantifier, as in `*?`, `+?`, or `??`, it makes that quantifier non-greedy. |
| re{ n} | Matches exactly n occurrences of the preceding expression. For example, `o{2}` cannot match the "o" in "Bob", but can match the two o's in "food". |
| re{ n,} | Matches n or more occurrences of the preceding expression. For example, `o{2,}` cannot match the "o" in "Bob", but can match all the o's in "foooood". `o{1,}` is equivalent to `o+`. `o{0,}` is equivalent to `o*`. |
| re{ n, m} | Matches n to m occurrences of the preceding expression, greedy. |
| a \| b | Matches a or b. |
| (re) | Groups the regular expression and remembers the matched text. |
| (?imx) | Turns on the optional flags i, m, or x from inside the pattern. In Python's `re` module, flags set this way apply to the entire pattern. |
| (?-imx) | Turns off the i, m, or x optional flags in Perl-style engines. Python 2.x's `re` module does not support this form. |
| (?: re) | Similar to (...), but does not create a capturing group. |
| (?imx: re) | Uses the i, m, or x optional flags only within the parentheses. Python 2.x's `re` module does not support this scoped form. |
| (?-imx: re) | Turns off the i, m, or x optional flags only within the parentheses. Python 2.x's `re` module does not support this scoped form. |
| (?#...) | Comment. |
| (?= re) | Positive lookahead. Matches if the contained regular expression, represented by ..., matches at the current position, but fails otherwise. However, once the contained expression has been tried, the matching engine has not advanced; the rest of the pattern must still try to match to the right of the lookahead. |
| (?! re) | Negative lookahead. The opposite of the positive lookahead; matches if the contained expression cannot match at the current position of the string. |
| (?> re) | Independent (atomic) pattern, avoiding backtracking. Not supported by the `re` module in Python 2.x. |
| `\w` | Matches any alphanumeric character or underscore. |
| `\W` | Matches any character that is not alphanumeric or underscore. |
| `\s` | Matches any whitespace character, equivalent to `[ \t\n\r\f\v]`. |
| `\S` | Matches any non-whitespace character. |
| `\d` | Matches any digit, equivalent to `[0-9]`. |
| `\D` | Matches any non-digit character. |
| `\A` | Matches the start of the string. |
| `\Z` | Matches the end of the string. In Python's `re` module, `\Z` matches only at the very end of the string, even when the string ends with a newline. |
| `\z` | Matches the end of the string in Perl-style engines. Not supported by the `re` module in Python 2.x; use `\Z` instead. |
| `\G` | Matches the position where the last match ended in Perl-style engines. Not supported by Python's `re` module. |
| `\b` | Matches a word boundary, i.e., the position between a word and a space. For example, `er\b` can match the 'er' in "never", but cannot match the 'er' in "verb". |
| `\B` | Matches a non-word boundary. `er\B` can match the 'er' in "verb", but cannot match the 'er' in "never". |
| `\n`, `\t`, etc. | Matches a newline character, a tab character, etc. |
| `\1`...`\9` | Matches the content of the nth group. |
| `\10` | Matches the content of the nth group if it has been matched. Otherwise, it refers to the octal character code expression. |
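A few of the elements in this table are easier to see in action. The sketch below, on invented sample strings, contrasts greedy and non-greedy repetition and then uses a backreference and the two lookahead forms:

```python
import re

# Hypothetical sample strings, used only for illustration.
html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* takes as much as possible, so the whole string is one match.
greedy = re.findall(r'<.*>', html)

# Non-greedy: .*? stops at the first '>' it can.
lazy = re.findall(r'<.*?>', html)

# Backreference: \1 must repeat whatever group 1 captured.
doubled = re.findall(r'\b(\w+) \1\b', 'it is is a a test')

# Positive lookahead: digits only when followed by 'px' (the 'px' is not consumed).
pixels = re.findall(r'\d+(?=px)', '10px 20em 30px')

# Negative lookahead: 'foo' only when not followed by 'bar'.
not_bar = re.findall(r'foo(?!bar)\w*', 'foobar foobaz')

print(greedy, lazy, doubled, pixels, not_bar)
```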
* * *
## Regular Expression Examples
#### Character Matching
| Example | Description |
| --- | --- |
| python | Matches "python". |
#### Character Classes
| Example | Description |
| --- | --- |
| [Pp]ython | Matches "Python" or "python". |
| rub[ye] | Matches "ruby" or "rube". |
| [aeiou] | Matches any one letter in the brackets. |
| [0-9] | Matches any digit. Similar to [0123456789]. |
| [a-z] | Matches any lowercase letter. |
| [A-Z] | Matches any uppercase letter. |
| [a-zA-Z0-9] | Matches any letter and digit. |
| [^aeiou] | Matches any character except the vowels a, e, i, o, u. |
| [^0-9] | Matches any character except a digit. |
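Character classes like these can be tried out with `re.findall`. The sample strings below are made up for illustration:

```python
import re

# A class at the start of a word: either capitalisation matches.
variants = re.findall(r'[Pp]ython', 'Python, python, jython')

# A class at the end of a word: only 'y' or 'e' is accepted.
endings = re.findall(r'rub[ye]', 'ruby rube rubx')

# A range inside a class.
digits = re.findall(r'[0-9]', 'a1b22')

# A negated class: anything that is neither a vowel nor whitespace.
consonants = re.findall(r'[^aeiou\s]', 'hi you')

print(variants, endings, digits, consonants)
```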
#### Special Character Classes
| Example | Description |
| --- | --- |
| . | Matches any single character except "\n". To match any character including "\n", use a pattern like `[\s\S]` or the `re.S` flag. |
| `\d` | Matches a digit character. Equivalent to [0-9]. |
| `\D` | Matches a non-digit character. Equivalent to [^0-9]. |
| `\s` | Matches any whitespace character, including space, tab, form feed, etc. Equivalent to [ \f\n\r\t\v]. |
| `\S` | Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v]. |
| `\w` | Matches any word character including underscore. Equivalent to '[A-Za-z0-9_]'. |
| `\W` | Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'. |
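As a small sketch of how `.` treats newlines, the following compares the default behaviour with two ways of matching any character at all. The two-line string is invented for the example, and the shorthand comparison at the end holds for this ASCII sample:

```python
import re

# Hypothetical text with a newline between 'a' and 'b'.
text = 'a\nb'

# By default '.' does not match the newline.
dot_default = re.findall(r'a.b', text)

# A class combining \s and \S matches every character, newline included.
any_char = re.findall(r'a[\s\S]b', text)

# The re.S flag makes '.' itself match the newline.
with_flag = re.findall(r'a.b', text, re.S)

# For ASCII input, \d and [0-9] find the same digits.
shorthand_equal = re.findall(r'\d', 'x1y2') == re.findall(r'[0-9]', 'x1y2')

print(dot_default, any_char, with_flag, shorthand_equal)
```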