18 Regular Expressions
Prerequisites (read first if unfamiliar): Chapter 17.
See also: Chapter 20, Chapter 11, Chapter 6.
Purpose

Sooner or later you will need to extract phone numbers from free-text notes, find every file name that matches a pattern, strip junk out of a column, or validate that a user-entered email address at least looks like one. These are all jobs for regular expressions — a small pattern language that matches shapes in text.
Regex has a reputation for being write-once, read-never code. That reputation is earned when people try to express overly clever patterns. For the 90% case you will actually use in data work — finding specific words, extracting substrings, matching digits, cleaning whitespace — regex is a modest, learnable tool. This chapter teaches you enough to use it confidently in pandas, Python scripts, text editors, and the terminal without reaching for a cheat sheet every time.
Learning objectives
By the end of this chapter, you should be able to:
- Explain what a regular expression is and name three situations where it is the right tool.
- Read a short regex and predict what it will match.
- Use literal characters, the seven core metacharacters, character classes, and anchors to write patterns.
- Use `re.search`, `re.match`, `re.findall`, and `re.sub` in Python's `re` module with sensible defaults.
- Use capture groups to extract parts of a match.
- Use `pandas.Series.str.contains`, `.str.extract`, and `.str.replace` for regex on DataFrames.
- Recognize when a regex is becoming too clever and reach for a real parser instead.
Running theme: match shapes, not meaning
Regex matches the shape of text — three digits, a dot, four digits — not its meaning. If your problem requires understanding what the text means (parsing HTML, real emails, code, dates with validation), regex is the wrong tool. If it requires finding a pattern of characters, regex is perfect.
18.1 The seven metacharacters you actually need
The entire language is built from a small set of special characters that mean “not themselves.” Here are the ones you will use constantly:
| Symbol | Meaning |
|---|---|
| `.` | any single character except newline |
| `*` | zero or more of the preceding item |
| `+` | one or more of the preceding item |
| `?` | zero or one of the preceding item (makes it optional) |
| `^` | start of string (or line, with `re.MULTILINE`) |
| `$` | end of string (or line) |
| `\|` | alternation: `cat\|dog` matches either |
Plus these two for grouping and escaping:
| Symbol | Meaning |
|---|---|
| `(...)` | capture group — saves what matched for later extraction |
| `\` | escape — `\.` means a literal dot, `\\` means a literal backslash |
That is most of what you need to know to read 90% of the regexes you encounter in the wild.
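To make the table concrete, here is each metacharacter in a tiny Python check (all the test strings are made up for illustration):

```python
import re

# . matches any single character except newline
assert re.search(r"c.t", "cat")           # '.' stands in for the 'a'
# * and + repeat the preceding item
assert re.search(r"ab*c", "ac")           # zero 'b's is fine with *
assert not re.search(r"ab+c", "ac")       # + needs at least one 'b'
# ? makes the preceding item optional
assert re.search(r"colou?r", "color") and re.search(r"colou?r", "colour")
# ^ and $ anchor to the ends of the string
assert re.search(r"^data", "data science")
assert re.search(r"science$", "data science")
# | is alternation
assert re.search(r"cat|dog", "hot dog")
```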
18.2 Character classes: the shortcut for “one of these”
A character class matches exactly one character from a set:
| Syntax | Matches |
|---|---|
| `[abc]` | a, b, or c |
| `[a-z]` | any lowercase letter |
| `[A-Za-z0-9]` | any alphanumeric character |
| `[^abc]` | any character except a, b, or c |
| `\d` | any digit (equivalent to `[0-9]`) |
| `\w` | any “word” character (`[A-Za-z0-9_]`) |
| `\s` | any whitespace (space, tab, newline) |
| `\D`, `\W`, `\S` | the negations |
Combine with `+`, `*`, `?`, or `{n,m}` to repeat:

| Syntax | Matches |
|---|---|
| `\d+` | one or more digits |
| `\d{3}` | exactly 3 digits |
| `\d{3,5}` | between 3 and 5 digits |
| `\d{2,}` | 2 or more digits |
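A quick sketch of classes and repetition working together (made-up strings):

```python
import re

# \d+ grabs runs of digits; {n} pins down the exact count
assert re.findall(r"\d+", "rooms 12, 305, and 7") == ["12", "305", "7"]
assert re.findall(r"\d{3}", "12 305 7") == ["305"]
# [^...] negates: everything that is not a digit or whitespace
assert re.findall(r"[^\d\s]+", "a1 b2") == ["a", "b"]
```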
18.3 Anchors: where in the string
| Syntax | Matches |
|---|---|
| `^foo` | string starts with foo |
| `foo$` | string ends with foo |
| `\bfoo\b` | word boundary: foo as a whole word, not food or tofoo |
Word boundaries (\b) are the most underused regex feature for data work. They are the difference between matching cat in "the cat sat" (what you want) and also matching it in "concatenate" (what you do not).
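A small demonstration of the difference (the sentence is made up):

```python
import re

text = "the cat sat; concatenate the categories"
# Without boundaries, 'cat' also matches inside longer words
assert len(re.findall(r"cat", text)) == 3
# \b restricts the match to whole words
assert re.findall(r"\bcat\b", text) == ["cat"]
```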
18.4 Python’s re module
Import once at the top of your script or notebook:
```python
import re
```

The four functions you will use most:

```python
re.search(pattern, text)            # find first match anywhere, or None
re.match(pattern, text)             # match only at the start of text
re.findall(pattern, text)           # list of all non-overlapping matches
re.sub(pattern, replacement, text)  # replace all matches
```

Always use raw string literals for patterns. Python strings interpret backslashes (`\n` is a newline), and regex uses backslashes for its own special meanings. The raw-string prefix `r""` tells Python not to interpret them:
```python
# Good
re.search(r"\d+", "order #42")

# Bad: Python sees \d as an invalid escape (or silently a literal 'd')
re.search("\d+", "order #42")
```

Make raw strings a reflex.
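The failure mode can be sneakier than a warning: some cooked escapes, like `\b`, are valid Python string escapes (backspace), so nothing complains and the pattern is just silently wrong:

```python
import re

# In a normal string, "\b" is a backspace character, not a word boundary
assert "\b" != r"\b"
# The raw pattern matches the whole word; the cooked one finds nothing
assert re.search(r"\bcat\b", "a cat here")
assert re.search("\bcat\b", "a cat here") is None
```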
Flags
Two flags come up constantly:
```python
re.search(r"python", "Python", re.IGNORECASE)  # case-insensitive
re.findall(r"^\w+", big_text, re.MULTILINE)    # ^ matches each line
```

You can combine them: `re.IGNORECASE | re.MULTILINE`.
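Combined flags in action, on a made-up multi-line log string:

```python
import re

log = "ERROR: disk full\nerror: retrying\nok"
# Case-insensitive AND per-line anchoring at the same time
hits = re.findall(r"^error", log, re.IGNORECASE | re.MULTILINE)
assert hits == ["ERROR", "error"]
```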
18.5 Capture groups: extracting parts of a match
Parentheses do two jobs: they group a sub-pattern and they capture what matched so you can pull it out later.
```python
text = "Order placed on 2024-03-15 for $29.99"
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
if match:
    year, month, day = match.group(1), match.group(2), match.group(3)
    print(year, month, day)  # 2024 03 15
```

You can also reference groups in the replacement string of `re.sub`:
```python
re.sub(r"(\w+)@(\w+)", r"\2.\1", "alice@example")
# 'example.alice'
```

And you can give groups names for readability:
```python
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)
m.group("year")  # '2024'
```

18.6 Regex in pandas
pandas has regex built into its string methods. Three you will reach for constantly:
`.str.contains(pattern)` — boolean mask for rows that match:

```python
mask = df["note"].str.contains(r"refund|chargeback", case=False, na=False)
df[mask]
```

Always pass `na=False` unless you specifically want NaN to propagate; otherwise a missing value becomes a missing mask and the row filter breaks.
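A minimal illustration of why, with made-up data:

```python
import pandas as pd

notes = pd.Series(["refund issued", None, "all good"])
# With na=False, the missing note simply does not match
mask = notes.str.contains(r"refund", na=False)
assert mask.tolist() == [True, False, False]
# Without na=False, row 1 would be NaN and boolean indexing would break
```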
`.str.extract(pattern)` — pull capture groups into new columns:

```python
# Extract the order number from a free-text note column
df["order_id"] = df["note"].str.extract(r"Order #(\d+)", expand=False)
```

With `expand=False`, a single group gives you a Series you can assign to a column. If your pattern has multiple groups, you get a DataFrame:
```python
parts = df["date"].str.extract(r"(\d{4})-(\d{2})-(\d{2})")
parts.columns = ["year", "month", "day"]
```

`.str.replace(pattern, repl, regex=True)` — substitute matches:

```python
df["phone"] = df["phone"].str.replace(r"[^\d]", "", regex=True)
```

That last example strips every non-digit character, normalizing `(303) 555-1212` and `303.555.1212` both to `3035551212`.
18.7 When not to use regex
Regex is a hammer. It is not the right tool for:
- Parsing HTML or XML. Use `BeautifulSoup` or `lxml`. HTML is not regular, and regex on it is a well-known anti-pattern with many famous rants about it.
- Parsing JSON. Use the `json` module. See Chapter 20.
- Validating real email addresses or URLs. A fully RFC-compliant email regex runs to thousands of characters. Use a dedicated library (`email-validator`, `urllib.parse`) or a simple “contains an @ and a dot” sanity check.
- Anything where you need to understand structure, not match shape. If the text has nested elements, recursion, or balanced brackets, regex will frustrate you. Reach for a real parser.
A good rule of thumb: if your regex is longer than one line or contains more than three (...) groups, reconsider.
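As an example of the sanity-check option above, here is a deliberately loose shape check — a sketch, not real validation (the function name is ours):

```python
import re

def plausible_email(s: str) -> bool:
    # Deliberately loose: something@something.something, no spaces.
    # This checks shape only; real validation needs a dedicated library.
    return bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", s))

assert plausible_email("alice@example.com")
assert not plausible_email("not-an-email")
assert not plausible_email("a@b")
```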
18.8 Stakes and politics
Regular expressions are syntactic plumbing, but the plumbing has a cultural assumption baked in: matching is about ASCII. The character classes most tutorials introduce — `\w` for “word character,” `\d` for “digit,” `[A-Za-z]` for “letter” — were defined for an English-letter, Arabic-numeral world, and in ASCII mode they quietly do the wrong thing in any other one. In that mode, `\d` matches 0–9 but not Arabic-Indic or Devanagari digits, and `\w` matches `[A-Za-z0-9_]`, excluding accented letters, Cyrillic, Han characters, and combining marks. (Python 3’s `re` is Unicode-aware by default for `str` patterns — passing `re.ASCII` opts into the narrow behavior — but many other tools and engines still default to ASCII semantics.) A “validate this name” regex written in the English-centric way silently rejects names containing characters that two-thirds of the world’s people use.
See Chapter 8 for the broader framework. The concrete prompt to carry forward: when you write a regex over real text, ask whose alphabets it implicitly assumes — and decide deliberately whether that assumption is right for your data. In Python, that means thinking twice before passing `re.ASCII`; in pandas, the `.str` methods handle Unicode by default.
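A quick illustration of the difference in Python 3 (the strings are made up; `٤٢` is 42 in Arabic-Indic digits):

```python
import re

# Python 3's re is Unicode-aware by default for str patterns:
# Arabic-Indic digits count as \d ...
assert re.findall(r"\d+", "page ٤٢") == ["٤٢"]
# ... unless you opt into ASCII-only semantics
assert re.findall(r"\d+", "page ٤٢", re.ASCII) == []
# \w likewise includes accented and non-Latin letters by default
assert re.fullmatch(r"\w+", "Zoë")
assert not re.fullmatch(r"\w+", "Zoë", re.ASCII)
```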
18.9 Worked examples
Extracting order IDs from free-text notes
You have a column of customer service notes and you need the order ID mentioned in each one.
```python
import pandas as pd

notes = pd.Series([
    "Customer called about Order #4829, refund requested",
    "Order #1337 shipped late",
    "No order mentioned",
    "Order #9999 and Order #1000, dispute",
])
orders = notes.str.extract(r"Order #(\d+)")
print(orders)
```

Output:

```
      0
0  4829
1  1337
2   NaN
3  9999
```

Note that row 3 only extracted the first match. If you need all matches per row, use `.str.findall(r"Order #(\d+)")` instead and deal with the list.
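The `.str.findall` route returns a list per row; pandas’ `explode` then gives one row per match if that is the shape you need (made-up data):

```python
import pandas as pd

notes = pd.Series(["Order #9999 and Order #1000", "Order #42", "none"])
# With one capture group, findall returns the group's matches per row
all_ids = notes.str.findall(r"Order #(\d+)")
assert all_ids.tolist() == [["9999", "1000"], ["42"], []]
# explode() turns the lists into one row per match (empty list -> NaN)
one_per_row = all_ids.explode()
```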
Normalizing phone numbers
```python
import re

phones = ["(303) 555-1212", "303.555.1212", "+1 303 555 1212", "3035551212"]
cleaned = [re.sub(r"[^\d]", "", p) for p in phones]
# ['3035551212', '3035551212', '13035551212', '3035551212']
```

Now you can compare them (modulo the country code).
Simple validation with re.fullmatch
```python
def looks_like_us_zip(s: str) -> bool:
    return bool(re.fullmatch(r"\d{5}(-\d{4})?", s))

looks_like_us_zip("80301")       # True
looks_like_us_zip("80301-1234")  # True
looks_like_us_zip("803011")      # False
looks_like_us_zip("80301 ")      # False (trailing space)
```

`re.fullmatch` requires the pattern to cover the entire string — much safer than `re.match` for validation.
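The difference matters because `re.match` tolerates trailing junk:

```python
import re

# re.match anchors only at the start, so extra text slips through
assert re.match(r"\d{5}", "80301-EXTRA")
# re.fullmatch must consume the whole string
assert re.fullmatch(r"\d{5}", "80301-EXTRA") is None
```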
Grep from the terminal
The grep command uses regex too. From Chapter 11:
```shell
grep -E '^def \w+' src/*.py   # every function definition
grep -rnE 'TODO|FIXME' src/   # all TODO / FIXME markers
grep -vE '^\s*#' config.cfg   # strip comment lines
```

Note the `-E` on the second command too: without it, `|` is a literal character in grep’s basic regex syntax. Your regex skill transfers directly.
18.10 Templates
A cheat-sheet for the patterns you will reuse most often:
```python
r"\d+"                # one or more digits
r"\d{3}-\d{4}"        # 3 digits, dash, 4 digits
r"[A-Za-z]+"          # one or more letters
r"\s+"                # whitespace run (use for splitting or cleanup)
r"^\s+|\s+$"          # leading or trailing whitespace (alt to .strip())
r"\b\w+\b"            # whole words
r"[A-Z][a-z]+"        # a capitalized word (names, usually)
r"#\w+"               # hashtag
r"@\w+"               # mention / username
r"\d{4}-\d{2}-\d{2}"  # ISO-ish date (shape only, not validation)
```

18.11 Exercises
- Write a regex that matches a US phone number written as `(xxx) xxx-xxxx`, `xxx-xxx-xxxx`, or `xxx.xxx.xxxx`. Test it on five variations.
- You have a log file with lines like `2024-03-15 14:22:03 ERROR Failed to connect`. Write a regex with capture groups that extracts the date, time, level, and message.
- Given a pandas Series of URLs, use `.str.extract` to pull out the domain (the part between `://` and the next `/`).
- Write a regex that matches words of 4–7 letters from a block of English text. Use `\b`. Find a short paragraph to test on.
- Use `re.sub` to redact credit card numbers from a string, replacing any 16-digit run with `XXXX-XXXX-XXXX-XXXX`.
- Using the terminal, run `grep -E` with a regex to find every line in your Python source files that starts with `def` or `class` — a quick index of your API.
- Take a regex you find confusing and rewrite it on paper, breaking it into pieces and explaining each. If you cannot, it is probably too clever and a simpler approach exists.
18.12 One-page checklist
- Use raw strings (`r"..."`) for every Python regex.
- Start with the simplest thing that matches what you want; only add complexity when it fails.
- Anchor with `\b`, `^`, or `$` when you want exact boundaries.
- Use character classes (`\d`, `\w`, `\s`) rather than long bracket groups.
- Use `re.fullmatch` for validation, `re.search` for extraction, `re.findall` for all matches, `re.sub` for replacement.
- In pandas, use `.str.contains(..., na=False)`, `.str.extract`, and `.str.replace(..., regex=True)`.
- Test your regex on at least one “expected” and one “unexpected” input before trusting it.
- Reach for a parser (BeautifulSoup, json, etc.) when the text has structure.
- If your regex is over one line or has lots of groups, consider rewriting in Python code.
Further reading

- Python docs, `re` module reference — the authoritative list of pattern syntax and functions in Python’s regex engine.
- Python docs, Regular Expression HOWTO — a longer-form tutorial that builds intuition before you reach for the reference.
- regex101 — an interactive tester that explains every part of a pattern as you type; switch the flavor to “Python” and your local results match what `re` will do.
- Jeffrey Friedl, Mastering Regular Expressions — the standard book on regex internals across languages; worth knowing it exists when a complex pattern is fighting you.
- Unicode Consortium, UTS #18: Unicode Regular Expressions — the technical standard for what “regex over Unicode” should mean; useful context for the ASCII-bias issue raised in “Stakes and politics” above.
- Jan Goyvaerts, Regular-Expressions.info — a deep, language-agnostic reference for regex syntax and engine behaviors.