6  Debugging

Prerequisites (read first if unfamiliar): Chapter 11.

See also: Chapter 7, Chapter 2, Chapter 29.

Purpose

Figure (meme): a person lying on the floor crying, saying “It worked yesterday” and “Now nothing works.”

When a program fails, it can feel like the computer is “mad at you.” In reality, most bugs are ordinary mismatches between what you think the computer is doing and what it is actually doing. Debugging is the practice of finding that mismatch efficiently and fixing it without breaking something else.

In computing courses, beginners sometimes treat debugging as a chaotic activity: rerun the same cell, change random lines, search the error message, and hope it works. Professionals do something different. They treat debugging as a structured investigation. They narrow the search space, form hypotheses, run controlled experiments, collect evidence (including logs), and confirm the fix with tests.

This chapter gives you a repeatable debugging workflow you can use in Python scripts, Jupyter notebooks, spreadsheets, and command-line tools. The emphasis is not on fancy tools. The emphasis is on disciplined thinking: decomposition, minimal reproduction, instrumentation, and verification.

Learning objectives

By the end of this chapter, you should be able to:

  1. Describe debugging as an investigation: symptoms, hypotheses, experiments, and evidence.

  2. Produce a minimal reproducible example (MRE) that isolates a bug.

  3. Decompose a failing program into smaller units and test them independently.

  4. Read error messages and stack traces to locate the relevant failure point.

  5. Use print statements and the logging module to collect useful diagnostic information.

  6. Write small tests (including “smoke tests”) to confirm that your fix works and stays fixed.

  7. Avoid common debugging traps such as random edits, confirmation bias, and stale state in notebooks.

  8. Use AI tools to assist debugging without outsourcing verification or creating new risks.

Running theme: change one thing, observe one thing

Most debugging becomes easier when you adopt one rule: make one change at a time, then observe the result. When you change multiple things at once, you cannot tell which change mattered. Debugging is a science experiment, not a guessing game.

6.1 A mental model: debugging as an evidence-driven loop

A bug is a situation where the program behaves differently than you expect. Debugging is the process of reconciling expectations with reality.

A useful way to think about debugging is a loop:

  1. State the symptom. What exactly went wrong? What did you expect instead?

  2. Reproduce it reliably. Can you make it fail again on demand?

  3. Localize the problem. Where (roughly) does the failure occur?

  4. Form hypotheses. What could cause this symptom?

  5. Run a controlled experiment. Change one thing to test one hypothesis.

  6. Collect evidence. Use output, logs, and small checks.

  7. Apply a fix. Make the smallest change that resolves the cause.

  8. Verify and prevent regression. Confirm the fix and write a test.

This loop is not linear. You may circle back several times. But it is structured: each iteration should reduce uncertainty.

Why the “random edits” strategy fails

Beginners often respond to a bug by changing multiple lines, rerunning, and hoping the symptom goes away. This feels productive, but it fails for three connected reasons. First, you lose causality: if the bug does happen to disappear, you do not know which of your changes was responsible, which means you have not actually learned anything you can apply to the next bug. Second, random edits routinely break code that was already working — the bug “moves” instead of vanishing, and you now have two problems instead of one. Third, even when this approach eventually works, it is the slowest possible way to solve the problem: you spend effort without building understanding, so the same kind of bug will trap you again next week.

A disciplined debugging workflow feels slower for the first few minutes and is dramatically faster after that, because it spends each minute reducing uncertainty rather than spinning in confusion.

6.2 Start with a clear problem statement

Before you dive into code, write a one-sentence statement:

When I do X, I expect Y, but instead I observe Z.

Examples:

  • “When I call pd.read_csv on this file, I expect three columns, but I get one combined column.”

  • “When I run python pipeline.py, I expect an outputs/ folder, but nothing is created and there is no error.”

  • “When I merge my branch, I expect a clean history, but I get a merge conflict in analysis.ipynb.”

This statement forces you to name your expectation. Many issues turn out to be misunderstandings of what a function or tool is supposed to do.

6.3 Reproduce the bug and capture the evidence

If you cannot reproduce a bug, you cannot reliably confirm a fix. Reproduction does not always mean “every time.” It means “often enough that you can test changes.”

What to capture

When something fails, the very first move is to capture the evidence before it disappears. Record the exact commands you ran and the directory you were in when you ran them — pwd and your shell history are your friends here. Copy and paste the exact error message rather than retyping it, because retyping introduces tiny mistakes and you need the literal text to search for it later. Save the full stack trace if there is one (not just the bottom line). Note the inputs that triggered the failure: the file path, the parameters you passed, and a small sample of the data if it is something you can share. And capture the environment: your OS, your Python version, the versions of the key packages, and which environment is currently active.

# What to capture, all in one go
pwd                                # working directory
python --version
python -c "import sys; print(sys.executable)"
pip show pandas | head -2          # package + version
# then copy-paste the failing command and its full output

A practical rule of thumb: if the information would disappear when you close the terminal or restart the notebook, write it down somewhere persistent before you start trying fixes.

Reproduction in notebooks versus scripts

Jupyter notebooks are convenient, but they create a common debugging hazard: hidden state. You can run cells out of order, redefine variables, or keep stale objects in memory. Two ways to protect yourself:

  1. Restart the kernel and run all cells from top to bottom.

  2. When you can, reproduce the issue in a plain script.

If a bug appears in a notebook but not in a script (or vice versa), that difference is evidence.

6.4 Read error messages and stack traces

Error messages are not insults. They are structured signals.

The difference between an error and a symptom

Sometimes the error message is the symptom (e.g., FileNotFoundError). Sometimes the symptom is incorrect output with no error (e.g., a column is all zeros). Both require debugging. Errors are easier because the program tells you where it stopped.

How to read a Python stack trace

A Python stack trace shows the sequence of function calls that led to the error. Beginners often stare at the last line only. Instead:

  1. Find the exception type (e.g., KeyError, TypeError).

  2. Read the exception message (it often includes the missing key or wrong type).

  3. Scan upward for the first line that refers to your code (a file path in your project).

  4. Treat the frames below that line as internal details of libraries.

Common novice mistake: trying to “fix” library code in site-packages. If the traceback points into a library, the cause is usually your inputs or your environment.
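
As a concrete illustration, here is a tiny script that fails with a KeyError (the DataFrame and the 'date' column are made up for the example); its traceback is read exactly as described above.

import pandas as pd

def summarize(df):
    return df["date"].min()   # 'date' does not exist, so pandas raises KeyError

df = pd.DataFrame({"day": ["2024-01-01", "2024-01-02"]})
summarize(df)

# The traceback ends with the exception type and message: KeyError: 'date'.
# Reading upward from that line, you pass through pandas-internal frames
# (files under site-packages) before reaching the frame for summarize() in
# this file; that frame, not the pandas ones, is where to look.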

Error taxonomies: learn a few recurring families

You do not need to memorize every Python exception, but it helps to recognize a small number of families and what each one usually means about where to look.

  • Name and scope errors (NameError): a variable Python has never heard of — almost always a typo, a forgotten import, or a notebook cell you have not run yet.

  • Type mismatches (TypeError): an operation applied to the wrong kind of object — adding a string to an integer, calling something that is not a function, or passing the wrong number of arguments.

  • Indexing and key errors (IndexError, KeyError): reaching for a list element, dict key, or DataFrame column that does not exist — usually a column-name typo or an off-by-one in a loop.

  • File and path errors (FileNotFoundError): the working directory or permissions, not the code itself.

  • Parsing and format errors (almost always ValueError): a value that has the right type but the wrong shape — a string 'N/A' where a number was expected, or a date in a format Python cannot parse.

  • Import and environment errors (ModuleNotFoundError): which Python is running and which packages are installed in that Python (see Chapter 14).

Each family points you at a different first move. A KeyError should make you reach for print(df.columns.tolist()). A ModuleNotFoundError should make you reach for which python and pip list. A FileNotFoundError should make you reach for pwd and ls. The taxonomy is useful precisely because it tells you what to do next, not just what went wrong.
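
If you prefer to run those first moves from inside Python (for example, in a notebook cell), the rough equivalents look like this; df stands for whatever DataFrame you are working with.

import os
import sys

print(df.columns.tolist())             # KeyError: what columns actually exist?
print(sys.executable)                  # ModuleNotFoundError: which Python is running?
print(os.getcwd(), os.listdir("."))    # FileNotFoundError: where am I, and what is here?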

For a fuller treatment of the most common Python exceptions and how to read the surrounding stack frames, see Chapter 7.

6.5 Decomposition: make the problem smaller

Decomposition is the most important debugging skill. You reduce a complex failure to a small failure.

Three decomposition strategies

The first strategy is divide and conquer: split the workflow into stages and find where the bug first appears. A typical data-science pipeline has six stages — load data, clean and transform, compute features, fit a model, evaluate it, and produce outputs — and the bug almost always lives at the boundary between two of them. Run each stage separately, inspect the intermediate result, and ask “is this what I expected at this point?” The first place where the answer is “no” is the place where the bug actually happens, even if the symptom shows up much later.

# Divide and conquer: check the intermediate after every stage
df = load_raw_data("data.csv");        print("loaded:", df.shape)
df = clean_columns(df);                 print("cleaned:", df.shape)
df = compute_features(df);              print("features:", df.shape)
# the first stage where shape or columns surprise you is the bug site

The second strategy is binary search over history: if the code worked yesterday and is broken today, look at what changed. Version control makes this dramatically faster — git log --oneline lists the recent commits, git diff HEAD~5 shows the cumulative diff over the last five, and git bisect will literally do the binary search for you, asking you to mark commits as “good” or “bad” until it isolates the exact one that introduced the bug. Even without git, you can usually copy your last working version into a separate folder and diff the two.

The third strategy is to strip to a minimal reproducible example: take the failing code and aggressively delete anything that is not essential to triggering the bug. Each deletion that still fails is a piece of evidence about which code is irrelevant. The endpoint is a tiny script — usually fewer than 20 lines — that reproduces the failure with no surrounding noise. At that point, the bug is almost always obvious, and even if it is not, you have produced exactly the artifact you need to ask for help (see Chapter 2).

The minimal reproducible example (MRE) as a debugging tool

An MRE is often described as a tool for asking questions, but it is just as valuable as a debugging instrument in its own right. The act of producing a 15-line script that recreates your bug forces you to learn three things you may not have noticed: which inputs actually matter (the ones you can’t delete without losing the failure), which library call is the immediate trigger (the line you can’t remove), and what assumptions you were silently making (the things you have to add to the MRE to get it to fail at all). Most of the time, by the time you finish reducing the example, you have already found the bug.

Practical MRE techniques

A few mechanical techniques make MRE construction faster. The single most useful one is to replace real data with synthetic data — pandas reads from io.StringIO exactly as it does from a file, so a few lines of inline CSV are enough to recreate most data-loading bugs without any external file:

import pandas as pd
from io import StringIO

raw = "name,age\nada,35\nlin,unknown\n"   # 'unknown' is not in pandas' default list of NA markers,
df = pd.read_csv(StringIO(raw))           # so the age column comes back with dtype object, not numeric
pd.to_numeric(df["age"])                  # reproduces the ValueError without a real file

Closely related: hard-code a small example. A three-row DataFrame or a five-element list is almost always enough to reproduce a logic bug, and it’s small enough to reason about end to end. Delete code aggressively — pull out everything that isn’t required to trigger the failure, and if removing a block does not change the symptom, it was not relevant. Finally, freeze any randomness. If your bug only sometimes appears, set the random seeds (random.seed(0), numpy.random.seed(0), torch.manual_seed(0)) so that “sometimes” becomes “every time on this seed,” which is debuggable.
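
A minimal sketch of the seed-freezing move, assuming the nondeterminism comes from Python's random module or NumPy (add the torch call only if your project uses PyTorch):

import random
import numpy as np

random.seed(0)
np.random.seed(0)
# torch.manual_seed(0)   # only if the project uses PyTorch

sample = np.random.normal(size=3)
print(sample)            # identical on every run, so "sometimes fails" becomes repeatable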

6.6 Hypotheses and controlled experiments

Once you have localized the issue, do not jump straight to a fix. First, form a hypothesis.

What a good hypothesis looks like

A useful debugging hypothesis is specific and falsifiable: it makes a concrete claim about cause and effect that you can test in a few minutes. “The file path is relative to the working directory, and I am running from the wrong folder” is a good hypothesis — you can test it with pwd and ls in 30 seconds. “This column is stored as strings with embedded commas, so numeric conversion silently fails” is a good hypothesis — you can test it with df["price"].dtype and df["price"].head(). “I installed pandas into a different environment than the one running my notebook” is a good hypothesis — you can test it with import sys; print(sys.executable) from inside the notebook.

The contrast is with vague non-hypotheses like “something is wrong with pandas” or “my computer is broken.” These are not hypotheses at all, because there is no test you could run that would prove them right or wrong. When you find yourself reaching for one of those, that is the moment to slow down and ask, “What concretely do I think is happening, and how would I know?”
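
From inside Python, the three quick checks mentioned above look roughly like this (df is assumed to be your DataFrame with a price column):

import os
import sys

print("cwd:", os.getcwd())                       # wrong-working-directory hypothesis
print(df["price"].dtype, df["price"].head(3))    # strings-instead-of-numbers hypothesis
print("running under:", sys.executable)          # wrong-environment hypothesis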

Designing a controlled experiment

A controlled experiment changes exactly one factor and observes exactly one outcome. The “one factor” rule is the entire point: if you change two things at once, you cannot interpret the result. The most common experiments are tiny — printing a variable right before the line that crashes, replacing a single suspect input with a known-good value, running the same code in a fresh shell or a new venv to rule out environment state, or commenting out a single transformation step to see whether the bug is upstream or downstream of it.

# Two controlled experiments, one factor each
print("--- before merge ---")
print(df_a.shape, df_b.shape)              # experiment 1: shapes upstream
result = df_a.merge(df_b, on="id")
print("--- after merge ---")
print(result.shape)                         # experiment 2: shape downstream

Whatever experiment you run, write down what you tried and what happened. Even when the bug remains, the run is progress: you have ruled something out, and your search space just got smaller.

6.7 Instrumentation: print statements and sanity checks

Instrumentation means adding temporary measurements to observe program state.

Strategic printing

Print statements are a legitimate debugging tool when used strategically. Good print debugging follows these rules:

  1. Print labels and values (so you know what you are seeing).

  2. Print right before and right after suspicious lines.

  3. Print shapes, types, and small samples—not entire datasets.

  4. Remove or convert prints to logs after you finish.

Examples of useful prints in data work:

  • print(df.shape)

  • print(df.dtypes)

  • print(df.head(3))

  • print(df['col'].isna().mean())
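
Putting rule 1 into practice means attaching a label to each value, so you can tell which line produced which output (df stands for whatever DataFrame you are inspecting):

print("shape after cleaning:", df.shape)
print("age dtype:", df["age"].dtype)
print("missing age fraction:", df["age"].isna().mean())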

Assertions as executable assumptions

An assertion is a statement that should be true. If it is false, the program stops with a clear signal.

For beginners, assertions are useful because they turn silent wrongness into loud wrongness.

Examples:

assert df.shape[0] > 0, "Dataframe has no rows"
assert 'date' in df.columns, "Missing expected column: date"
assert df['age'].min() >= 0, "Negative age values present"

Use assertions to encode assumptions you would otherwise hold in your head.

6.8 Logging: debugging that scales beyond one run

Print statements are fine during exploration, but logging is better when:

  • your code runs for a long time,

  • you run it as a scheduled job,

  • you need to keep evidence for later,

  • multiple people will run the code.

What logging is (and is not)

Logging is a structured way to record events. It is not the same as printing everything. Logs should help you answer:

  • Where did the program get to?

  • What inputs and configuration did it use?

  • How long did steps take?

  • Why did it fail?

Logging levels

Most logging systems have levels such as DEBUG, INFO, WARNING, ERROR. A beginner-friendly interpretation:

  • DEBUG: details useful for developers while diagnosing.

  • INFO: major milestones (started, loaded data, finished).

  • WARNING: something unexpected but not fatal.

  • ERROR: the operation failed.

A minimal Python logging setup

You do not need a complicated configuration. A simple pattern:

import logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s"
)
logger = logging.getLogger(__name__)

logger.info("Starting pipeline")

Then replace prints with logger.info(...) or logger.debug(...).
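
Continuing from that setup, one call per level shows how the interpretations above map onto code; the messages themselves are placeholders:

# continuing from the logger configured above
logger.debug("Parsed configuration: %s", {"input": "data.csv"})   # developer-level detail
logger.info("Starting cleaning step")                             # milestone
logger.warning("Found rows with negative ages; dropping them")    # unexpected, not fatal
logger.error("Failed to write outputs/ (permission denied)")      # the operation failed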

Avoid logging secrets

Never log:

  • passwords,

  • API keys,

  • private personal data,

  • full rows of sensitive datasets.

If you need to confirm that a token exists, log only that it is set, not its value.
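
For example, assuming the key arrives through an environment variable named API_TOKEN (an illustrative name), log its presence rather than its contents:

import logging
import os

logger = logging.getLogger(__name__)
token = os.environ.get("API_TOKEN")
logger.info("API token configured: %s", token is not None)   # logs True/False, never the value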

6.9 Testing: confirm fixes and prevent regressions

Testing is the final stage of debugging. Without tests, a bug can return quietly.

What a test is

A test is an executable claim about behavior. It says: for this input, the output should satisfy this condition.

Beginners often think tests are only for large professional projects. In reality, small tests are one of the best study tools in programming.

A testing ladder for novices

Start small and build up:

  1. Smoke tests: does the script run end-to-end without crashing?

  2. Property checks: outputs have expected shape, columns, ranges.

  3. Unit tests: a single function behaves correctly on small inputs.

  4. Integration tests: multiple components work together.

For most class projects, smoke tests and a handful of unit tests are enough.
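
As a sketch, here is what the first two rungs might look like for a hypothetical cleaning function clean_ages, written as pytest-style test functions:

import pandas as pd

def clean_ages(df):
    # hypothetical function under test: coerce ages to numbers, leaving failures as NaN
    out = df.copy()
    out["age"] = pd.to_numeric(out["age"], errors="coerce")
    return out

def test_smoke():
    # rung 1: runs end-to-end on a tiny input without crashing
    clean_ages(pd.DataFrame({"age": ["35", "42"]}))

def test_properties():
    # rung 2: the output has the expected dtype and range
    result = clean_ages(pd.DataFrame({"age": ["35", "unknown"]}))
    assert result["age"].dtype.kind == "f"
    assert result["age"].dropna().between(0, 120).all()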

Write a test that would have caught the bug

After fixing a bug, ask: “What condition was violated?” Then write a test that fails on the old behavior and passes on the new.

Examples:

  • If a column was missing: test that required columns exist.

  • If parsing failed on a weird value: test that that value is handled.

  • If a function returned wrong units: test numeric values within expected range.
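
For instance, the missing-column case in the first bullet might turn into a test like this (REQUIRED_COLUMNS and load_clean_data are illustrative names, not part of any library):

import pandas as pd

REQUIRED_COLUMNS = {"date", "age", "price"}      # whatever your pipeline actually needs

def load_clean_data():
    # stand-in for your real loading and cleaning code
    return pd.DataFrame({"date": ["2024-01-01"], "age": [35], "price": [9.99]})

def test_required_columns_present():
    df = load_clean_data()
    missing = REQUIRED_COLUMNS - set(df.columns)
    assert not missing, f"Missing expected columns: {sorted(missing)}"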

Golden rule: tests should be deterministic

A test should give the same result every run. If randomness is involved, set a seed or test statistical properties rather than exact values.

6.10 Debugging in common environments

Different environments create different failure modes.

Notebook-specific hazards

Notebooks are wonderful for exploration and treacherous for debugging, because they invite hidden state. The four classic traps are running cells out of order (so the variable on screen is not the variable in memory), redefining a function in a later cell and forgetting that the earlier cells still hold the old version, keeping a stale DataFrame around after you thought you had replaced it, and installing a package into one environment while the notebook kernel is running in another.

The single most reliable countermeasure is Kernel → Restart and Run All. If your code does not work after a clean restart, it does not work, and any “success” you saw before was an illusion produced by leftover state. Beyond that, give your cells single, well-defined responsibilities so it is harder to introduce side effects accidentally, and when something is mysterious, print sys.executable, pandas.__version__, and the relevant variable’s type(...) and id(...) to confirm that the world is what you think it is.

# When confused, dump the basics
import sys, pandas as pd
print("python:", sys.executable)
print("pandas:", pd.__version__)
print("df type:", type(df), "df id:", id(df), "df shape:", df.shape)

Command-line and OS-level bugs

Not every bug lives in the code. A surprising number of “code” bugs are actually environment bugs in disguise: you are in the wrong working directory, the file you are trying to read does not have the right permissions, your shell’s PATH does not point at the Python you think it does, the file is encoded in something other than UTF-8, or you are hitting a line-ending difference between Windows and macOS. Each of these can manifest as something that looks like a Python error but cannot be fixed by changing Python code. (See Chapter 11 for the full command-line toolbox.)

The diagnostic move that catches most of these in one go is to compare environments: if a command works in one terminal but not another, the environment is the suspect, not the code. Run the same command in both and compare the output of pwd, which python, echo $PATH, and python --version. The first place these diverge is the place to investigate.

6.11 A practical debugging checklist

When you feel stuck, use this checklist as a reset:

  1. What exactly is the symptom (expected vs actual)?

  2. Can I reproduce it?

  3. What is the smallest example that fails?

  4. Where is the failure located (line/function/stage)?

  5. What are 2–3 plausible hypotheses?

  6. What experiment tests one hypothesis with one change?

  7. What evidence will confirm or refute it?

  8. After the fix, what test will prevent regression?

Print it and keep it near your desk.

6.12 Stakes and politics

Debugging treats a bug as an objective discrepancy between expected and observed behavior, and most of the time it is. The political dimension shows up at the edges, in the question of which discrepancies count as bugs worth fixing. “Works on my machine” is a famous developer joke, but it has a serious version — bug reports that fail to reproduce in the maintainer’s environment routinely get closed as “cannot reproduce,” and the reporters who see the bug most often are the ones whose environments differ most from the developers’. Users on right-to-left scripts, on assistive technology, on low-bandwidth connections, on older hardware, and on non-English locales all encounter classes of bug that the dominant developer profile rarely sees, and those classes get fixed last (if at all).

See Chapter 8 for the broader framework. The concrete prompt to carry forward: when you cannot reproduce someone else’s bug, ask whose environment yours quietly assumes before deciding the bug is not real.

6.13 Worked examples

The goal of these worked examples is to show the loop in action.

“File not found” that is really “wrong folder”

You run a script and Python tells you FileNotFoundError: data/input.csv. Rather than editing the path at random, you stop and gather evidence about where the script thinks it is. Two commands settle it:

import os
print("cwd:", os.getcwd())
print("data dir:", os.listdir("data") if os.path.isdir("data") else "missing")

If cwd is the project root, the path is fine and the file is genuinely missing. If cwd is some other directory, the path is relative to that other directory and the bug is not in your code at all — you just ran the script from the wrong place. The hypothesis is: the script uses a relative path and you are running it from the wrong directory. The experiment is: run the script from the project root, or rewrite the path so it is computed from __file__ and is independent of the working directory:

from pathlib import Path   # https://docs.python.org/3/library/pathlib.html
HERE = Path(__file__).resolve().parent
DATA = HERE / "data" / "input.csv"
assert DATA.exists(), f"Missing {DATA}"

That assertion is the verification step: it turns the silent assumption (“the file is here”) into a loud check that fails immediately if the assumption ever breaks again. The lesson generalizes: many “code” failures are really about context, not logic. Always confirm where you are before you change what you do.

“It runs but the results are wrong”

Silent wrongness is much harder than a crash. Suppose you compute the average age in your dataset and get back a number close to zero, which is obviously wrong but does not raise any exception. The investigation is decomposition: break the pipeline into stages — load the ages, convert them to numeric, compute the average — and inspect the intermediate result after each stage:

print(df["age"].head())              # what do the values actually look like?
print(df["age"].dtype)               # is the column numeric or object?
print(df["age"].isna().mean())       # how many are NaN?

In this case the column is object, the head shows '35', '42', 'unknown', and the NaN rate is 80%. Hypothesis: most ages are missing or marked 'unknown', so the column was read as strings; a later cleaning step coerced it to numbers and filled every failed conversion with 0 before averaging, dragging the mean down toward zero. The fix is to handle the missing values explicitly at load time (na_values=['unknown']), stop filling them with 0 before averaging, and add a test that locks in an expected non-NaN rate going forward, so the next time someone changes the ingestion the silent failure cannot return. The general lesson is that debugging silent wrongness almost always comes down to inspecting intermediate representations rather than the final answer.
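
A sketch of the fix plus the regression test, with a few lines of inline data standing in for the real file ('unknown' is the sentinel from the example; the names are made up):

import pandas as pd
from io import StringIO

RAW = "name,age\nada,35\nlin,unknown\nsam,42\n"   # stand-in for the real dataset

def load_people():
    # declare the sentinel at load time so missing ages become real NaNs, not strings
    return pd.read_csv(StringIO(RAW), na_values=["unknown"])

def test_ages_are_numeric_and_plausible():
    df = load_people()
    assert df["age"].dtype.kind == "f", "ages should be numeric after loading"
    assert 0 < df["age"].mean() < 120, "average age outside a plausible range"
    assert df["age"].isna().mean() < 0.9, "implausibly high missing-age rate"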

6.14 Using AI tools in debugging

AI tools can help you debug, but they can also increase confusion if you treat them as authoritative.

Good uses of AI

  • Summarize an error message and propose likely categories (type mismatch, missing key).

  • Suggest questions to ask (What is the dtype? What is the working directory?).

  • Propose an MRE by stripping code.

  • Draft a unit test skeleton once you know the expected behavior.

Guardrails

  1. Do not paste secrets or private data.

  2. Verify AI-suggested commands in official docs.

  3. Prefer small diffs: one change at a time.

  4. If the AI proposes a fix, make it fail/pass with a test.

A practical motto: AI can suggest hypotheses; you supply the evidence.

6.15 Templates

Template A: debugging journal entry

When debugging takes more than a few minutes, keep a short journal:

Symptom:
Expected vs actual:
Reproduction steps:
Evidence captured (error/trace/logs):
Hypotheses:
Experiments tried (one per line) + outcomes:
Fix applied:
Test added:

Template B: minimal test checklist

  • Test name describes behavior.

  • Inputs are small and synthetic.

  • Expected outcome is explicit.

  • Test is deterministic.

  • Test fails on the buggy version.

6.16 Exercises

  1. Take a recent error you encountered. Write a one-sentence symptom statement (X, expect Y, observe Z).

  2. Create an MRE that reproduces the error in fewer than 20 lines.

  3. Add two assertions that encode assumptions about your data (columns, ranges, missingness).

  4. Convert three print statements into logging calls with levels.

  5. Fix a bug and then write a unit test that would have caught it.

  6. In a notebook, intentionally create a hidden-state bug (run cells out of order), then fix it by restarting and re-running from top.

6.17 One-page checklist

  • I can state the symptom clearly (expected vs actual).

  • I can reproduce the bug and capture the evidence.

  • I can localize the failure and reduce scope.

  • I form hypotheses and test them with one-change experiments.

  • I use prints/assertions/logs to collect useful signals.

  • I verify the fix and add a test to prevent regression.

  • In notebooks, I manage hidden state (restart + run all).

  • If I use AI tools, I treat outputs as drafts and verify with tests.

📚 Further reading
  • Python logging HOWTO — the official walk-through of loggers, handlers, and levels.
  • Real Python: Python Debugging with pdb — a clean, beginner-friendly introduction to Python’s built-in interactive debugger.
  • Software Carpentry: Python Debugging lesson — a short, scaffolded lesson on systematic debugging with worked examples.
  • John Regehr, How to Debug — a compact, opinionated essay from a systems researcher on hypothesis-driven debugging that translates well to scientific Python.
  • Julia Evans, Bite Size Debugging — a short illustrated zine covering print debugging, strace, gdb, and the mental moves that work across languages.
  • Andreas Zeller, The Debugging Book — a free interactive textbook covering tracing, deltas, fuzzing, and automatic debugging; useful when you want to go beyond print statements.