20 Data File Formats
Prerequisites (read first if unfamiliar): Chapter 10, Chapter 17.
See also: Chapter 16, Chapter 14, Chapter 7.
Purpose

A huge fraction of your time in a first data science course is spent moving data from a file into a pandas DataFrame. On paper, this is a one-liner: pd.read_csv("data.csv"). In practice, real-world files introduce an astonishing number of ways that simple line can go wrong — wrong delimiter, broken encoding, missing-value sentinels that sneak in as strings, dates parsed as objects, silently duplicated rows, multi-level headers, Excel sheets that aren’t sheets, JSON that is almost but not quite valid.
This chapter is a field guide to the five file formats you will meet most often as a novice data scientist — CSV, TSV, JSON, Excel, and Parquet — with the specific quirks of each, the pandas reader invocation that will actually work, and the debugging steps for when it does not. The goal is not to make you a file format expert but to give you enough mental model to recognize what is going wrong when a simple read fails.
Learning objectives
By the end of this chapter, you should be able to:
- Pick the right file format for a given situation (and explain why).
- Read a CSV reliably, controlling the delimiter, quoting, encoding, missing-value sentinels, and dtype inference.
- Diagnose and fix the three most common pd.read_csv failures.
- Read a JSON file into a DataFrame using pd.read_json or json.load, and handle nested structures with pd.json_normalize.
- Read an Excel file, select a specific sheet, and handle merged cells and multi-level headers.
- Explain when Parquet is the right choice and how to read/write it with pd.read_parquet / df.to_parquet.
- Recognize text encoding errors (UnicodeDecodeError) and recover from them.
- Follow a short checklist to validate that a file was loaded correctly before you start analysis.
Running theme: never trust a file you just read
Always verify shape, columns, dtypes, and head/tail of a DataFrame in the cell right after read_*. Most data bugs are actually loading bugs in disguise.
20.1 CSV: the workhorse, and its quirks
A CSV file is a plain-text file where each line is a row and columns are separated by a delimiter — traditionally a comma. CSV is the most common data format you will encounter because it is human-readable, supported everywhere, and easy to generate from any tool.
It is also the most common source of data-loading bugs, for exactly the same reasons: “plain text” and “human-readable” leave a lot of room for inconsistency.
The simple case
import pandas as pd
df = pd.read_csv("sales.csv")
When this works, it works great. It rarely works the first time on a file you did not generate yourself. Here are the things pd.read_csv tries to do automatically and the parameters you use when it guesses wrong.
Delimiters other than comma
Many files use tabs, semicolons, or pipes.
df = pd.read_csv("sales.tsv", sep="\t") # tab-separated
df = pd.read_csv("european.csv", sep=";") # common in European data
df = pd.read_csv("weird.csv", sep="|") # pipe-separated
A file with a .csv extension is not guaranteed to be comma-separated. Open the file in a text editor first and look at the first line. If columns are separated by something other than commas, pass sep= explicitly.
You can also let pandas sniff:
df = pd.read_csv("unknown.csv", sep=None, engine="python")
But sniffing is slower and occasionally wrong; it is a last resort, not a habit.
Encoding: UnicodeDecodeError
You will eventually hit:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 42: invalid continuation byte
This means the file is not UTF-8. Data from Excel on Windows is often Latin-1 (a.k.a. cp1252). Try:
df = pd.read_csv("legacy.csv", encoding="latin-1")
If that fails, try encoding="cp1252", then encoding="utf-16". If none work, the file is probably binary (it is not actually a CSV) or corrupted.
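If you do not know the encoding in advance, that trial-and-error chain can be written as a small fallback loop. This is a sketch, not a pandas feature — the helper name read_csv_any_encoding is invented here:

```python
import pandas as pd

def read_csv_any_encoding(path):
    """Hypothetical helper: try encodings in order until one decodes.

    latin-1 goes last because it accepts every possible byte, so any
    encoding listed after it would never be reached.
    """
    for enc in ("utf-8", "cp1252", "latin-1"):
        try:
            return pd.read_csv(path, encoding=enc)
        except (UnicodeDecodeError, UnicodeError):
            continue
    raise ValueError(f"could not decode {path!r} with any known encoding")
```

A loop like this papers over the real question — what encoding the file actually uses — so if you use it, at least log which encoding succeeded.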
Missing values: the many faces of “nothing”
Data comes with all kinds of sentinels for “this value is missing”:
- "" (empty string)
- NA, N/A, n/a
- NULL, null
- -, --
- 999, -999, 9999 (the “numeric sentinel” tradition)
- blank cells
pd.read_csv treats a small set (NA, N/A, NULL, empty) as missing by default, but not all. The rest will sneak in as strings and silently poison any numeric operation (see the ValueError: could not convert string to float example in Chapter 7).
Fix it at read time:
df = pd.read_csv("survey.csv", na_values=["-", "--", "999", "n/a", "unknown"])
Dtypes: the “everything is a string” trap
If a single “bad” cell sneaks into a numeric column — maybe a footnote or an accidental space — pandas will read the entire column as object (string). You discover this later when df["revenue"].sum() concatenates strings instead of adding numbers.
Two fixes:
# Option 1: handle missing values at read time, as above
df = pd.read_csv("revenue.csv", na_values=["-", ""])
# Option 2: coerce after loading
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
errors="coerce" turns unconvertible cells into NaN, which you can then inspect with df["revenue"].isna().sum().
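Coercion also lets you see exactly which raw values broke the column: select the original strings where the coerced version is NaN. A minimal sketch with hypothetical values:

```python
import pandas as pd

# hypothetical raw column as read from a messy CSV
raw = pd.Series(["100", "2 500", "300", "n/a", "400"])

nums = pd.to_numeric(raw, errors="coerce")   # unconvertible cells become NaN
bad = raw[nums.isna() & raw.notna()]         # the original strings that failed
print(bad.tolist())                          # ['2 500', 'n/a']
```

Listing the offenders tells you whether to add them to na_values (true sentinels) or to clean them (a stray space, a thousands separator).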
Always check dtypes after loading:
df.dtypes
If a column you expect to be numeric shows as object, you have a hidden non-number somewhere.
Dates
pandas does not parse dates by default; dates come in as strings. Parse them at read time:
df = pd.read_csv("sales.csv", parse_dates=["date"])
If the file uses day-first dates such as 13/01/2024 (13 January) and pandas guesses month-first, you get wrong dates or a parse error. Pass the format explicitly:
df = pd.read_csv(
    "sales.csv",
    parse_dates=["date"],
    date_format="%d/%m/%Y",
)
Headers and index columns
A CSV with no header row:
df = pd.read_csv("data.csv", header=None)A CSV whose first column should become the DataFrame index:
df = pd.read_csv("data.csv", index_col=0)A CSV with metadata garbage above the header (comments, export timestamps, Excel dumps):
df = pd.read_csv("export.csv", skiprows=3) # skip 3 lines, then header
Large files: do not read what you do not need
If a CSV is bigger than your RAM, do not read it all at once.
# Read only the columns you need
df = pd.read_csv("huge.csv", usecols=["date", "store", "revenue"])
# Read in chunks
for chunk in pd.read_csv("huge.csv", chunksize=100_000):
    process(chunk)
For files bigger than ~1 GB, also consider Parquet (see section 20.5) — it is faster and smaller.
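The chunked pattern works for any aggregate you can accumulate across pieces. A self-contained sketch — it writes a small stand-in file first, since a genuinely huge CSV would not fit in an example:

```python
import pandas as pd

# A small stand-in for a file too big to load at once.
pd.DataFrame({"revenue": range(1_000)}).to_csv("demo_sales.csv", index=False)

# Aggregate chunk by chunk: only ~300 rows are in memory at a time.
total = 0
for chunk in pd.read_csv("demo_sales.csv", usecols=["revenue"], chunksize=300):
    total += chunk["revenue"].sum()

print(total)  # 499500, the same as summing the whole column at once
```

Sums, counts, and min/max accumulate cleanly this way; medians and other quantiles do not, which is one more reason to reach for Parquet when a dataset outgrows RAM.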
20.2 TSV and other delimited formats
A TSV file is a CSV file where the delimiter is a tab. TSVs are marginally better than CSVs because tabs are very rarely used inside cell values, so quoting is less important. Read them with:
df = pd.read_csv("data.tsv", sep="\t")
Everything else from the CSV section applies.
20.3 JSON: when the structure is nested
JSON is the dominant format for data returned from web APIs. Unlike CSV, JSON supports nested data — a record can have fields whose values are themselves objects or arrays.
Flat JSON
If the file is a JSON array of flat objects, pd.read_json reads it directly:
df = pd.read_json("users.json")
Nested JSON
Real-world API responses are rarely flat. A typical response looks like:
{
  "data": [
    {"id": 1, "name": "Alice", "address": {"city": "Boulder", "zip": "80301"}},
    {"id": 2, "name": "Bob", "address": {"city": "Denver", "zip": "80202"}}
  ],
  "meta": {"page": 1, "total": 2}
}
Load it as Python data first, then flatten the list you actually want:
import json
import pandas as pd
with open("users.json") as f:
    payload = json.load(f)
df = pd.json_normalize(payload["data"])
# columns: id, name, address.city, address.zip
pd.json_normalize expands nested dictionaries into dot-separated column names, which is usually what you want.
Newline-delimited JSON (NDJSON / JSONL)
Some data sources write one JSON object per line:
{"id": 1, "name": "Alice"}
{"id": 2, "name": "Bob"}
Read it with:
df = pd.read_json("events.jsonl", lines=True)
JSON encoding gotchas
- JSON strings must be double-quoted. Files with single quotes will fail with json.decoder.JSONDecodeError.
- Trailing commas are not allowed in strict JSON. Some tools emit them; you may need to pre-process.
- JSON has no native date type — dates are always strings. Convert after loading with pd.to_datetime(df["date"]).
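The first and last gotchas are easy to demonstrate in a few lines — single quotes are rejected outright, and a date stays a string until you convert it:

```python
import json
import pandas as pd

# Single quotes are not valid JSON:
try:
    json.loads("{'id': 1}")
except json.JSONDecodeError:
    print("single quotes rejected")

# Dates arrive as plain strings; convert after loading:
records = json.loads('[{"id": 1, "date": "2024-01-15"}]')
df = pd.DataFrame(records)
df["date"] = pd.to_datetime(df["date"])
print(df["date"].dtype)  # datetime64[ns]
```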
20.4 Excel: the format you cannot escape
Excel files (.xlsx, .xls) are not plain text. They are binary (or XML-in-a-zip for .xlsx) and contain one or more sheets, each with its own rows, columns, formulas, formatting, and occasional merged cells. pandas reads them via openpyxl (which you may need to install):
python -m pip install openpyxl
Basic usage
df = pd.read_excel("report.xlsx") # reads the first sheet
Multiple sheets
df = pd.read_excel("report.xlsx", sheet_name="Sales") # one sheet by name
df = pd.read_excel("report.xlsx", sheet_name=0) # one sheet by index
sheets = pd.read_excel("report.xlsx", sheet_name=None) # dict of all sheets
sheets = pd.read_excel("report.xlsx", sheet_name=["Sales", "Q4"]) # dict of some sheets
When you pass sheet_name=None, you get a dict mapping sheet name to DataFrame. Iterate or pick.
Dealing with messy Excel layouts
Excel files are often designed for humans, not computers. Common issues:
Header rows buried below a title and blank rows. Use skiprows=:
df = pd.read_excel("report.xlsx", skiprows=4)
Multi-level column headers (two header rows for grouped columns):
df = pd.read_excel("report.xlsx", header=[0, 1])
Merged cells. pandas unmerges them and fills only the top-left cell with the value; other cells become NaN. You may need to forward-fill: df["group"] = df["group"].ffill().
Formulas. pandas reads the last-computed value of a formula cell, not the formula. If Excel did not compute the formulas before saving, you get NaN.
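Here is what the merged-cell fix looks like on a hypothetical frame shaped the way pandas delivers merged cells (the column names and values are invented for the demo):

```python
import pandas as pd

# After reading merged cells: the value lands in the top-left cell of
# each merge, and the rest of the merged range becomes NaN.
df = pd.DataFrame({
    "region": ["West", None, None, "East", None],
    "store":  ["A", "B", "C", "D", "E"],
})

df["region"] = df["region"].ffill()  # propagate each merged value downward
print(df["region"].tolist())  # ['West', 'West', 'West', 'East', 'East']
```

Only forward-fill columns you know came from vertically merged cells — ffill on a column with genuinely missing values invents data.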
Save back to Excel
df.to_excel("cleaned.xlsx", index=False)
Always pass index=False unless you explicitly want the row index as a column.
20.5 Parquet: the format for real data work
Parquet is a binary, columnar format designed for analytical data. It is the right answer for datasets bigger than a few hundred MB or for data you load repeatedly. You will need pyarrow installed:
python -m pip install pyarrow
Reading and writing
df = pd.read_parquet("sales.parquet")
df.to_parquet("sales.parquet")
Why Parquet instead of CSV
- Smaller. Parquet compresses columns individually. Expect 5–10× smaller than the equivalent CSV.
- Faster. Reading a Parquet file is 10–50× faster than reading a CSV of the same data, because it is binary and column-oriented.
- Dtype-preserving. Parquet stores dtypes natively — no more “why did my date column become a string?”
- Column selection is free. pd.read_parquet("f.parquet", columns=["date", "revenue"]) reads only those columns from disk. CSVs must read the whole file and then drop columns in memory.
When not to use Parquet
- You need to open the file in a spreadsheet or text editor. Parquet is binary, so editors show garbage.
- You are handing the file to a collaborator who does not have pyarrow installed.
- The dataset is tiny (a few hundred rows) — the CSV is fine and everyone understands it.
A common pattern: keep the raw data in CSV (or whatever you received), convert to Parquet once at the start of your pipeline, and do all subsequent reads from Parquet.
raw = pd.read_csv("raw/sales.csv", parse_dates=["date"], na_values=["-", "N/A"])
raw.to_parquet("intermediate/sales.parquet")
# From now on, every notebook starts with:
df = pd.read_parquet("intermediate/sales.parquet")
20.6 Text encoding in general
If you remember only one thing about encoding, remember this: encoding is the rule for turning bytes into characters. UTF-8 is the standard for modern text and should be your default. Files that are not UTF-8 are usually Latin-1 (or cp1252 on Windows), which covers Western European characters but not Greek, Cyrillic, Chinese, etc.
To check a file’s encoding, on macOS/Linux:
file -i data.csv
In Python, if you are unsure:
with open("data.csv", "rb") as f:
    raw = f.read(1000)
raw[:100] # look at the raw bytes
Unusual bytes like \xe9 (é in Latin-1, invalid in UTF-8) are the telltale of a non-UTF-8 file.
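You can see both behaviors by decoding the same bytes two ways — Latin-1 happily maps \xe9 to é, while UTF-8 rejects it:

```python
raw = b"caf\xe9"  # the Latin-1 bytes for "café"

print(raw.decode("latin-1"))  # café

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not UTF-8:", exc.reason)
```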
When you write a file yourself, always write UTF-8:
df.to_csv("out.csv", index=False, encoding="utf-8")
20.7 Stakes and politics
Data file formats look like neutral containers, but each one encodes a worldview about what data is and who it is for. Three things to notice. First, what counts as “tabular.” CSV, Excel, and Parquet all assume your data fits naturally into rows and columns of fixed width — the worldview of accountants, statisticians, and relational databases. JSON and XML allow nesting and so admit shapes (trees, graphs, ragged records) the tabular formats cannot represent without flattening. Whichever format you pick, you are pre-committing to a shape, and information that does not fit gets lost or distorted in the conversion.
Second, whose languages and characters were the spec written for. CSV’s RFC 4180 says nothing about encoding; in practice “CSV” still means CP1252 on many Windows machines, Latin-1 on some legacy Unix systems, and UTF-8 on the rest, and a name with a ñ or a ü round-trips correctly only when every step in the chain agrees. The 2010s convergence on UTF-8 is real progress — but a non-trivial amount of public-sector and corporate data still ships in encodings designed for English and Western European text, with predictable consequences for everyone else’s names. Third, who can read each format without paying. Excel’s .xlsx is a documented Open XML standard, but the most reliable readers are Microsoft’s; Parquet is open but mostly written and read by tools tied to Hadoop/Spark/cloud-warehouse pipelines that smaller teams cannot afford to run.
See Chapter 8 for the broader framework. The concrete prompt to carry forward: when you choose or accept a data format, ask whose data shapes it serves cleanly and whose it forces you to mangle.
20.8 Worked examples
A “normal” CSV that is not normal
You run:
df = pd.read_csv("sales.csv")
and get:
ParserError: Error tokenizing data. C error: Expected 5 fields in line 237, saw 6
Diagnosis: line 237 has an extra field. Usually this means a cell in that row contains a comma that was not properly quoted, or the CSV has an inconsistent number of columns per row.
Fix: open the file in a text editor, jump to line 237, and inspect. You will probably find a value like "Acme, Inc." that was written without quotes. If you cannot fix the source, read with a more forgiving parser:
df = pd.read_csv("sales.csv", engine="python", on_bad_lines="warn")
This warns about each bad line and skips it — better than dropping the lines silently (on_bad_lines="skip"), but worse than fixing the source.
Silent missing-value corruption
You load a temperature dataset:
df = pd.read_csv("weather.csv")
df["temp_c"].mean()and get:
ValueError: could not convert string to float: '-999'
Diagnosis: -999 is a common “no reading” sentinel in weather data. pandas read the column as string because of it.
Fix:
df = pd.read_csv("weather.csv", na_values=["-999"])
df["temp_c"].mean() # now works, missing values are NaN
Nested JSON from a web API
You downloaded a GitHub API response:
import json
with open("repos.json") as f:
    data = json.load(f)
df = pd.json_normalize(data, record_path="items", meta=["total_count"])
record_path="items" tells pandas which nested list to flatten into rows; meta=["total_count"] adds the top-level metadata as extra columns on every row.
20.9 Templates
A defensive read_csv that handles most of the common quirks:
df = pd.read_csv(
    "data.csv",
    sep=",",
    encoding="utf-8",
    na_values=["-", "--", "N/A", "NA", "n/a", "NULL", "null", "999", "-999"],
    parse_dates=["date"],     # adjust to your date columns
    dtype={"sku": "string"},  # force string dtype for columns you know are strings
)
A validation snippet to run after every load:
print("shape:", df.shape)
print("columns:", df.columns.tolist())
print(df.dtypes)
print(df.head())
print("nulls per column:")
print(df.isna().sum())
20.10 Exercises
- Take a CSV file from a real data source (a government open-data portal, a Kaggle dataset, or your course). Open it in a text editor, note the delimiter, the header row, and any obviously-missing-value sentinels. Then load it with pd.read_csv, passing the correct parameters the first time.
- Deliberately save a CSV with encoding="latin-1" (e.g., a file with accented characters). Try to read it with the default UTF-8 and observe the UnicodeDecodeError. Then read it correctly.
- Find a dataset with numeric sentinels for missing values (temperature, survey responses). Load it both ways — once without na_values, once with — and run .describe() on both. Compare.
- Load a nested JSON file (e.g., a GitHub API response or a tweet dump). Use pd.json_normalize to flatten the records you care about.
- Load an Excel file with multiple sheets. Use sheet_name=None to get a dict, then loop over it to print the shape of each sheet.
- Convert a CSV you use often into Parquet. Compare file sizes and load times (%time df = pd.read_csv(...) vs %time df = pd.read_parquet(...)).
- Write the “validate after load” snippet from section 20.9 as a reusable function validate(df) that prints the report. Put it in a module you can import from any notebook.
20.11 One-page checklist
- Open unfamiliar CSVs in a text editor first; note the delimiter, encoding, and header layout.
- Default to UTF-8 encoding; fall back to Latin-1 / cp1252 only when needed.
- Pass na_values= liberally; domain data has its own missing-value sentinels.
- Use parse_dates= at read time instead of converting later.
- Check df.shape, df.columns, df.dtypes, and df.head() in the cell right after every read_*.
- If a numeric column shows up as object, you have hidden strings. Use pd.to_numeric(..., errors="coerce") to find them.
- Use Parquet for intermediate files and anything over ~100 MB.
- Always pass index=False when writing a CSV or Excel file unless you want the row index as a column.
- When in doubt, df.head() and df.tail() and trust your eyes over your assumptions.
Further reading
- pandas, read_csv reference — the complete parameter list with notes on each option; the na_values, parse_dates, dtype, and encoding arguments alone reward a careful read.
- IETF, RFC 4180: Common Format and MIME Type for CSV Files — the short, official CSV specification; useful for settling arguments about quoting and line endings.
- JSON specification (json.org) — the canonical explanation of JSON with syntax diagrams.
- Apache, Parquet documentation — the columnar format used for any dataset large enough that CSV becomes painful; the file-format spec is short and worth skimming.
- Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets — a 2003 essay that is still the best 30-minute introduction to encodings; required reading after your first UnicodeDecodeError.
- ECMA International, Office Open XML File Formats (.xlsx) — the standard underneath Excel files; useful when you want to know what is actually in a .xlsx file (a zip of XML).
- Tom Augspurger and others, Modern Pandas: Tidy Data and File I/O — practitioner-focused notes on reading messy real-world tabular data; pairs well with the read_csv reference.