20  Data File Formats

Prerequisites (read first if unfamiliar): Chapter 10, Chapter 17.

See also: Chapter 16, Chapter 14, Chapter 7.

Purpose

(Meme: “CSV, JSON, Parquet… what is this?”)

A huge fraction of your time in a first data science course is spent moving data from a file into a pandas DataFrame. On paper, this is a one-liner: pd.read_csv("data.csv"). In practice, real-world files introduce an astonishing number of ways that simple line can go wrong — wrong delimiter, broken encoding, missing-value sentinels that sneak in as strings, dates parsed as objects, silently duplicated rows, multi-level headers, Excel sheets that aren’t sheets, JSON that is almost but not quite valid.

This chapter is a field guide to the five file formats you will meet most often as a novice data scientist — CSV, TSV, JSON, Excel, and Parquet — with the specific quirks of each, the pandas reader invocation that will actually work, and the debugging steps for when it does not. The goal is not to make you a file format expert but to give you enough mental model to recognize what is going wrong when a simple read fails.

Learning objectives

By the end of this chapter, you should be able to:

  1. Pick the right file format for a given situation (and explain why).
  2. Read a CSV reliably, controlling the delimiter, quoting, encoding, missing-value sentinels, and dtype inference.
  3. Diagnose and fix the three most common pd.read_csv failures.
  4. Read a JSON file into a DataFrame using pd.read_json or json.load, and handle nested structures with pd.json_normalize.
  5. Read an Excel file, select a specific sheet, and handle merged cells and multi-level headers.
  6. Explain when Parquet is the right choice and how to read/write it with pd.read_parquet / df.to_parquet.
  7. Recognize text encoding errors (UnicodeDecodeError) and recover from them.
  8. Follow a short checklist to validate that a file was loaded correctly before you start analysis.

Running theme: never trust a file you just read

Always verify shape, columns, dtypes, and head/tail of a DataFrame in the cell right after read_*. Most data bugs are actually loading bugs in disguise.
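
A minimal version of the habit (section 20.9 has a fuller template; the file name is a placeholder):

import pandas as pd

df = pd.read_csv("data.csv")
print(df.shape)      # rows, columns: is it the size you expected?
print(df.dtypes)     # a numeric column showing as object is a red flag
df.head()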

20.1 CSV: the workhorse, and its quirks

A CSV file is a plain-text file where each line is a row and columns are separated by a delimiter — traditionally a comma. CSV is the most common data format you will encounter because it is human-readable, supported everywhere, and easy to generate from any tool.

It is also the most common source of data-loading bugs, for exactly the same reasons: “plain text” and “human-readable” leave a lot of room for inconsistency.

The simple case

import pandas as pd
df = pd.read_csv("sales.csv")

When this works, it works great. It rarely works the first time on a file you did not generate yourself. Here are the things pd.read_csv tries to do automatically and the parameters you use when it guesses wrong.

Delimiters other than comma

Many files use tabs, semicolons, or pipes.

df = pd.read_csv("sales.tsv", sep="\t")          # tab-separated
df = pd.read_csv("european.csv", sep=";")        # common in European data
df = pd.read_csv("weird.csv", sep="|")           # pipe-separated

A file with a .csv extension is not guaranteed to be comma-separated. Open the file in a text editor first and look at the first line. If columns are separated by something other than commas, pass sep= explicitly.
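
A quick way to peek at the first lines without leaving Python (the file name is a placeholder):

with open("sales.csv", encoding="utf-8", errors="replace") as f:
    for _ in range(3):
        print(repr(f.readline()))   # repr() makes tabs (\t) and stray characters visible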

You can also let pandas sniff:

df = pd.read_csv("unknown.csv", sep=None, engine="python")

but sniffing is slower and occasionally wrong; it is a last resort, not a habit.

Encoding: UnicodeDecodeError

You will eventually hit:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 42: invalid continuation byte

This means the file is not UTF-8. Data from Excel on Windows is often cp1252 (a close relative of Latin-1). Try:

df = pd.read_csv("legacy.csv", encoding="latin-1")

If that fails, try encoding="cp1252", then encoding="utf-16". If none work, the file is probably binary (it is not actually a CSV) or corrupted.
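
A last-resort trial loop, sketched under the assumption that the file is one of the usual suspects. Note that latin-1 decodes any byte sequence, so it goes last, and its “success” still needs eyeballing:

import pandas as pd

df = None
for enc in ["utf-8", "cp1252", "utf-16", "latin-1"]:    # latin-1 last: it never fails
    try:
        df = pd.read_csv("legacy.csv", encoding=enc)
        print("read OK with encoding =", enc)
        break
    except (UnicodeDecodeError, UnicodeError):
        continue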

Missing values: the many faces of “nothing”

Data comes with all kinds of sentinels for “this value is missing”:

  • "" (empty string)
  • NA, N/A, n/a
  • NULL, null
  • -, --
  • 999, -999, 9999 (the “numeric sentinel” tradition)
  • blank cells

pd.read_csv treats a standard set (NA, N/A, NULL, NaN, the empty string) as missing by default, but nothing beyond it. The rest sneak in as strings and silently poison any numeric operation (see the ValueError: could not convert string to float example in Chapter 7).

Fix it at read time:

df = pd.read_csv("survey.csv", na_values=["-", "--", "999", "n/a", "unknown"])

Dtypes: the “everything is a string” trap

If a single “bad” cell sneaks into a numeric column — maybe a footnote or an accidental space — pandas will read the entire column as object (string). You discover this later when df["revenue"].sum() concatenates strings instead of adding numbers.

Two fixes:

# Option 1: handle missing values at read time, as above
df = pd.read_csv("revenue.csv", na_values=["-", ""])

# Option 2: coerce after loading
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

errors="coerce" turns unconvertible cells into NaN, which you can then inspect with df["revenue"].isna().sum().

Always check dtypes after loading:

df.dtypes

If a column you expect to be numeric shows as object, you have a hidden non-number somewhere.
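
To see exactly which raw values are the culprits, a sketch reusing the revenue example (file and column names assumed):

import pandas as pd

# read the column as strings on purpose, then ask which values refuse to convert
raw = pd.read_csv("revenue.csv", dtype={"revenue": "string"})
numeric = pd.to_numeric(raw["revenue"], errors="coerce")
bad = raw.loc[numeric.isna() & raw["revenue"].notna(), "revenue"]
print(bad.unique())    # the hidden non-numbers, e.g. "1,234" or "n/a"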

Dates

pandas does not parse dates by default; dates come in as strings. Parse them at read time:

df = pd.read_csv("sales.csv", parse_dates=["date"])

If pandas cannot guess the format, or guesses wrong (a day-first date like 13/01/2024 can fail to parse, and an ambiguous 01/02/2024 can be silently read as January 2 instead of February 1), pass the format explicitly:

df = pd.read_csv(
    "sales.csv",
    parse_dates=["date"],
    date_format="%d/%m/%Y",
)

Headers and index columns

A CSV with no header row:

df = pd.read_csv("data.csv", header=None)

A CSV whose first column should become the DataFrame index:

df = pd.read_csv("data.csv", index_col=0)

A CSV with metadata garbage above the header (comments, export timestamps, Excel dumps):

df = pd.read_csv("export.csv", skiprows=3)       # skip 3 lines, then header

Large files: do not read what you do not need

If a CSV is bigger than your RAM, do not read it all at once.

# Read only the columns you need
df = pd.read_csv("huge.csv", usecols=["date", "store", "revenue"])

# Read in chunks
for chunk in pd.read_csv("huge.csv", chunksize=100_000):
    process(chunk)
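
Here process stands in for your own function. One concrete shape of it, a running total computed chunk by chunk (column name assumed):

import pandas as pd

total = 0.0
for chunk in pd.read_csv("huge.csv", usecols=["revenue"], chunksize=100_000):
    total += chunk["revenue"].sum()    # aggregate without holding the whole file in memory
print(total)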

For files beyond a few hundred MB, also consider Parquet (see section 20.5); it is faster and smaller.

20.2 TSV and other delimited formats

A TSV file is a CSV file where the delimiter is a tab. TSVs are marginally better than CSVs because tabs are very rarely used inside cell values, so quoting is less important. Read them with:

df = pd.read_csv("data.tsv", sep="\t")

Everything else from the CSV section applies.

20.3 JSON: when the structure is nested

JSON is the dominant format for data returned from web APIs. Unlike CSV, JSON supports nested data — a record can have fields whose values are themselves objects or arrays.

Flat JSON

If the file is a JSON array of flat objects, pd.read_json reads it directly:

df = pd.read_json("users.json")

Nested JSON

Real-world API responses are rarely flat. A typical response looks like:

{
  "data": [
    {"id": 1, "name": "Alice", "address": {"city": "Boulder", "zip": "80301"}},
    {"id": 2, "name": "Bob",   "address": {"city": "Denver",  "zip": "80202"}}
  ],
  "meta": {"page": 1, "total": 2}
}

Load it as Python data first, then flatten the list you actually want:

import json
import pandas as pd

with open("users.json") as f:
    payload = json.load(f)

df = pd.json_normalize(payload["data"])
# columns: id, name, address.city, address.zip

pd.json_normalize expands nested dictionaries into dot-separated column names, which is usually what you want.

Newline-delimited JSON (NDJSON / JSONL)

Some data sources write one JSON object per line:

{"id": 1, "name": "Alice"}
{"id": 2, "name": "Bob"}

Read it with:

df = pd.read_json("events.jsonl", lines=True)

JSON encoding gotchas

  • JSON strings must be double-quoted. Files with single quotes will fail with json.decoder.JSONDecodeError (see the sketch after this list).
  • Trailing commas are not allowed in strict JSON. Some tools emit them; you may need to pre-process.
  • JSON has no native date type — dates are always strings. Convert after loading with pd.to_datetime(df["date"]).
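
Strict-JSON failures like the first two surface as json.JSONDecodeError, which tells you where the parser gave up. A defensive load (the file name is a placeholder):

import json

try:
    with open("maybe.json") as f:
        payload = json.load(f)
except json.JSONDecodeError as e:
    print("not valid JSON:", e)    # e.lineno and e.colno point at the offending spot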

20.4 Excel: the format you cannot escape

Excel files (.xlsx, .xls) are not plain text. They are binary (or XML-in-a-zip for .xlsx) and contain one or more sheets, each with its own rows, columns, formulas, formatting, and occasional merged cells. pandas reads .xlsx via openpyxl, which you may need to install (legacy .xls files need the separate xlrd package):

python -m pip install openpyxl

Basic usage

df = pd.read_excel("report.xlsx")              # reads the first sheet

Multiple sheets

df = pd.read_excel("report.xlsx", sheet_name="Sales")           # one sheet by name
df = pd.read_excel("report.xlsx", sheet_name=0)                 # one sheet by index
sheets = pd.read_excel("report.xlsx", sheet_name=None)          # dict of all sheets
sheets = pd.read_excel("report.xlsx", sheet_name=["Sales","Q4"])# dict of some sheets

When you pass sheet_name=None, you get a dict mapping sheet name to DataFrame. Iterate or pick.

Dealing with messy Excel layouts

Excel files are often designed for humans, not computers. Common issues:

  • Header rows buried below a title and blank rows. Use skiprows=:

    df = pd.read_excel("report.xlsx", skiprows=4)
  • Multi-level column headers (two header rows for grouped columns):

    df = pd.read_excel("report.xlsx", header=[0, 1])
  • Merged cells. pandas unmerges them and fills only the top-left cell with the value; other cells become NaN. You may need to forward-fill: df["group"] = df["group"].ffill().

  • Formulas. pandas reads the last-computed value of a formula cell, not the formula. If Excel did not compute the formulas before saving, you get NaN.

Save back to Excel

df.to_excel("cleaned.xlsx", index=False)

Always pass index=False unless you explicitly want the row index as a column.
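
If you need several DataFrames in one workbook, pd.ExcelWriter handles it; df_sales and df_costs below are placeholders for your own frames:

import pandas as pd

with pd.ExcelWriter("cleaned.xlsx") as writer:    # one file, many sheets
    df_sales.to_excel(writer, sheet_name="Sales", index=False)
    df_costs.to_excel(writer, sheet_name="Costs", index=False)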

20.5 Parquet: the format for real data work

Parquet is a binary, columnar format designed for analytical data. It is the right answer for datasets bigger than a few hundred MB or for data you load repeatedly. You will need pyarrow installed:

python -m pip install pyarrow

Reading and writing

df = pd.read_parquet("sales.parquet")
df.to_parquet("sales.parquet")

Why Parquet instead of CSV

  • Smaller. Parquet compresses columns individually. Expect 5–10× smaller than the equivalent CSV.
  • Faster. Reading a Parquet file is 10–50× faster than reading a CSV of the same data, because it is binary and column-oriented.
  • Dtype-preserving. Parquet stores dtypes natively — no more “why did my date column become a string?”
  • Column selection is free. pd.read_parquet("f.parquet", columns=["date", "revenue"]) reads only those columns from disk. CSVs must read the whole file and then drop columns in memory.

When not to use Parquet

  • You need to open the file in a spreadsheet or text editor. Parquet is binary, so editors show garbage.
  • You are handing the file to a collaborator who does not have pyarrow installed.
  • The dataset is tiny (a few hundred rows) — the CSV is fine and everyone understands it.

A common pattern: keep the raw data in CSV (or whatever you received), convert to Parquet once at the start of your pipeline, and do all subsequent reads from Parquet.

raw = pd.read_csv("raw/sales.csv", parse_dates=["date"], na_values=["-", "N/A"])
raw.to_parquet("intermediate/sales.parquet")

# From now on, every notebook starts with:
df = pd.read_parquet("intermediate/sales.parquet")

20.6 Text encoding in general

If you remember only one thing about encoding, remember this: encoding is the rule for turning bytes into characters. UTF-8 is the standard for modern text and should be your default. Files that are not UTF-8 are usually Latin-1 (or cp1252 on Windows), which covers Western European characters but not Greek, Cyrillic, Chinese, etc.
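
The whole problem in three lines of Python:

"café".encode("utf-8")       # b'caf\xc3\xa9'  (é is two bytes in UTF-8)
"café".encode("latin-1")     # b'caf\xe9'      (one byte in Latin-1)
b"caf\xe9".decode("utf-8")   # UnicodeDecodeError: invalid continuation byte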

To check a file’s encoding on Linux:

file -i data.csv          # on macOS, use a capital I: file -I data.csv

In Python, if you are unsure:

with open("data.csv", "rb") as f:
    raw = f.read(1000)
raw[:100]   # look at the raw bytes

Unusual bytes like \xe9 (é in Latin-1, invalid in UTF-8) are the telltale of a non-UTF-8 file.
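
If eyeballing bytes is not enough, a detection library can guess for you. A sketch, assuming the third-party chardet package is installed (charset_normalizer is a similar option):

import chardet

with open("data.csv", "rb") as f:
    sample = f.read(100_000)        # a sample is enough for detection

print(chardet.detect(sample))       # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}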

When you write a file yourself, always write UTF-8:

df.to_csv("out.csv", index=False, encoding="utf-8")

20.7 Stakes and politics

Data file formats look like neutral containers, but each one encodes a worldview about what data is and who it is for. Three things to notice. First, what counts as “tabular.” CSV, Excel, and Parquet all assume your data fits naturally into rows and columns of fixed width — the worldview of accountants, statisticians, and relational databases. JSON and XML allow nesting and so admit shapes (trees, graphs, ragged records) the tabular formats cannot represent without flattening. Whichever format you pick, you are pre-committing to a shape, and information that does not fit gets lost or distorted in the conversion.

Second, whose languages and characters were the spec written for. CSV’s RFC 4180 says nothing about encoding; in practice “CSV” still means CP1252 on many Windows machines, Latin-1 on some legacy Unix systems, and UTF-8 on the rest, and a name with a ñ or a ü round-trips correctly only when every step in the chain agrees. The 2010s convergence on UTF-8 is real progress — but a non-trivial amount of public-sector and corporate data still ships in encodings designed for English and Western European text, with predictable consequences for everyone else’s names. Third, who can read each format without paying. Excel’s .xlsx is a documented Open XML standard, but the most reliable readers are Microsoft’s; Parquet is open but mostly written and read by tools tied to Hadoop/Spark/cloud-warehouse pipelines that smaller teams cannot afford to run.

See Chapter 8 for the broader framework. The concrete prompt to carry forward: when you choose or accept a data format, ask whose data shapes it serves cleanly and whose it forces you to mangle.

20.8 Worked examples

A “normal” CSV that is not normal

You run:

df = pd.read_csv("sales.csv")

and get:

ParserError: Error tokenizing data. C error: Expected 5 fields in line 237, saw 6

Diagnosis: line 237 has an extra field. Usually this means a cell in that row contains a comma that was not properly quoted, or the CSV has an inconsistent number of columns per row.

Fix: open the file in a text editor, jump to line 237, and inspect. You will probably find a value like "Acme, Inc." that was written without quotes. If you cannot fix the source, read with a more forgiving parser:

df = pd.read_csv("sales.csv", engine="python", on_bad_lines="warn")

This warns about each bad line and skips it, which beats a hard failure but quietly drops rows; fix the source whenever you can.

Silent missing-value corruption

You load a temperature dataset:

df = pd.read_csv("weather.csv")
df["temp_c"].mean()

and the mean comes back far below any plausible temperature, with no error at all.

Diagnosis: -999 is a common “no reading” sentinel in weather data. Because -999 is a perfectly valid number, pandas reads the column as numeric and the sentinels silently drag the mean down. A df["temp_c"].min() of exactly -999.0 is the giveaway.

Fix:

df = pd.read_csv("weather.csv", na_values=["-999"])
df["temp_c"].mean()   # now correct: the sentinels are NaN and excluded

Nested JSON from a web API

You downloaded a GitHub API response:

import json
with open("repos.json") as f:
    data = json.load(f)

df = pd.json_normalize(data, record_path="items", meta=["total_count"])

record_path="items" tells pandas which nested list to flatten into rows; meta=["total_count"] adds the top-level metadata as extra columns on every row.

Excel with a four-row banner

df = pd.read_excel(
    "quarterly_report.xlsx",
    sheet_name="Q4 Sales",
    skiprows=4,                  # skip the title, date, and blank rows
    header=0,                    # first row after skipping is the header
    usecols="B:G",               # only these columns (Excel letters!)
)

20.9 Templates

A defensive read_csv that handles most of the common quirks:

df = pd.read_csv(
    "data.csv",
    sep=",",
    encoding="utf-8",
    na_values=["-", "--", "N/A", "NA", "n/a",
               "NULL", "null", "999", "-999"],  # adjust: 999 may be a real value in your domain!
    parse_dates=["date"],          # adjust to your date columns
    dtype={"sku": "string"},       # force string dtype for columns you know are strings
)

A validation snippet to run after every load:

print("shape:", df.shape)
print("columns:", df.columns.tolist())
print(df.dtypes)
print(df.head())
print("nulls per column:")
print(df.isna().sum())

20.10 Exercises

  1. Take a CSV file from a real data source (a government open-data portal, a Kaggle dataset, or your course). Open it in a text editor, note the delimiter, the header row, and any obviously-missing-value sentinels. Then load it with pd.read_csv, passing the correct parameters the first time.
  2. Deliberately save a CSV with encoding="latin-1" (e.g., a file with accented characters). Try to read it with the default UTF-8 and observe the UnicodeDecodeError. Then read it correctly.
  3. Find a dataset with numeric sentinels for missing values (temperature, survey responses). Load it both ways — once without na_values, once with — and run .describe() on both. Compare.
  4. Load a nested JSON file (e.g., a GitHub API response or a tweet dump). Use pd.json_normalize to flatten the records you care about.
  5. Load an Excel file with multiple sheets. Use sheet_name=None to get a dict, then loop over it to print the shape of each sheet.
  6. Convert a CSV you use often into Parquet. Compare file sizes and load times (%time df = pd.read_csv(...) vs %time df = pd.read_parquet(...)).
  7. Write the “validate after load” snippet from section 20.9 as a reusable function validate(df) that prints the report. Put it in a module you can import from any notebook.

20.11 One-page checklist

  • Open unfamiliar CSVs in a text editor first; note the delimiter, encoding, and header layout.
  • Default to UTF-8 encoding; fall back to Latin-1 / cp1252 only when needed.
  • Pass na_values= liberally; domain data has its own missing-value sentinels.
  • Use parse_dates= at read time instead of converting later.
  • Check df.shape, df.columns, df.dtypes, and df.head() in the cell right after every read_*.
  • If a numeric column shows up as object, you have hidden strings. Use pd.to_numeric(..., errors="coerce") to find them.
  • Use Parquet for intermediate files and anything over ~100 MB.
  • Always pass index=False when writing a CSV or Excel file unless you want the row index as a column.
  • When in doubt, run df.head() and df.tail(), and trust your eyes over your assumptions.