24  HTTP and Web APIs

Prerequisites (read first if unfamiliar): Chapter 14, Chapter 17.

See also: Chapter 20, Chapter 34, Chapter 7.

Purpose

Hagrid Meme: Tried to scrape a website, got the entire campus blocked.

A surprising amount of real data science starts with fetching data from a URL. Weather records, stock prices, GitHub issues, Wikipedia edits, census statistics, and a long tail of research datasets live behind HTTP APIs — you send a request, the server sends back JSON or CSV, and you parse it into a DataFrame. Knowing how to do this cleanly is the difference between “I could only use the datasets my instructor handed me” and “I can get data for any project I care about.”

This chapter covers the minimum you need to fetch data over HTTP responsibly and reliably: what an HTTP request is, how to use the requests library, how to deal with JSON responses, how to handle errors and rate limits, how to carry an API key without leaking it, and how to be a good citizen on someone else’s server. You will not leave this chapter a backend engineer — that is a whole different discipline — but you will leave able to fetch data from 90% of the APIs you encounter as a student.

Learning objectives

By the end of this chapter, you should be able to:

  1. Explain what HTTP is in one paragraph, and name the four verbs you will use most (GET, POST, PUT, DELETE).
  2. Read an HTTP status code and know whether your request succeeded, failed because of you, or failed because of the server.
  3. Use the requests library to make GET and POST requests with query parameters, headers, and a body.
  4. Parse a JSON response into a Python dict and then into a pandas DataFrame.
  5. Pass an API key via a header without committing it to git.
  6. Handle errors gracefully: timeouts, 4xx/5xx responses, rate limits, and network failures.
  7. Respect rate limits, User-Agent headers, and robots.txt.
  8. Recognize when you should use an official SDK, a CSV download, or a database dump instead of an API.

Running theme: the network is slow, broken, and rude — plan for it

Assume every HTTP request can time out, rate-limit you, return garbage, or fail silently. Good HTTP code has timeouts, status checks, and retries. Bad HTTP code crashes halfway through a ten-minute loop.

24.1 HTTP in one paragraph

The web runs on HTTP (Hypertext Transfer Protocol). A client (your Python script, a browser, curl) sends a request to a server: a method (GET, POST, …), a URL (https://api.example.com/v1/users/42), optional headers (metadata), and optional body (data). The server replies with a response: a status code (200 OK, 404 Not Found, 500 Server Error), headers, and a body (usually HTML, JSON, or binary data). That is the whole protocol for our purposes.
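
Concretely, the text that travels over the wire looks roughly like this — a trimmed sketch of a GET exchange (real requests and responses carry many more headers):

GET /repos/pandas-dev/pandas HTTP/1.1
Host: api.github.com
Accept: application/json

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{"full_name": "pandas-dev/pandas", "stargazers_count": ...}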

The four HTTP verbs

You will use these four most often:

Verb Purpose
GET fetch data — “give me this resource”
POST create or submit data — “here is something new”
PUT update a resource in place — “replace this with that”
DELETE delete a resource — “get rid of this”

For data fetching as a student, you will use GET 95% of the time and POST for the handful of APIs that want you to submit a query.

Status codes

Status codes fall into five ranges. The first digit tells you the category:

Range Meaning
1xx informational (rare — you usually don’t see these)
2xx success (200 OK is the standard “all good”)
3xx redirect (the resource moved; requests follows these by default)
4xx your fault — bad request, missing auth, wrong URL
5xx server’s fault — the API is broken or down

Memorize these six:

  • 200 OK — worked.
  • 401 Unauthorized — missing or bad credentials.
  • 403 Forbidden — credentials were accepted but you are not allowed to do that.
  • 404 Not Found — URL is wrong, or the resource does not exist.
  • 429 Too Many Requests — you are being rate-limited. Slow down.
  • 500 Internal Server Error — the server broke. Not your fault; try again later.

24.2 Quick fetches with curl and wget

Before you write any Python, the fastest way to confirm that an API actually works is to call it from the command line. Two tools are universally available: curl and wget. They overlap but are good at different things, and a working knowledge of both pays for itself the first time you have to debug an API at 3 AM with no Python interpreter in sight.

curl for inspecting

curl is a Swiss Army knife for HTTP. The bare invocation curl <url> sends a GET request and prints the response body to your terminal. That is enough for most quick checks:

curl https://api.github.com/repos/pandas-dev/pandas

The flag worth memorizing first is -i, which includes the response headers (status line, content-type, rate-limit info) in the output. When you are debugging “is the API actually responding?” or “what status code did I get?”, -i is the answer:

$ curl -i https://api.github.com/repos/pandas-dev/pandas | head
HTTP/2 200
content-type: application/json; charset=utf-8
x-ratelimit-limit: 60
x-ratelimit-remaining: 58
...

Three more flags handle most situations. -H 'Header: value' adds a request header — use it to send your User-Agent or an Authorization: Bearer <token>. -X POST changes the HTTP method (the default is GET); pair it with -d '...' to send a body. And -o filename writes the response body to a file instead of stdout. Strung together:

curl -X POST https://api.example.com/items \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "widget", "qty": 12}'

Two more flags are worth knowing for scripts. -s silences the progress meter (essential when piping curl into another command), and -f makes curl exit with a nonzero status code on any HTTP 4xx or 5xx response — without it, curl happily writes the server’s HTML error page to your output file and reports success.
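
Combined, a script-friendly fetch looks something like this (the output filename is just an example):

# -s: no progress meter; -f: nonzero exit on 4xx/5xx; -o: write the body to a file
curl -sf -o pandas-repo.json https://api.github.com/repos/pandas-dev/pandas \
  || echo "request failed"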

For pretty-printing JSON responses while you explore, pipe curl into jq, which is the de facto JSON command-line processor:

curl -s https://api.github.com/repos/pandas-dev/pandas | jq '.stargazers_count'

wget for downloading

wget is a download tool first and an HTTP client second. Where curl is built for ad-hoc API inspection, wget is built for “fetch this URL to a file, reliably, possibly across a flaky network.” Two features make it the better choice for bulk downloads.

The first is resumable transfers: wget -c <url> continues a partial download instead of restarting it from byte zero. For a multi-gigabyte dataset on a slow connection, this is the difference between “I have to babysit this all night” and “I can leave it running.”

The second is recursive mirroring: wget -r <url> follows every link on a page and downloads everything it finds, subject to depth and same-host limits. This is occasionally useful for archiving a small documentation site, and it is wildly inappropriate against most modern web infrastructure — see “Respect robots.txt” later in this chapter, and never recursively wget a site you do not own without checking the robots policy first.
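
If you do have permission to mirror something small, limit the damage: cap the recursion depth and pause between requests. A sketch (the URL is a placeholder):

# -l caps recursion depth, --wait pauses between requests,
# -np ("no parent") stays below the starting directory
wget -r -l 2 --wait=1 -np https://docs.example.com/manual/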

For most everyday use, the invocation is short:

wget https://example.com/dataset.csv               # save as dataset.csv
wget -O sales.csv https://example.com/data.csv     # save under a chosen name
wget -c https://example.com/big-file.zip           # resume if interrupted

When to use which

A practical rule of thumb: use curl when you are exploring an API — checking what it returns, testing auth, debugging headers, prototyping a request you will eventually translate into Python. Use wget when you are downloading a file and want resilience features like resume and retry. Use requests (the rest of this chapter) when you are writing code that has to fetch data as part of a larger pipeline. The three tools are not competitors; they are different points on the spectrum from “ad-hoc inspection” to “production code.”

A hidden benefit of starting with curl: many APIs include curl invocations in their documentation as the canonical example. Being able to read those examples directly — instead of mentally translating them into requests first — speeds up your reading of API docs by a noticeable amount.

24.3 The requests library

The requests library is the de facto standard for HTTP in Python. It is not in the standard library, so install it:

python -m pip install requests

The simplest possible use:

Figure 24.1: ALT: Web browser showing the raw JSON response from a public API (for example, https://jsonplaceholder.typicode.com/users/1). The response is a formatted object with name, email, and address fields, illustrating what API data looks like before any Python parsing.

import requests

resp = requests.get("https://api.github.com/repos/pandas-dev/pandas")
print(resp.status_code)        # 200
print(resp.json()["stargazers_count"])

resp.json() parses the JSON body into a Python dict (or list). For non-JSON responses, use resp.text (string) or resp.content (bytes).

The three most common API failures each have a distinctive signature:

  • The request hangs forever — you forgot to pass a timeout= argument to requests.get(), and the default is no timeout. Add timeout=10 to every request.
  • The response has status 401 or 403 — your API key is missing, wrong, or being sent in the wrong header. Check the API’s docs for the exact header name (usually Authorization: Bearer <key> or X-API-Key: <key>) and confirm the key is loaded correctly from your .env file (see Chapter 34).
  • The response has status 429 — the server is rate-limiting you. Slow down: add time.sleep() between requests, or check for a Retry-After header that tells you how long to wait.

For any 4xx error, print(resp.text) before parsing — the server almost always returns a helpful error message in the body that explains what it rejected.
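
As a sketch of that habit (the repository name here is deliberately one that should not exist, so the request returns 404):

resp = requests.get("https://api.github.com/repos/pandas-dev/no-such-repo", timeout=10)
if resp.status_code >= 400:
    print(resp.status_code, resp.text[:200])   # the body usually explains the rejection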

Query parameters

URLs with ?foo=1&bar=2 are how you pass parameters to most GET endpoints. Do not concatenate them by hand — pass a dict:

resp = requests.get(
    "https://api.example.com/search",
    params={"q": "pandas", "limit": 50},
)
# requests builds the final URL: ...search?q=pandas&limit=50

requests handles URL-encoding of special characters automatically, which is important for queries that contain spaces or & or =.
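
If you are curious what the encoded URL looks like, you can build the request without sending it (the endpoint here is a placeholder):

from requests import Request

req = Request(
    "GET",
    "https://api.example.com/search",
    params={"q": "supply & demand", "limit": 50},
)
print(req.prepare().url)
# https://api.example.com/search?q=supply+%26+demand&limit=50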

Headers

Headers are metadata about your request. The two you will use most are Authorization (for API keys / tokens) and User-Agent (for identifying your script politely).

headers = {
    "Authorization": f"Bearer {api_token}",
    "User-Agent": "my-term-project/0.1 (contact: alice@example.com)",
    "Accept": "application/json",
}
resp = requests.get(url, headers=headers)

Many APIs reject requests with no User-Agent or with a default one. Send a descriptive one so server operators can reach you if your script misbehaves.

POST with a body

resp = requests.post(
    "https://api.example.com/users",
    json={"name": "Alice", "email": "alice@example.com"},
)

Using json= serializes the dict into JSON and sets Content-Type: application/json automatically. You rarely need data= (form-encoded) for modern APIs.
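
The difference in one pair of lines (the endpoint is a placeholder):

url = "https://api.example.com/users"               # placeholder endpoint
requests.post(url, json={"name": "Alice"}, timeout=10)   # JSON body; Content-Type: application/json
requests.post(url, data={"name": "Alice"}, timeout=10)   # form body name=Alice; application/x-www-form-urlencoded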

Timeouts are not optional

Always pass a timeout — otherwise a hung server can freeze your script forever.

resp = requests.get(url, timeout=10)   # fail after 10 seconds

10 seconds is reasonable for most APIs. Set it shorter for fast APIs and longer for reports or exports.
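
requests also accepts a (connect, read) pair if you want to fail fast on unreachable hosts while still allowing a slow response body:

resp = requests.get(url, timeout=(3, 30))   # 3 s to establish the connection, 30 s to read the response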

24.4 Checking the response

requests will not raise an exception on a 404 or 500 by default — you have to check. Two common idioms:

# Idiom 1: explicit status check
resp = requests.get(url, timeout=10)
if resp.status_code != 200:
    raise RuntimeError(f"API returned {resp.status_code}: {resp.text[:200]}")

# Idiom 2: raise_for_status (raises on 4xx/5xx)
resp = requests.get(url, timeout=10)
resp.raise_for_status()
data = resp.json()

Use raise_for_status() when you just want to abort on any error. Use the explicit check when you want different behavior for different codes (e.g., retry on 429 but abort on 404).
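
If you prefer raise_for_status() but still want to special-case a code or two, catch the exception it raises — a minimal sketch:

try:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
except requests.HTTPError as err:
    if err.response.status_code == 429:
        ...          # back off and retry (see the rate-limit section below)
    else:
        raise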

24.5 JSON → DataFrame

Most APIs return JSON. For tabular data the pattern is usually:

import pandas as pd
import requests

resp = requests.get(
    "https://api.example.com/v1/sales",
    params={"from": "2024-01-01", "to": "2024-03-31"},
    timeout=10,
)
resp.raise_for_status()
payload = resp.json()

# A flat list of records:
df = pd.DataFrame(payload)

# Or if the list is nested under a key:
df = pd.DataFrame(payload["data"])

# Or, for deeply nested JSON (more in the chapter on data file formats):
df = pd.json_normalize(payload["results"])

Always inspect payload in a REPL or notebook cell before assuming it is shaped like you expect. The first couple times you call a new API, print(payload) and print(type(payload)) are cheaper than wrong code.
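
A quick habit for that first look:

print(type(payload))
if isinstance(payload, dict):
    print(list(payload.keys()))       # which keys wrap the actual records?
elif isinstance(payload, list):
    print(len(payload), payload[:1])  # how many records, and what does one look like?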

24.6 API keys and secrets

Most useful APIs require authentication. The key goes in a header, usually as Authorization: Bearer <token> or in a custom header like X-API-Key.

Never hardcode a key in your script. Never commit one to git. Use an environment variable instead:

import os
import requests

api_key = os.environ["OPENWEATHER_API_KEY"]   # KeyError if missing — good
resp = requests.get(
    "https://api.openweathermap.org/data/2.5/weather",
    params={"q": "Boulder,US", "appid": api_key},
    timeout=10,
)

Store the key in a .env file and load it with python-dotenv. The full workflow — why, how, and how to avoid leaking keys into git — is Chapter 34. Read it before building anything serious.

24.7 Rate limits and being polite

Most APIs limit how fast you can call them — often “N requests per minute” or “N requests per day.” When you exceed the limit, you get a 429 Too Many Requests. Some APIs return a Retry-After header telling you how many seconds to wait.

Simple rate limiting

For a modest script, just sleep between requests:

import time

for item_id in ids:
    resp = requests.get(f"https://api.example.com/items/{item_id}", timeout=10)
    resp.raise_for_status()
    process(resp.json())
    time.sleep(0.2)      # 5 requests per second

Retry on 429 with backoff

import requests
import time

def get_with_retry(url, headers=None, params=None, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, params=params, timeout=10)
        if resp.status_code == 429:
            wait = int(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"Gave up after {max_retries} attempts")

2 ** attempt gives exponential backoff (1s, 2s, 4s, 8s, 16s), which is how polite clients handle overloaded servers.

Respect robots.txt

For web pages (not APIs), check https://example.com/robots.txt before scraping. It tells you which paths the site owner is OK with automated tools hitting. Ignoring it is rude and can get your IP blocked.
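
The standard library can read the file for you. A minimal check (example.com is a placeholder):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()                                             # fetch and parse the file
print(rp.can_fetch("my-term-project/0.1", "https://example.com/some/page"))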

24.8 When not to use requests directly

Before you write a lot of custom API code, check:

  • Does an SDK exist? Many big APIs have an official or de facto standard Python library: GitHub → PyGithub, Slack → slack_sdk, AWS → boto3, Google → google-*. SDKs handle auth, rate limits, pagination, and error types for you. Use them when they exist (a short PyGithub sketch follows this list).
  • Is there a bulk download? Many data sources also offer CSV, Parquet, or SQL dumps of the same data. If you want a snapshot, the dump is usually much faster and easier than hammering the API row by row. See Chapter 20.
  • Is this a one-off? For a single fetch, curl on the command line or a browser download might be faster than a Python script. Save the result to disk and work from there.
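
As a taste of the SDK route, a minimal sketch with PyGithub (a widely used community library for the GitHub API; assumes python -m pip install PyGithub, and an anonymous client is enough for public data at low volume):

from github import Github

gh = Github()                              # anonymous; pass a token for higher rate limits
repo = gh.get_repo("pandas-dev/pandas")
print(repo.stargazers_count)               # auth and pagination handled by the library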

24.9 Stakes and politics

A web API is a controlled door into someone else’s data. The mechanics in this chapter — keys, rate limits, authentication, retries — are the door’s hardware, and noticing them as hardware is most of the political move.

Three things to notice. First, the data provider sets the terms. They choose which fields the API exposes, which historical records remain accessible, what counts as “fair use,” and how often they will deprecate endpoints. When Twitter became X and revoked academic API access in 2023, dozens of long-running research projects collapsed overnight; when Reddit moved to paid API access later that year, third-party clients shut down and the moderation tools many subreddits relied on stopped working. Endpoints are not infrastructure; they are corporate decisions held in place until they are not. Second, rate limits and pricing tiers concentrate access. A free tier that allows 60 requests per hour is enough for a class assignment and useless for any analysis at scale. Paid tiers exist precisely to filter who can ask which questions, and the price is set by the provider’s business model, not by the cost of serving the bytes. Third, the legal gradient between API and scrape is real. APIs come with terms of service that explicitly authorize the access you are doing; web scraping the same data may or may not be lawful depending on the CFAA, the DMCA, the site’s terms, and which jurisdiction you sit in. The technical bar to scraping is low; the legal bar can be unexpectedly high — Aaron Swartz’s case is the cautionary one.

See Chapter 8 for the broader framework, including the CFAA and DMCA history. The concrete prompt to carry forward: when you build a project on someone else’s API, ask what happens when the provider changes the rules — because they will.

24.10 Worked examples

Fetch a GitHub repo’s metadata

import requests
import pandas as pd

resp = requests.get(
    "https://api.github.com/repos/pandas-dev/pandas",
    headers={"User-Agent": "term-project/0.1"},
    timeout=10,
)
resp.raise_for_status()
repo = resp.json()

print(f"{repo['full_name']}: {repo['stargazers_count']:,} stars")
print(f"Last updated: {repo['updated_at']}")

Paginate through results

Most list endpoints return a few dozen items per page and require you to request subsequent pages.

import requests

def list_issues(owner, repo, token):
    url = f"https://api.github.com/repos/{owner}/{repo}/issues"
    headers = {"Authorization": f"Bearer {token}", "User-Agent": "tp/0.1"}
    issues = []
    params = {"state": "all", "per_page": 100, "page": 1}
    while True:
        resp = requests.get(url, headers=headers, params=params, timeout=10)
        resp.raise_for_status()
        page = resp.json()
        if not page:
            break
        issues.extend(page)
        params["page"] += 1
    return issues

Stop when the response is an empty list. Some APIs also return a Link header with rel="next" — you can follow that instead of incrementing a page number.
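
A sketch of the Link-header variant, reusing owner, repo, and the headers dict from the function above — requests exposes the parsed header as resp.links:

url = f"https://api.github.com/repos/{owner}/{repo}/issues?state=all&per_page=100"
issues = []
while url:
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    issues.extend(resp.json())
    url = resp.links.get("next", {}).get("url")   # None when there is no next page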

Handle a flaky weather API

import os, time, requests

API_KEY = os.environ["OPENWEATHER_API_KEY"]

def current_weather(city):
    for attempt in range(3):
        try:
            resp = requests.get(
                "https://api.openweathermap.org/data/2.5/weather",
                params={"q": city, "appid": API_KEY, "units": "metric"},
                timeout=10,
            )
            if resp.status_code == 200:
                return resp.json()
            if resp.status_code == 429:
                time.sleep(int(resp.headers.get("Retry-After", 2 ** attempt)))
                continue
            resp.raise_for_status()
        except requests.Timeout:
            time.sleep(2 ** attempt)
    raise RuntimeError(f"Failed to fetch weather for {city}")

Three layers of defense: status-code checks, timeout handling, and exponential backoff.

24.11 Templates

A defensive GET helper:

import requests

def get_json(url, *, params=None, headers=None, timeout=10):
    default_headers = {
        "User-Agent": "my-project/0.1",
        "Accept": "application/json",
    }
    if headers:
        default_headers.update(headers)
    resp = requests.get(url, params=params, headers=default_headers, timeout=timeout)
    resp.raise_for_status()
    return resp.json()

Import this from every notebook instead of duplicating the boilerplate.
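
Usage, assuming you saved the helper in a module called api_helpers.py (the name is up to you):

from api_helpers import get_json

repo = get_json("https://api.github.com/repos/pandas-dev/pandas")
print(repo["stargazers_count"])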

A .env file (paired with Chapter 34):

OPENWEATHER_API_KEY=abc123...
GITHUB_TOKEN=ghp_...

Load at the top of your script:

from dotenv import load_dotenv
load_dotenv()
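
After load_dotenv(), the values are ordinary environment variables:

import os
token = os.environ["GITHUB_TOKEN"]   # raises KeyError if the .env entry is missing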

24.12 Exercises

  1. Use requests.get to fetch https://api.github.com/repos/python/cpython and print the star count, license name, and default branch.
  2. Add a User-Agent header to the request above. Repeat the request without one (or with User-Agent: "") and see if you get the same response.
  3. Pick a public API (weather, Wikipedia, NASA Open Data, a Kaggle dataset) that requires a key. Sign up for a key, store it in a .env file, and write a short script that fetches one record. Do not commit the key.
  4. Write a function get_with_retry(url, max_retries=3) that retries on 429 and 5xx with exponential backoff, and raises on 4xx (except 429).
  5. Fetch a paginated endpoint (GitHub issues, Reddit posts, Hacker News) and build a DataFrame of every record across at least three pages. Print df.shape.
  6. Find a dataset that is available both as a bulk CSV download and as an API. Download both, load them into pandas, compare the row counts, and write down which was faster and why.
  7. Use resp = requests.get(...) and inspect resp.headers in a notebook. Find the Content-Type, Server, and any rate-limit headers (X-RateLimit-Remaining is common).

24.13 One-page checklist

  • Import requests; install with python -m pip install requests.
  • Always pass a timeout= to every request.
  • Check resp.status_code or call resp.raise_for_status() before using resp.json().
  • Pass query parameters as a params= dict, not by hand in the URL.
  • Send a descriptive User-Agent header.
  • Put API keys in environment variables, loaded from a .env file that is in .gitignore. See Chapter 34.
  • Sleep between requests; handle 429 with Retry-After or exponential backoff.
  • Prefer an official SDK when one exists.
  • Prefer a bulk download when you need a snapshot of the whole dataset.
  • Inspect resp.json() in a notebook before assuming its shape.

📚 Further reading

  • Python Software Foundation, requests documentation — the official guide to the most widely used Python HTTP library.
  • Encode, httpx documentation — a modern alternative to requests with the same shape but native async support; the right choice when you need concurrency.
  • MDN, An overview of HTTP — a clear, browser-agnostic explanation of HTTP methods, headers, and status codes.
  • HTTP Status Codes (httpstatuses.com) — a searchable reference for every status code you will see in the wild, with explanations.
  • HTTPie — a friendly command-line HTTP client; great for poking at an API before you wrap it in Python.
  • IETF, RFC 6749: The OAuth 2.0 Authorization Framework — the canonical authentication flow most modern APIs use; useful when “use the SDK” stops being an option.
  • IETF, RFC 9309: Robots Exclusion Protocol — the formal robots.txt standard; relevant when the API runs out and you are deciding whether scraping is appropriate.
  • Electronic Frontier Foundation, Coders’ Rights Project — ongoing legal explainers on the CFAA, DMCA, and security research; useful context for the “Stakes and politics” framing above.