Appendix A — Glossary

This alphabetized glossary defines important terms from across the handbook so you can look up a word you hit mid-chapter without reading from the top. Each entry has an anchor ID that other chapters link to (for example, a mention of a package manager elsewhere in the book will be a direct link to the entry below).

The “technology stack” is the collection of hardware and software systems that enables an application to run. On contemporary personal computers it includes many of the components defined below, such as the operating system, the file system, and drivers. Knowing the name of the component that is confusing you makes it easier to find help online or from your instructors and peers.

Algorithmic audit

A structured evaluation of an AI or algorithmic system, usually focused on whether the system meets some standard of fairness, safety, or accuracy across the populations it affects. Audits can be first-party (run by the vendor that built the system), second-party (run by the organization deploying it), or third-party (run by an independent group with the access and incentive to find problems). The presence of an audit is not the same as the audit having teeth — see Chapter 38 and Chapter 8.

Application

A self-contained program that allows a user to perform specific tasks. Applications are built differently depending on the operating system and typically need permission to access the file system to store and retrieve data. Microsoft Word and Excel, Google Chrome, and Apple Mail are all applications.

Channel

A named location from which conda downloads packages. Most often you will install from a large community channel called conda-forge by adding -c conda-forge to your conda commands. If conda can’t find a package, check that you are searching a channel that carries it.

Command line interface (CLI)

The most direct method for interacting with a computer’s operating and file systems. A CLI accepts only specifically formatted text commands to navigate the file system and run applications and scripts, but it is potentially more customizable than graphical interfaces. The macOS CLI lives in the “Terminal” application; the Windows CLI lives in “Command Prompt,” “PowerShell,” or “Windows Terminal.”

CSV (Comma-Separated Values)

A plain-text file format for tabular data where each row is a line and each column is separated by a delimiter (traditionally a comma, sometimes a tab or semicolon). See Chapter 20 for the quirks of reading CSVs reliably.
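As a minimal sketch of why the delimiter matters, the following uses Python's standard csv module to parse the same small table written once with commas and once with semicolons. The sample data and the helper name parse_rows are made up for illustration.

```python
import csv
import io

def parse_rows(text, delimiter=","):
    """Parse delimited text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

comma_text = "name,score\nAda,90\n"
semicolon_text = "name;score\nAda;90\n"

comma_rows = parse_rows(comma_text)
semicolon_rows = parse_rows(semicolon_text, delimiter=";")
# Reading semicolon-separated data with the default comma delimiter
# silently produces one wide column instead of raising an error.
wrong_rows = parse_rows(semicolon_text)

print(comma_rows)
print(semicolon_rows)
print(wrong_rows)
```

Note that every value comes back as a string; converting "90" to a number is a separate step.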

Driver

Software that interfaces between hardware and software (for example, allowing an operating system to receive information from your mouse and keyboard) or between software programs (for example, allowing your web browser to combine network packets into a complete data request). Your operating system will typically keep drivers for hardware like video cards, sound, and networking up to date.

Environment

Code on your computer does not run in isolation. To run a Python file you need a Python interpreter stored somewhere on your computer that reads and executes each line of your code. Your Python program will sometimes load files from a given directory or access packages stored somewhere on your machine. A program’s environment is the combination of the interpreter, installed packages, environment variables, and working directory that the program uses to run. conda and venv both manage Python environments; see Chapter 15.
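One way to see these pieces for a running program is to ask Python directly. The sketch below uses only the standard library; the function name describe_environment and the dictionary keys are illustrative choices, not part of any tool.

```python
import os
import sys

def describe_environment():
    """Collect the main pieces of the environment this program is running in."""
    return {
        "interpreter": sys.executable,             # path to the Python running this code
        "python_version": sys.version.split()[0],  # for example "3.12.1"
        "working_directory": os.getcwd(),          # where relative file paths resolve
        "package_search_path": list(sys.path),     # where imports are looked up
        "has_PATH_variable": "PATH" in os.environ, # one example environment variable
    }

env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```

Running the same snippet from a terminal and from a notebook is a quick way to check whether the two are using the same environment.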

File system

The software and hardware that store, retrieve, and organize the data folders, files, and their permissions for the users on your computer.

Graphical user interface (GUI)

A way of interacting with a computer by clicking on icons rather than typing text commands. The macOS file system is accessed graphically via “Finder”; the Windows file system is accessed via “File Explorer.” You will sometimes hear GUI pronounced “gee-you-eye” or “gooey.”

JSON (JavaScript Object Notation)

A plain-text file format for structured data that supports nested objects and arrays. It is the dominant format for data returned from web APIs. See Chapter 20 and Chapter 4.
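A short sketch of what nested objects and arrays look like once parsed, using Python's standard json module. The record itself is invented for illustration.

```python
import json

# A JSON document with an object nested inside an object, plus an array.
text = '{"user": {"name": "Ada", "languages": ["Python", "R"]}, "active": true}'

record = json.loads(text)  # JSON objects become dicts, arrays become lists
first_language = record["user"]["languages"][0]

# Serialize back to text; indent=2 makes the nesting easy to read.
round_trip = json.dumps(record, indent=2)

print(first_language)
print(round_trip)
```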

Jupyter kernel

The computational engine that executes the code in a Jupyter notebook. Each notebook is connected to a kernel; the kernel determines which Python interpreter (and therefore which environment) your code runs in. If your imports fail or the wrong version of a package shows up, check the kernel first. See Chapter 16.

Large language model (LLM)

A neural network trained on large amounts of text that generates text by predicting the next token in a sequence. LLMs power AI assistants like ChatGPT, Claude, and GitHub Copilot. See Chapter 35 and Chapter 36.

Library

A set of software resources that expand the functionality of a programming language. The basic version of Python only includes a core set of functionality; libraries like numpy, pandas, and matplotlib add numerical computation, data manipulation, and plotting. “Library” and “package” are often used interchangeably in Python.

Markdown

A lightweight markup language that uses plain-text characters (#, *, -, backticks) to represent formatting like headings, emphasis, lists, and code blocks. Markdown is the standard format for README files, GitHub issues, Jupyter notebook text cells, and documentation tools like Quarto. See Chapter 4.

Matilda effect

The systematic under-citation and under-attribution of women’s contributions to research, named by historian Margaret Rossiter as a counterpart to the “Matthew effect” (where established scholars accumulate disproportionate credit). The pattern extends beyond gender to scholars of color, scholars from non-English-speaking institutions, and scholars from the Global South. See Chapter 26.

Notebook

An interactive document that mixes executable code, prose, and output. In Python, the Jupyter notebook format (.ipynb) is the dominant choice. A notebook runs in a particular environment; see Chapter 16.

Open access (OA)

Scholarly publishing models in which articles are freely readable on the open web rather than behind a subscription paywall. Gold OA journals are open by default (often funded by article-processing charges paid by the author or their institution); green OA refers to author-deposited versions in institutional or subject repositories like arXiv. See Chapter 25 and the Directory of Open Access Journals.

Operating system

The software that coordinates the hardware (processor, hard drive, keyboard, mouse, and so on) and other software (file system, interfaces, programs, compilers, drivers, and so on) on your computer. macOS, Windows, and Linux are operating systems. See Chapter 9.

Package

Computer code that somebody else has already written and published for others to install and use. numpy and pandas are common packages for data science.

Package manager

A tool that installs, updates, and removes packages, tracking their versions and dependencies. conda and pip are the two package managers most commonly used in Python data science. See Chapter 14.

Parquet

A column-oriented binary file format for tabular data, designed for fast reads of large datasets. Parquet files are typically smaller and faster to load than equivalent CSVs for analytical work. See Chapter 20.

PATH

An environment variable that tells your computer where to look for programs. If you see errors like command not found, it often means a program you want is not on your PATH.
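To check what is on your PATH from inside Python, a minimal sketch with the standard library looks like this. The helper name find_program and the deliberately nonexistent program name are assumptions for illustration.

```python
import os
import shutil

# PATH is a single string of directories joined by os.pathsep
# (":" on macOS and Linux, ";" on Windows).
path_dirs = os.environ.get("PATH", "").split(os.pathsep)

def find_program(name):
    """Return the full path to a program found on PATH, or None if absent."""
    return shutil.which(name)

print(f"{len(path_dirs)} directories on PATH")
print(find_program("python"))  # may be None if python is not on PATH
missing = find_program("definitely-not-a-real-program-xyz")
print(missing)
```

A result of None corresponds to the situation in which a shell would report command not found.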

pip

A package manager for Python, typically paired with venv. See Chapter 14 and Chapter 15.

Programming language

A structured language that a computer can execute. Python, Java, Ruby, C++, and JavaScript are examples. Languages share concepts like variables, loops, and functions but their syntax usually does not translate directly.

REPL (Read–Eval–Print Loop)

An interactive prompt that reads code, evaluates it, prints the result, and loops. Typing python with no arguments drops you into Python’s REPL. Jupyter notebooks are a graphical form of REPL.
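The four steps can be sketched as a toy loop. This is an illustration only: the function run_repl is hypothetical, it reads from a list of strings rather than the keyboard, and it handles expressions but not statements such as assignments, which the real Python REPL does support.

```python
def run_repl(lines):
    """Toy read-eval-print loop over a list of expression strings."""
    namespace = {}   # shared state between lines
    outputs = []
    for line in lines:                  # read
        result = eval(line, namespace)  # eval (expressions only)
        text = repr(result)
        print(text)                     # print
        outputs.append(text)
    return outputs                      # loop ends when the input runs out

outputs = run_repl(["1 + 2", "'a' * 3"])
```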

RLHF (Reinforcement Learning from Human Feedback)

A post-training technique that teaches a base language model to follow instructions, refuse harmful requests, and produce useful responses. Human raters compare model outputs and rank them; those rankings train a reward model; the language model is then fine-tuned to maximize that reward. The labeling labor is typically performed by underpaid contract workers and is one of the hidden costs the AI industry externalizes. See Chapter 35 and Chapter 36.

Schema

The set of columns, types, and constraints that define the shape of a table or document store. A schema is also a frozen ontology — once a gender column allows two values or a country column uses a fixed list of ISO codes, those choices become claims the application makes before any user gets to disagree with them. See Chapter 23 and Chapter 8.
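A minimal sketch of a schema enforcing such a choice, using Python's built-in sqlite3 module and an in-memory database. The table name, column names, and the three-code list are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema: column names, types, and a constraint on allowed values.
conn.execute("""
    CREATE TABLE respondents (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        country TEXT NOT NULL CHECK (country IN ('US', 'CA', 'MX'))
    )
""")

insert = "INSERT INTO respondents (name, country) VALUES (?, ?)"
conn.execute(insert, ("Ada", "CA"))  # accepted: 'CA' is on the list

try:
    conn.execute(insert, ("Grace", "XX"))  # a code the schema does not allow
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the database refuses the row outright

row_count = conn.execute("SELECT COUNT(*) FROM respondents").fetchone()[0]
print(rejected, row_count)
```

The second row is never stored: the decision about which values count as valid was made when the table was created, not when the data arrived.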

Script

A file of source code intended to be executed as a program. In Python, a script is typically a .py file you run with python script.py. See Chapter 17.

Terminal

A program that launches a shell and shows its input and output in a window. “Terminal” and “shell” are sometimes used interchangeably, but technically the terminal is the window and the shell is the program that interprets your commands. See Chapter 11.

Text editor

A program for editing files — like a word processor but for code. Common choices are VS Code, Sublime Text, Notepad++, vim, and emacs. See Chapter 12.

Traceback

The sequence of function calls Python prints when an error occurs, showing the path from your entry point down to the line that raised the exception. Reading tracebacks is one of the highest-leverage debugging skills. See Chapter 7.
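As a small sketch, the following triggers an error two function calls deep and captures the traceback as text with the standard traceback module. The function names parse_age and load_record are made up for illustration.

```python
import traceback

def parse_age(text):
    return int(text)  # raises ValueError for non-numeric text

def load_record(raw):
    return {"age": parse_age(raw)}

try:
    load_record("forty")
except ValueError:
    # format_exc() returns the same text Python would print for this error.
    report = traceback.format_exc()

print(report)
```

Read the output from the bottom up: the last line names the exception and its message, and the lines above it list each call that led there, ending with parse_age.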

Version

An identifier (usually something like 1.2.3) that marks a specific release of a package or language. Good package management pins the versions you depend on so your work reproduces later. See Chapter 14.
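Version identifiers look like text but should be compared as numbers, part by part. The sketch below assumes simple three-part versions; parse_version is a hypothetical helper, and real-world version strings (pre-releases, for example) usually call for a dedicated parser such as the third-party packaging library.

```python
def parse_version(text):
    """Turn a simple 'major.minor.patch' string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

newer, older = "1.10.0", "1.9.3"

# Comparing as strings goes character by character, so "1.1..." sorts before "1.9...".
string_says_newer = newer > older
# Comparing as tuples of integers treats 10 as larger than 9.
tuple_says_newer = parse_version(newer) > parse_version(older)

print(string_says_newer, tuple_says_newer)
```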

Virtual environment

An isolated Python environment scoped to a single project, so packages you install for one project don’t leak into others. python -m venv .venv creates one; source .venv/bin/activate turns it on in macOS and Linux shells (on Windows, run .venv\Scripts\activate instead). See Chapter 15.
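To confirm from inside Python whether a virtual environment created with venv is active, one common check compares two values in the sys module. This is a sketch: the function name in_virtual_environment is an illustrative choice, and the check does not detect conda environments.

```python
import sys

def in_virtual_environment():
    """Return True when running inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment's folder while
    # sys.base_prefix still points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix

print(in_virtual_environment())
print(sys.prefix)
```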

YAML

A human-readable data serialization format that uses indentation (spaces, never tabs) to represent nested structure. YAML is the standard format for configuration files in tools like Quarto, GitHub Actions, and conda. See Chapter 4.