36 How Language Models Work
Prerequisites (read first if unfamiliar): Chapter 35.
See also: Chapter 37, Chapter 38.
Purpose

You have probably used a chatbot, had AI help you write code, or asked a language model to explain a confusing error message. These tools are now embedded in text editors, search engines, notebooks, and APIs. But most people use them with little understanding of what is actually happening under the hood — and that gap causes real problems: you paste in a prompt, get a confident-sounding answer, and have no way to judge whether it is trustworthy, or why it failed when it did.
This chapter gives you the conceptual vocabulary you need to use language models more deliberately. You do not need to know how to train a model. You do not need to understand the mathematics of transformers. What you need is a working mental model: how text becomes input, how context shapes output, how temperature controls variation, how embeddings enable search, and what the difference is between typing into a chatbot and calling an API. With that mental model in place, you can write better prompts, diagnose failures faster, and build more reliable workflows around AI tools.
Learning objectives
By the end of this chapter, you should be able to:
Explain what tokens are and why tokenization affects model behavior
Describe what a context window is and how to work within its limits
Explain what temperature and sampling do and when to adjust them
Define embeddings and describe at least one practical use
Construct a well-structured prompt using system and user roles
Identify when to use a chatbot interface versus an API call
Explain what function calling (tool use) is and how a model invokes a tool
Describe at least two reasons models hallucinate and how internals contribute
Running theme: the model sees text, not meaning
Every behavior of a language model — useful and frustrating — follows from one fact: the model processes sequences of tokens and predicts likely continuations. It does not understand your intent, look things up, or reason the way a person does. The better you understand what the model actually sees, the better you can guide it toward what you actually want.
36.1 Tokens and tokenization
When you type a message into a language model, the first thing that happens is not reading. It is tokenization: your text is split into small chunks called tokens, and those tokens — not characters, not words — are what the model processes.
A token is roughly four characters of English text on average, which works out to about three-quarters of a word. The sentence “the quick brown fox” might become four tokens: the, quick, brown, fox (the exact split depends on the tokenizer). But tokenization is not simply splitting on spaces. Subword tokenization means common words become single tokens while rare or compound words get split further. The word unbelievable might be three tokens: un, believ, able.
Several practical implications follow from this:
Non-English text uses more tokens. Many tokenizers were trained primarily on English, so languages like Thai, Arabic, or Chinese may require two to four times as many tokens to express the same content. This affects both cost and context capacity.
Code is tokenized differently from prose. Symbols like {, =>, and indentation may each be their own tokens. Whitespace matters more than you might expect.
Counting tokens is not counting words. API pricing and context limits are in tokens, not words. Most providers offer tokenizer tools so you can measure before you send.
Rare or technical terms may get fragmented. A domain-specific word that the model rarely saw during training might be split into subwords that, individually, have different associations. This can subtly affect how the model interprets your request.
When diagnosing unexpected model behavior, tokenization is worth checking. If a model consistently misinterprets a technical term, it may be seeing it as two or three tokens with different meanings rather than one cohesive concept.
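To see tokenization concretely, the sketch below counts tokens with OpenAI's open-source tiktoken library. This is one tokenizer among many; other providers ship their own tools, and the exact splits will differ between models.

import tiktoken

# One widely used encoding; other models use different vocabularies and splits.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["the quick brown fox", "unbelievable", "Schadenfreude", "df.groupby('user_id')"]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{len(tokens):2d} tokens  {text!r} -> {pieces}")

Running this on your own prompts is the quickest way to see how technical terms, code, and non-English words fragment differently from ordinary prose.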
36.2 Context windows
A language model does not have persistent memory between conversations. Everything the model “knows” about your current task is contained in the context window: the full sequence of tokens that is sent to the model for a given request.
The context window includes your system prompt, the conversation history, any documents or code you paste in, and the model’s own previous responses. All of that must fit within a fixed limit, which ranges from a few thousand to over a million tokens depending on the model.
When your input exceeds the context window, the model does not crash or warn you in an obvious way. Depending on the implementation, older content is silently truncated from the beginning of the context, which means the model may answer as if it never saw your earlier instructions or data.
Practical implications:
Earlier instructions get lost in long conversations. If you set up a detailed system prompt and then have a long back-and-forth, the original instructions may be pushed out. Restate important constraints periodically.
Pasting large files into chat is risky. A 10,000-line log file will consume most of a mid-range context window. Prefer to paste targeted excerpts rather than full files.
Context position matters. Models tend to attend more reliably to content at the beginning and end of the context than to content buried in the middle. For important instructions or key examples, placement matters.
Context windows are not permanent storage. If you start a new conversation, the model has no memory of previous sessions unless you explicitly provide that history.
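In practice this becomes a budgeting exercise: estimate what you are about to send, and trim the oldest turns if you are over the limit. The sketch below uses the rough four-characters-per-token estimate from earlier; the budget value and message format are placeholders, and your provider's token counter is always more accurate.

# Rough sketch: keep a conversation under an assumed token budget by dropping
# the oldest turns first. len(text) // 4 is a rule of thumb, not an exact count.
def estimate_tokens(text):
    return max(1, len(text) // 4)

def trim_history(system_prompt, messages, budget=8000):
    used = estimate_tokens(system_prompt)
    kept = []
    for msg in reversed(messages):        # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break                         # everything older than this is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))           # restore chronological order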
36.3 Sampling and temperature
Language models do not produce a single deterministic answer. They produce a probability distribution over what token might come next, and then sample from that distribution. This is why the same prompt can produce different outputs each time you run it.
Temperature is the parameter that controls how that distribution is shaped before sampling:
At temperature 0, the model always picks the highest-probability token. Output is deterministic and consistent, but can feel repetitive and may get stuck in predictable patterns.
At low temperature (0.1–0.4), the model strongly favors likely tokens but still has some variation. Good for tasks where accuracy and consistency matter: code generation, structured data extraction, classification.
At medium temperature (0.5–0.8), the model balances probability and diversity. A common default for general-purpose conversation.
At high temperature (0.9–2.0), the model explores less probable options more often. Useful for brainstorming, creative writing, and generating diverse options — but more likely to produce errors, inconsistencies, or off-topic content.
Related parameters you may encounter:
Top-p (nucleus sampling): instead of sampling from the full distribution, sample only from the smallest set of tokens whose cumulative probability exceeds a threshold. Temperature and top-p are often used together.
Max tokens: limits the length of the output. If you set this too low, responses get truncated mid-sentence.
Stop sequences: strings that tell the model to stop generating. Useful for structured output: you can stop the model when it produces a delimiter like END or a closing triple backtick.
When you need reproducible output (unit tests, data pipelines, evaluations), set temperature to 0. When you need variety (brainstorming, multiple drafts), raise it. When a model is giving you boring, repetitive answers, try increasing temperature slightly before rewriting your prompt.
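To make the effect concrete, here is a small numerical sketch of how temperature and top-p reshape a next-token distribution. The five candidate tokens and their logits are invented for illustration; real models sample over vocabularies of tens of thousands of tokens.

import numpy as np

# Toy next-token logits for five candidate tokens (values invented for illustration).
logits = np.array([4.0, 3.2, 2.5, 1.0, 0.2])

def next_token_probs(logits, temperature=1.0, top_p=1.0):
    # Temperature rescales logits before the softmax: low values sharpen the
    # distribution toward the top token, high values flatten it.
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    # Top-p keeps only the most probable tokens whose cumulative mass reaches
    # the threshold, then renormalizes over that smaller set.
    order = probs.argsort()[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

for t in (0.2, 0.7, 1.5):
    print(f"temperature={t}: {next_token_probs(logits, temperature=t).round(3)}")

At temperature 0.2 nearly all the probability mass lands on the top token; at 1.5 the alternatives become genuinely competitive, which is exactly the extraction-versus-brainstorming trade-off described above.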
36.4 Embeddings
Embeddings are a different use of language models from text generation. An embedding model converts text into a list of numbers — a vector — that encodes the meaning of that text in a high-dimensional space. Similar texts produce similar vectors; dissimilar texts produce vectors that are far apart.
Embeddings are the foundation of several practical applications:
Semantic search: instead of keyword matching, embed the query and all documents, then find documents whose vectors are closest to the query vector. This works even when the exact words differ (“reduce memory usage” matches “optimize RAM consumption”).
Retrieval-augmented generation (RAG): embed a large document collection, store embeddings in a vector database, and at query time retrieve the most relevant chunks to include in the model’s context. This is how you give a language model access to information that does not fit in its context window.
Clustering and classification: group texts by semantic similarity without labeling, or train a simple classifier on top of embedding vectors rather than raw text.
Deduplication: find near-duplicate records in a dataset by comparing embedding similarity.
Embeddings come from specialized embedding models (distinct from chat models) and are typically cheaper and faster to generate. If your task involves finding similar text, organizing documents by topic, or connecting a model to a large knowledge base, embeddings are usually the right tool — not just pasting everything into a chat window.
36.5 Prompting best practices
A prompt is not just a question. It is structured input that shapes the model’s behavior across the entire response. Understanding the anatomy of a well-constructed prompt lets you get more reliable, predictable output.
The anatomy of a prompt
Modern language model APIs organize input into roles:
System: Instructions set before the conversation begins. Use this for persistent context: who the model is, what format to use, what topics to avoid, and what assumptions to make.
User: The human’s turn. Your questions, requests, or data go here.
Assistant: The model’s responses. You can also pre-fill the assistant turn to steer the response format (“The answer is: …”).
In a chat interface, the system prompt is often hidden. In an API call, you set it explicitly. One of the biggest practical differences between a chatbot and an API call is that you control the system prompt.
Principles for reliable prompts
Be specific about the task. Vague prompts produce vague answers. “Summarize this” is worse than “Summarize this in three bullet points for an audience with no background in statistics.”
Provide format instructions. If you need JSON, a table, a numbered list, or markdown, say so explicitly. Language models follow format instructions reliably when they are clear.
Give examples (few-shot prompting). Showing two or three examples of input-output pairs dramatically improves consistency. This is especially useful for classification, extraction, and rewriting tasks.
Separate data from instructions. Use clear delimiters (triple backticks, XML tags, or section headers) to mark the boundary between your instructions and the text you want the model to operate on.
Assign a role when it helps. “You are a careful code reviewer” or “You are a plain-language writer” primes the model toward a useful perspective. Do not over-specify; one clear role beats a paragraph of competing instructions.
Ask for reasoning before the answer. “Explain your reasoning step by step, then give the final answer” produces more accurate results on multi-step problems than asking for the answer directly.
Test prompts systematically. Small wording changes can produce large output changes. Treat prompt development like code development: make one change at a time and evaluate the result before moving on.
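The sketch below combines several of these principles in one API request: a system role, explicit format instructions, one few-shot example, and XML-style tags separating instructions from data. The model name is carried over from the API example in the next section, and the wording is a placeholder rather than a recommended template.

import anthropic

client = anthropic.Anthropic()

document_text = "..."  # the text you actually want summarized

system = (
    "You are a careful technical summarizer. "
    "Always answer with exactly three plain-language bullet points."
)

# One few-shot example (user turn plus assistant turn), then the real request.
# The <doc> tags separate instructions from the data the model should operate on.
messages = [
    {"role": "user", "content": "Summarize the text between <doc> tags.\n"
        "<doc>Gradient descent updates parameters in the direction that reduces the loss.</doc>"},
    {"role": "assistant", "content": "- Adjusts model parameters step by step\n"
        "- Each step moves toward lower error\n"
        "- The learning rate controls how big each step is"},
    {"role": "user", "content": f"Summarize the text between <doc> tags.\n<doc>{document_text}</doc>"},
]

response = client.messages.create(
    model="claude-opus-4-6",  # model name reused from the example in the next section
    max_tokens=300,
    system=system,
    messages=messages,
)
print(response.content[0].text)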
36.6 API versus chatbot interfaces
The two primary ways to interact with a language model are the chatbot interface (a web or app UI like ChatGPT, Claude.ai, or Gemini) and the API (a programmatic interface — for example, the Anthropic API or the OpenAI API — you call from code). Understanding the difference helps you choose the right tool.
What a chatbot gives you
Low friction: type a message, read the response
Session memory (within a single conversation)
Access to built-in tools like web browsing, code execution, or image generation
Operator-defined system prompts you cannot see or change
No need to manage authentication or rate limits
Use a chatbot for exploratory, one-off tasks: getting an explanation, brainstorming, reviewing a single document, generating a first draft.
What an API gives you
Full control over system prompt, temperature, max tokens, and other parameters
The ability to call the model from inside code (scripts, notebooks, web apps)
Structured output via JSON mode or tool use
The ability to process many inputs in a loop (batch processing)
Cost transparency: you see exactly how many tokens each request consumes
No session memory by default — you manage conversation history yourself
A minimal Python API call looks like:
import anthropic  # https://docs.anthropic.com/en/api/getting-started

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system="You are a helpful data analysis assistant.",
    messages=[
        {"role": "user", "content": "What is the median of [3, 7, 2, 9, 5]?"}
    ],
)
print(message.content[0].text)
When to use an API: any time you need to run the same prompt over multiple inputs, integrate AI into a script or pipeline, control parameters precisely, or build something reproducible.
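Because the API keeps no session memory, multi-turn conversation is something you implement yourself by resending the accumulated history on every request. A minimal sketch, using the same client as above:

import anthropic

client = anthropic.Anthropic()
history = []  # the conversation so far; you own this list

def ask(user_text):
    history.append({"role": "user", "content": user_text})
    reply = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="You are a helpful data analysis assistant.",
        messages=history,            # the full history is resent every time
    )
    answer = reply.content[0].text
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("What is the median of [3, 7, 2, 9, 5]?"))
print(ask("And the mean?"))  # only works because the first exchange is still in history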
36.7 Tools and function calling
Language models on their own cannot browse the web, run code, read files, or query databases. Function calling (also called tool use) is the mechanism by which a model can invoke external capabilities that you define.
The pattern works as follows:
You describe available tools to the model in a structured format: tool name, description, and the parameters it accepts.
The model, when it decides a tool is needed, responds with a structured tool-call request rather than a text answer.
Your code receives the tool-call request, runs the actual function, and sends the result back to the model.
The model incorporates the result into its final response.
A tool definition might look like:
{
  "name": "get_weather",
  "description": "Returns current weather for a given city.",
  "input_schema": {
    "type": "object",
    "properties": {
      "city": {
        "type": "string",
        "description": "City name, e.g. Denver"
      }
    },
    "required": ["city"]
  }
}
The model does not run the function. It decides whether to call it and constructs the arguments. Your code does the actual execution. This design keeps the model sandboxed: it can only invoke what you expose to it, with the parameters you define.
Function calling is powerful for connecting language models to live data, APIs, and databases. It is also where safety considerations become important: if a tool can modify files, send emails, or execute code, you need to be deliberate about what you expose and when you actually run what the model requests (see Chapter 35 for risk-based verification policies).
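A minimal sketch of the dispatch loop, assuming the tool-use request and response shapes of the Anthropic Messages API (other providers follow the same pattern with different field names). The get_weather implementation is a stand-in; in real code it would call an actual weather service, and you would decide whether to run the request at all.

import json
import anthropic

client = anthropic.Anthropic()

weather_tool = {
    "name": "get_weather",
    "description": "Returns current weather for a given city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name, e.g. Denver"}},
        "required": ["city"],
    },
}

def get_weather(city):
    return json.dumps({"city": city, "temp_c": 21})  # stand-in for a real weather API

messages = [{"role": "user", "content": "What's the weather in Denver?"}]
response = client.messages.create(
    model="claude-opus-4-6", max_tokens=1024,
    tools=[weather_tool], messages=messages,
)

for block in response.content:
    if block.type == "tool_use":                   # the model asked to call a tool
        result = get_weather(**block.input)        # your code does the actual execution
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": block.id, "content": result}
        ]})
        final = client.messages.create(
            model="claude-opus-4-6", max_tokens=1024,
            tools=[weather_tool], messages=messages,
        )
        print(final.content[0].text)               # answer that incorporates the result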
36.8 Why models hallucinate
“Hallucination” is the common term for when a language model confidently states something that is false. Understanding why this happens helps you anticipate and check for it.
The model predicts likely text, not true statements. Training on human-written text means the model learned what fluent, plausible-sounding sentences look like — not what is factually correct. A confident tone is common in training data even for false claims.
Rare facts are underrepresented in training. If the correct answer appeared rarely in training text, the model may not have learned it well. It may substitute a plausible-sounding but wrong answer drawn from more common patterns.
The training cutoff limits recency. Models are trained up to a knowledge cutoff date. Events, packages, APIs, and documentation that changed after that date may produce outdated or incorrect answers.
Tokenization artifacts. Fragmented tokens for technical terms can lead to misinterpretation and generation errors that look like factual errors.
There is no “I don’t know” training signal. Standard language modeling does not penalize the model for answering when uncertain; it is optimized to produce fluent completions. Some models have been fine-tuned to express uncertainty more reliably, but this is not universal.
Practical responses to hallucination:
For factual claims, always verify against primary sources.
For code, run it and test it; reading is not sufficient.
Ask the model to cite or explain where its information comes from.
Use retrieval (RAG) to ground the model in verified documents.
Prefer lower temperatures for factual tasks.
Ask “Are you confident in this?” or “What might be wrong here?” — models often acknowledge uncertainty when asked directly.
36.9 Stakes and politics
This chapter explains how the model works mechanically — tokens, attention, sampling, embeddings. The political dimension lives one level up, in the questions of what is being modeled and who paid to build the model that does the modeling.
Three things to notice. First, training corpora are not the world. LLMs are trained on the parts of human language that ended up on the internet, in the languages that dominate the internet, in the time period the crawl happened to cover. That corpus is overwhelmingly English (often more than 90% by token count even in “multilingual” models), drawn from contributors who are disproportionately male, US/European, and middle-class, and weighted toward formal published text rather than oral, vernacular, or domain-specific writing. The model’s “default” voice is the median voice of that corpus; languages and cultures outside it appear as edge cases at best.
Second, frontier-model training concentrates compute. Training a modern flagship model takes tens of thousands of GPUs, hundreds of millions to billions of dollars, and energy and water on a scale that is now a measurable fraction of regional grids. Only a handful of companies — OpenAI, Anthropic, Google, Meta, a small set of well-funded Chinese labs — can do this. The community of people who decide what the next major model will be like, what data it will see, and what it will refuse to do is correspondingly small. Open-weights releases (Meta’s Llama family, Mistral, and others) partly mitigate this; they do not change the fact that the upstream training labs decide what gets released at all.
Third, embeddings encode the same biases. The embedding spaces this chapter introduces are useful precisely because they capture statistical regularities in the corpus — including the regularities that reflect prejudice, stereotype, and uneven representation. “Distance in embedding space” is sometimes an objective measure and sometimes a measurement of who the corpus knew about.
See Chapter 8 for the broader framework, Chapter 35 for the user-facing workflow this chapter underpins, Chapter 37 for what happens when these models start acting, and Chapter 38 for how these biases get surfaced (or laundered) by audits. The concrete prompt to carry forward: when a model “knows” something, ask whose corpus it learned from.
36.10 Worked examples
Diagnosing why a prompt produces inconsistent output
You have a prompt that “usually works” but sometimes returns garbage, and you want to find out why before throwing more wording at it. The trick is to change one variable at a time. Start by writing down the current prompt and three or four sample outputs that demonstrate the inconsistency — actual text, side by side. Check the token count of your prompt with your provider’s tokenizer tool: if you are anywhere near the context window, that alone could be the cause. Then set temperature to 0 and re-run the same prompt several times. If outputs are now identical, the inconsistency was sampling variance; if they still vary, the inconsistency is in something else (maybe an external tool or retrieval step). Next, add explicit format instructions (“Return JSON with keys summary and tags”) and re-run; format failures often disappear instantly when the format is named. If the model is still wandering, add one or two few-shot examples showing exactly the output you want. After each change, compare against your original samples and note which change actually fixed the inconsistency — that is the lesson you want to keep.
import anthropic

client = anthropic.Anthropic()
prompt = "..."  # the prompt you are diagnosing

def call(prompt, temperature=1.0):
    return client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text

print(call(prompt))                         # baseline, with sampling variation
print(call(prompt, temperature=0))          # remove sampling variance
print(call(prompt + "\nReturn JSON."))      # add format instruction
Converting a chatbot workflow to an API call
You have been running the same task in a chatbot UI — paste a document, ask the same question, copy the answer — and you are tired of doing it by hand. Time to convert it to an API call. Identify exactly what you are doing interactively: the question (which probably has a stable structure), any pasted documents (the per-request input), and the format you want back. Pull the stable parts out as a system prompt (the instructions you repeat every time) and leave the variable parts as the user turn (the actual document or query). Set up an API client, write the call with explicit temperature and max_tokens so behavior is reproducible, and run it on a single example to compare against your chatbot result. Once they agree, parameterize the user turn to take a file path or query string from the command line, and wrap the whole thing in a loop over inputs. Log every input, output, and token count so you can audit later.
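A minimal sketch of the converted workflow, assuming the Anthropic SDK used elsewhere in this chapter; the file paths, system prompt, and logging format are placeholders for your own task.

import sys
import anthropic

client = anthropic.Anthropic()
SYSTEM = "You are a careful reviewer. Answer in exactly three bullet points."  # the stable part

def process(path):
    document = open(path, encoding="utf-8").read()    # the per-request, variable part
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        temperature=0,                                # reproducible behavior
        system=SYSTEM,
        messages=[{"role": "user", "content": document}],
    )
    usage = response.usage
    print(f"{path}: {usage.input_tokens} tokens in, {usage.output_tokens} out", file=sys.stderr)
    return response.content[0].text

for path in sys.argv[1:]:                             # loop over every file on the command line
    print(process(path))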
Building a simple semantic search with embeddings
You want to search a small document collection by meaning, not just keywords. Embeddings make this a 50-line script. Assemble your collection (50 paragraphs from a dataset is enough to play with), call an embedding API to generate a vector for each one, and store the vectors as a matrix. When a query arrives, embed it with the same embedding model — different models produce incompatible vector spaces. Compute cosine similarity between the query vector and every document vector, sort by score, and return the top three.
import numpy as np

docs = [...]  # 50 paragraph strings

def embed(text):
    # pseudo-code; replace with your provider's embedding call
    return np.array(provider.embed(text))

doc_vecs = np.stack([embed(d) for d in docs])

def search(query, k=3):
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    top = sims.argsort()[::-1][:k]
    return [(docs[i], float(sims[i])) for i in top]
Test it with a handful of real queries and compare against grep over the same documents — you will quickly see the cases where keyword search misses paraphrases that semantic search catches.
36.11 Exercises
Use your model provider’s tokenizer tool to count the tokens in a 500-word essay. Then count the tokens in the same content in another language (use a translation tool if needed). How does the token count differ, and what does this imply for cost?
Write the same request (e.g., “Explain what a p-value is”) as both a vague prompt and a structured prompt with role, format instructions, and audience specification. Compare the outputs and describe what changed.
Run the same well-specified prompt five times at temperature 0 and five times at temperature 1.0. Document the variation. For what kinds of tasks does the difference matter most?
Write a short system prompt and a user prompt that together produce a consistent JSON output with three fields: summary, key_terms, and confidence. Test it on three different input documents.
Find a piece of text where a language model confidently gave you a wrong answer (or construct a case by asking about a recent event after the model’s training cutoff). Explain which of the hallucination causes from this chapter likely applies, and describe how you would verify the correct answer.
Make an API call from a Python script (using any model you have access to) that takes a string from the command line, sends it to the model with a system prompt you write, and prints the response. Confirm that the output changes when you change the system prompt.
36.12 One-page checklist
Before sending a long prompt, check the token count to ensure it fits within the context window
Set temperature to 0 for tasks requiring consistency; raise it only when variation is desirable
Use system prompts (not just user prompts) to set persistent instructions and format expectations
Use few-shot examples for classification, extraction, and structured output tasks
Separate instructions from data with clear delimiters (backticks, XML tags, or section headers)
Verify factual claims from AI output against primary sources before using them
Use embeddings when the task involves finding similar documents or building retrieval systems
Use the API (not just a chatbot) when you need to automate, loop, or control parameters precisely
When exposing tools to a model, define them with precise descriptions and only expose what is necessary
When outputs are inconsistent, diagnose whether the cause is sampling variance, prompt ambiguity, or context length before changing the prompt
36.13 Further reading
- Anthropic, Prompt engineering — structured prompt design patterns and evaluation tips.
- OpenAI, Prompt engineering — OpenAI’s parallel guide, including system-message best practices.
- Hugging Face, Transformers documentation — the reference implementation of the model architecture behind most modern chat models.
- Vaswani et al., Attention Is All You Need (NeurIPS 2017) — the original transformer paper; short, technical, and worth reading once the conceptual picture in this chapter is clear.
- Jay Alammar, The Illustrated Transformer — the canonical visual explanation of attention; 30 minutes of reading that demystifies most of what is happening inside the model.
- Andrej Karpathy, Let’s build GPT from scratch in code, spelled out — a two-hour video that walks through implementing a small transformer end-to-end; the deepest practical understanding most non-specialists will get.
- Anthropic, Mapping the Mind of a Large Language Model — public-facing summary of mechanistic interpretability research; useful context for “what is the model actually representing.”
- Emily M. Bender et al., On the Dangers of Stochastic Parrots (FAccT, 2021) — required reading for the “Stakes and politics” framing above; pairs with the labor and corpus-bias points.