Your kid asks you what a large language model is. You panic. You mumble something about neural networks and training data. Their eyes glaze over. You’ve lost them.
Let’s fix that — for both of you.
Understanding how generative AI works isn’t optional anymore. It’s the difference between someone who gets consistently great output and someone who rage-quits after three bad responses. Every technique I’ve written about — from writing better prompts to behavior-driven prompting — works because of how these systems are built. Once you see the machinery, the techniques stop feeling like tricks and start feeling obvious.
The world’s most sophisticated autocomplete
Here’s the five-year-old version: an LLM is a next-word prediction machine.
That’s it. When you type “the cat sat on the,” the model looks at every word it’s seen in training — billions of documents, books, conversations, code — and calculates the probability of what comes next. “Mat” is likely. “Quantum” is not. It picks a word, adds it to the sequence, and repeats.
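The whole loop fits in a few lines. Here's a toy sketch (illustrative only, with made-up counts standing in for billions of documents) of what "calculate the probability of what comes next, then pick one" looks like:

```python
import random

# Toy next-word model: imagined counts of how often each word
# followed "the cat sat on the" in some pretend training data.
counts = {"mat": 80, "floor": 15, "lap": 4, "quantum": 1}

def next_word(counts):
    # Convert counts to probabilities, then sample one word
    total = sum(counts.values())
    words = list(counts)
    probs = [counts[w] / total for w in words]
    return random.choices(words, weights=probs)[0]

print(next_word(counts))  # usually "mat", almost never "quantum"
```

A real model computes those probabilities over an entire vocabulary at every step, but the shape of the operation is the same: score the candidates, sample one, append it, repeat.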
It doesn’t know what a cat is. It doesn’t understand sitting. It has learned statistical patterns about which words follow other words in human language. Those patterns are deep and complex enough that the output looks like understanding — but the mechanism is pattern matching, not comprehension.
This matters because it explains the single most common frustration people have with AI: it sounds confident about things it’s wrong about. It’s not lying. It’s not confused. It’s doing exactly what it was built to do — generating the most statistically probable next word. Sometimes the most probable sequence happens to be wrong.
How it learns: training on everything
Imagine you read every book ever written, every Wikipedia article, every Reddit thread, every GitHub repo. You didn’t memorize them word for word — but you absorbed patterns. You learned that legal documents sound a certain way. Medical papers have a structure. Python code follows conventions. Casual conversation has a rhythm.
That’s training. The model processes massive amounts of text and adjusts billions of internal parameters — think of them as tiny dials — until it gets good at predicting what comes next in a sequence. The training doesn’t store facts in a database. It encodes relationships between words and concepts as numerical weights.
This is why LLMs can write poetry in the style of Shakespeare and also generate working Python code. They’re not switching between different skills. They’re doing the same thing — predicting likely sequences — across different domains.
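You can see the "tiny dials" idea in miniature. This sketch trains a single dial to predict the next number in a sequence by nudging it whenever the prediction is off. Real LLMs do this with billions of dials and a different loss function, but the loop is the same shape: predict, measure the error, adjust, repeat.

```python
# Toy "training" loop: learn one weight so that pred = w * x
# predicts the next value in the pattern (here, next = 2 * current).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = 0.0    # one tiny dial, starting at random-ish
lr = 0.05  # how hard to twist the dial each step

for _ in range(200):
    for x, target in data:
        pred = w * x
        grad = 2 * (pred - target) * x  # direction of the error
        w -= lr * grad                  # twist the dial to reduce it

print(round(w, 2))  # converges toward 2.0 -- the pattern in the data
```

Nothing in that loop "stores" the fact that next = 2x. The relationship ends up encoded in the weight, which is exactly the sense in which a trained model encodes relationships rather than facts.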
Temperature: the creativity dial
When the model predicts the next word, it doesn’t just pick the single most likely one every time. There’s a setting called temperature that controls how adventurous it gets.
- Low temperature (close to 0): the model almost always picks the most probable word. Output is predictable, consistent, and a bit boring. Great for code, factual answers, and structured tasks.
- High temperature (close to 1 or above): the model is willing to pick less probable words. Output is more creative, surprising, and sometimes nonsensical. Better for brainstorming, fiction, and exploration.
Think of it like asking a five-year-old to finish the sentence “the dog went to the…” At low temperature, they say “park.” At high temperature, they say “moon on a bicycle made of cheese.” Both are valid outputs from the same system — the dial just changes which possibilities get selected.
This is why the same prompt can give you different results each time. The model isn’t being inconsistent. It’s sampling from a probability distribution, and temperature controls how broadly it samples.
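Here's a minimal sketch of that sampling step, with made-up scores ("logits") for three candidate words. Dividing the scores by the temperature before converting them to probabilities is the standard trick: low temperature sharpens the distribution toward the favorite, high temperature flattens it so long shots get picked.

```python
import math
import random

# Made-up model scores for finishing "the dog went to the..."
logits = {"park": 3.0, "vet": 1.5, "moon": 0.2}

def sample(logits, temperature):
    if temperature <= 0:
        # Temperature 0: always take the top word (greedy decoding)
        return max(logits, key=logits.get)
    scaled = {w: s / temperature for w, s in logits.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    exps = {w: math.exp(v - m) for w, v in scaled.items()}
    total = sum(exps.values())
    words = list(exps)
    probs = [exps[w] / total for w in words]
    return random.choices(words, weights=probs)[0]

print(sample(logits, 0))  # always "park"
# At temperature 0.2 it's "park" nearly every time;
# at temperature 2.0, "moon" starts showing up regularly.
```

Run the high-temperature version a few hundred times and count the results: the dog goes to the moon a lot more often. Same model, same scores, different dial.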
Context windows: short-term memory, no long-term memory
Here’s where people get into trouble. An LLM has no memory between conversations. Zero. Each time you start a new chat, the model knows nothing about you, your project, or the fourteen previous conversations you’ve had.
Within a single conversation, the model can “remember” — but only what fits in its context window. Think of it as a desk. Everything on the desk is visible. The model can reference it, reason about it, and build on it. But the desk has a fixed size. Once you pile on more text than the desk can hold, the oldest stuff falls off the edge.
This is exactly why spec-driven development matters so much when working with AI agents. If the agent can’t see the spec, it can’t follow the spec. It’s not being lazy or forgetful — it literally does not have access to information that’s fallen out of the context window. This is also why AGENTS.md files outperform on-demand skill retrieval — passive context that’s always loaded on the desk beats skills that might not get picked up at all.
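The "falling off the desk" behavior is easy to sketch. This is a simplified model (real systems vary in how they truncate, and the 4-characters-per-token estimate is just a rule of thumb): keep the newest messages that fit the budget, drop everything older.

```python
def estimate_tokens(text):
    # Rough rule of thumb: about 4 characters per token in English
    return max(1, len(text) // 4)

def fit_to_window(messages, max_tokens):
    # Walk backward from the newest message, keeping what fits;
    # anything older than the cutoff falls off the desk.
    kept, used = [], 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order

history = [
    "spec: build the parser first",
    "ok",
    "here is file one...",
    "now refactor",
]
# With a tiny 10-token budget, the spec is the first thing dropped:
print(fit_to_window(history, 10))
```

Notice what got cut: the spec, because it was oldest. The model downstream never sees it and can't follow it, which is the whole failure mode in one function.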
Tokens: the atoms of AI language
LLMs don’t read words the way you do. They break text into tokens — chunks that are usually a word or part of a word. “Understanding” might be two tokens: “understand” and “ing.” Common words like “the” are one token. Unusual words get split into more pieces.
Why does this matter? Because everything has a token cost. Your prompt uses tokens. The model’s response uses tokens. The context window is measured in tokens. When people say a model has a “200K context window,” they mean roughly 200,000 tokens — about 150,000 words.
It also means the model doesn’t see your text the way you do. It sees a sequence of token IDs — numbers. The entire process of “reading” your prompt and “writing” a response happens in a mathematical space where words have been converted to numbers, processed through layers of matrix multiplication, and converted back. There’s no little person inside reading your message.
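A toy tokenizer makes the splitting concrete. Real tokenizers (like BPE) learn their vocabulary from data rather than using a hand-made list, but the greedy longest-match idea below captures the behavior: common words stay whole, unusual words get split into pieces.

```python
# Tiny hand-made vocabulary standing in for a learned one
VOCAB = {"the", "cat", "sat", "on", "mat", "understand", "ing", "a"}

def tokenize(text):
    # Greedily match the longest vocabulary chunk at each position;
    # fall back to single characters for anything unknown.
    tokens = []
    for word in text.lower().split():
        while word:
            for end in range(len(word), 0, -1):
                chunk = word[:end]
                if chunk in VOCAB or end == 1:
                    tokens.append(chunk)
                    word = word[end:]
                    break
    return tokens

print(tokenize("understanding the cat"))
# ['understand', 'ing', 'the', 'cat']
```

From here a real system maps each chunk to a numeric ID, and those IDs are all the model ever sees.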
Hallucinations: the feature nobody wanted
The model generates plausible-sounding text. Usually that text is accurate because the training data was accurate. But sometimes the most statistically probable sequence is completely fabricated — a citation that doesn’t exist, a function that was never part of an API, a historical event that never happened.
This isn’t a bug. It’s a direct consequence of how the system works. The model doesn’t have a “fact database” it checks against. It generates sequences that pattern-match to what factual statements look like. Most of the time the pattern and the fact align. Sometimes they don’t.
Knowing this changes how you work with AI. You stop trusting output blindly. You verify claims. You give the model reference material in the prompt so it can pattern-match against correct information rather than whatever its training weights suggest. This is why the Ralph Loop emphasizes evaluating output at every iteration — not because the AI is unreliable, but because any generation system needs a feedback mechanism.
The pink elephant under the hood
Here’s something that clicks once you understand next-word prediction: the pink elephant problem.
When you tell the model “don’t write in a formal tone,” every word in that instruction — “formal,” “tone” — gets tokenized and fed into the prediction engine. “Formal tone” is now sitting in the context, shaping every prediction that follows. The very thing you told it to avoid is primed in the system that predicts the next word.
This isn’t a quirk. It’s mechanics. The model predicts based on what’s in the context, and you just loaded “formal tone” into the context. Tell it what you want instead, and you load the right patterns. “Write conversationally” primes conversational patterns. This is why positive framing isn’t a style preference — it’s mechanically how the system processes instructions.
Why this knowledge is power
Most people treat AI like a magic box. Prompt goes in, answer comes out, and when the answer is bad they either give up or try random changes until something works.
Understanding the mechanics turns random into systematic:
- Bad output? Check whether the context window has what the model needs. If you’re fifty messages deep, the early instructions may have fallen off the desk.
- Inconsistent results? Consider the temperature. Lower it for tasks that need precision. Raise it when you want exploration.
- Hallucinating facts? Provide the facts in the prompt. The model is better at pattern-matching against provided reference material than generating facts from training weights alone.
- Ignoring instructions? Look at how you framed them. Negative framing primes the wrong patterns. Restructure as positive statements about what you want.
- Agent losing track of the plan? It doesn’t have a plan — it has a context window. Put the plan in a spec file that stays loaded. This is the entire thesis behind spec-driven development.
Every technique in writing better prompts maps directly to the mechanics. Being succinct? Saves tokens and keeps important context on the desk. Giving the model a role? Primes a specific set of language patterns. Providing examples? Gives the prediction engine concrete patterns to match against instead of guessing from training data.
The five-year-old version, one more time
An LLM is a machine that got really good at finishing sentences by reading everything humans have ever written. It doesn’t think. It doesn’t understand. It predicts what word comes next, over and over, until it has a full response.
It’s incredibly good at this — good enough to write code, explain science, draft legal documents, and hold conversations that feel human. But it’s still predicting, not reasoning. When you know that, you stop asking “why is it so dumb?” and start asking “what patterns am I giving it to work with?”
That second question is the one that gets results.
Recommended reading
If you want to go deeper than the five-year-old version, these two books are worth your time:
- Build a Large Language Model (From Scratch) by Sebastian Raschka — walks you through building an actual LLM step by step. You’ll understand tokenization, attention mechanisms, and training loops by doing them, not just reading about them. Best way to demystify the black box.
- How AI Works: From Sorcery to Science by Ronald T. Kneusel — covers AI from the ground up without assuming a math degree. If this article made you curious about the layers beneath next-word prediction, this book peels them back methodically.