The first time you typed a question into ChatGPT and got back something that sounded like a thoughtful, articulate person had written it, you probably assumed the technology behind it was incredibly complex. Vast knowledge databases, sophisticated reasoning engines, maybe some digital wizardry too advanced for normal humans to comprehend.
The truth is almost comically simple. The model was doing one thing, over and over, billions of times: predicting the next word.
That’s the fundamental operation behind every large language model you’ve ever used. And the fact that “guess what word comes next” can produce coherent prose, sharp dialogue, and a passable query letter is arguably the most surprising discovery in the history of computing.
What It Actually Means
A large language model is an AI system trained on enormous quantities of text until it develops a deep statistical understanding of how language works. Not rules that someone programmed in. Not a database it looks things up in. Patterns it absorbed by reading more text than any human could process in a thousand lifetimes.
When you type a prompt into ChatGPT, Claude, or any LLM-powered writing tool, the model reads your words and calculates: based on everything I’ve learned about language, what’s the most likely next word? (Strictly speaking, models predict tokens, which are often fragments of words, but “word” is close enough for our purposes.) It picks one, feeds it back in, asks the same question, and repeats. Thousands of times. The result reads like natural prose because the model has internalized the statistical rhythms of billions of pages of natural prose.
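To make that loop concrete, here is a toy sketch in Python. It uses a crude word-pair model built from a twelve-word “corpus” rather than a real neural network (the corpus, the `follows` table, and the `generate` function are all invented for illustration), but the core loop is the same one described above: look at the last word, pick a likely next word, append it, repeat.

```python
import random

# A toy corpus standing in for "billions of pages" of training text.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

# Count which word tends to follow which: a crude stand-in for the
# statistical patterns a real model absorbs during training.
follows = {}
for prev, nxt in zip(corpus, corpus[1:]):
    follows.setdefault(prev, []).append(nxt)

def generate(start, n_words, seed=0):
    """Predict the next word, append it, and repeat -- the core loop."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(n_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # nothing ever followed this word in training
        words.append(rng.choice(candidates))  # sample a likely next word
    return " ".join(words)

print(generate("the", 8))
```

A real LLM replaces the word-pair table with 175 billion parameters and considers your entire prompt, not just the last word, but the generate-one-word-and-feed-it-back rhythm is identical.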
The “large” in the name refers to scale along two dimensions: the staggering amount of text the model trained on (often trillions of words) and the number of internal numerical settings, called parameters, that it uses to process language. GPT-3, the model that kicked open the door to the current era, has 175 billion parameters. Each one is a tiny dial that was adjusted, incrementally, during training until the model produced coherent output. Nobody set those dials by hand. The model figured them out by reading, essentially, everything.
If foundation model is the broad category of large, general-purpose AI systems, a large language model is the specific type that works with text. Every LLM is a foundation model, but not every foundation model is an LLM (image generators like Midjourney use a different architecture called a diffusion model). When people mention GPT-4, Claude, Gemini, or Llama, they’re talking about LLMs.
A Name Nobody Coined
Unlike “artificial intelligence,” which John McCarthy named at a workshop in 1956, or “machine learning,” which Arthur Samuel coined in 1959, “large language model” has no single inventor. The phrase drifted into academic papers around 2019, emerging naturally as the models grew big enough to need a new name.
Before that, researchers called them “pre-trained language models” or simply described them by architecture. But when OpenAI released GPT-2 in February 2019 with 1.5 billion parameters, and then GPT-3 followed in 2020 at 175 billion, the word “large” stopped being optional. These weren’t just language models anymore. They were something qualitatively different.
GPT-3 was the inflection point. It was more than ten times larger than any previous language model, and that raw scale unlocked abilities nobody had specifically trained it for. It could translate between languages, write working code, answer trivia, and generate essays on topics it had never been explicitly taught. Researchers called these “emergent abilities,” capabilities that appeared as a side effect of sheer size, like a musician who practices scales so obsessively that they start improvising jazz.
The public caught up two years later. On November 30, 2022, OpenAI released ChatGPT, wrapping GPT-3.5 in a simple chat interface. It reached one hundred million users in two months, the fastest adoption any consumer application had seen up to that point. Seemingly overnight, “LLM” went from an academic shorthand to something your neighbor might drop into a dinner conversation.
How the Sausage Gets Made
Training an LLM happens in stages, and each one shapes what the model becomes.
Pretraining is the foundation. The model reads trillions of words of text (books, websites, academic papers, code, conversations) with a single task: predict the next word. Every wrong prediction nudges its parameters slightly in the right direction. This happens billions of times. By the end, the model has absorbed an enormous amount about language, facts, logic, style, and structure, not because anyone told it what to learn, but because predicting language well turns out to require understanding language deeply. That’s the profound trick: a trivial-sounding task, executed at sufficient scale, produces something that looks a lot like comprehension.
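The “nudge its parameters slightly” mechanic can be shown with a single dial. The sketch below is an invented simplification, not how real training code looks: one parameter `p` (the model’s guess at how often “cat” follows “the”) gets nudged toward the observed data a little at a time, exactly the way each of GPT-3’s 175 billion dials gets nudged toward better predictions.

```python
# Toy illustration of pretraining's feedback loop: one "dial" (parameter)
# nudged slightly after every round of prediction errors. Real models
# adjust billions of dials at once, but the mechanic is the same.

# Observed data: after "the", the next word is "cat" 3 times out of 4.
observations = [1, 1, 1, 0]  # 1 = "cat" followed, 0 = something else

p = 0.5             # the dial: predicted probability that "cat" is next
learning_rate = 0.1

for step in range(500):
    # Average prediction error across the data; nudge the dial toward it.
    error = sum(y - p for y in observations) / len(observations)
    p += learning_rate * error

print(round(p, 2))  # the dial settles at 0.75, the true frequency
```

Nobody told the dial where to end up; it found the right setting purely by shrinking its prediction error, which is the whole trick of pretraining.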
After pretraining, the model is knowledgeable but raw. It can continue any text you start, but it doesn’t know how to follow instructions or be helpful.
Fine-tuning fixes that. Human experts write thousands of example conversations (a user asks something, the expert writes an ideal response), and the model trains on those examples. Think of it as a novelist who has read every book ever written finally getting an editor who teaches them how to take direction.
Reinforcement learning from human feedback (RLHF) refines things further. Human reviewers rate different model responses to the same prompt, ranking them by quality and helpfulness. The model learns from those rankings, gradually developing something that looks like judgment and taste. This stage is why ChatGPT, Claude, and Gemini have noticeably different personalities despite training on broadly similar text. Each company made different choices about what “good” means, and those choices shaped the model’s voice.
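Here is a rough sketch of how rankings alone can teach a model preferences. The response names, scores, and update rule are all invented for illustration (the update is a Bradley-Terry-style comparison, a common simplification, not any lab’s actual RLHF pipeline): each human comparison nudges the preferred response’s score up and the rejected one’s down.

```python
import math

# Toy reward scores for three candidate responses (names invented).
scores = {"helpful": 0.0, "evasive": 0.0, "rude": 0.0}

# Human reviewers' pairwise preferences: (preferred, rejected).
comparisons = [("helpful", "evasive"), ("helpful", "rude"),
               ("evasive", "rude")] * 50

lr = 0.1
for winner, loser in comparisons:
    # How strongly did the current scores expect the winner to win?
    p_win = 1 / (1 + math.exp(scores[loser] - scores[winner]))
    # The more surprising the human's choice, the bigger the nudge.
    scores[winner] += lr * (1 - p_win)
    scores[loser] -= lr * (1 - p_win)

ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # "helpful" rises to the top of the ranking
```

No one ever defines “helpful” in code; the ordering emerges entirely from which responses humans kept preferring, which is why different companies’ rating choices produce models with noticeably different tastes.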
Why It Matters for Your Writing Life
Every AI writing tool you use is either an LLM or is built on top of one. Sudowrite, NovelCrafter, ChatGPT, Claude, Jasper: all LLMs at their core, with different interfaces and customizations layered on top.
Understanding this gives you practical advantages.
You can write better prompts. An LLM isn’t searching a database for the right answer. It’s generating text based on patterns. The more context you provide (your genre, your voice, your specific scene), the better its predictions become. “Write a scene” gives the model almost nothing to work with. “Write a tense confrontation between a detective and her estranged sister in a rain-soaked parking garage, in the style of Tana French” gives it thousands of rich patterns to draw on.
You can evaluate tools more clearly. When a new AI writing app launches, you can ask the right question: which LLM powers this, and what have they built on top of it? Some tools add genuine value through thoughtful fine-tuning, smart interfaces, and author-specific features. Others are thin wrappers around the same model you could access directly through ChatGPT or Claude for less money. Knowing the difference saves you from overpaying for a pretty skin on someone else’s engine.
You can understand the limitations. LLMs predict probable language, not true statements. That’s why they sometimes “hallucinate,” confidently stating things that aren’t real. They’re not lying. They’re generating the most statistically plausible next word, and sometimes plausible and true diverge. Knowing this keeps you from trusting AI output blindly, especially for research, historical dates, or factual claims about real people and places.
The models keep getting more capable with every release, and more specialized too: some optimized for long-form fiction, others for editing, others for rapid brainstorming. But at their core, they’re all doing the same elegant, improbable thing. Predicting the next word, over and over, so well that the result reads like thought.