Token

When you type a sentence into ChatGPT or Claude, the AI doesn’t read your words. It can’t. Large language models have no concept of “words” at all. Before a model can process a single thing you’ve written, your text gets chopped into smaller pieces, fragments that sometimes look like words, sometimes look like syllables, and sometimes look like nonsense. These fragments are called tokens. They’re the actual currency of every AI interaction you’ve ever had, and understanding them unlocks a surprisingly practical set of insights about how your writing tools really work.

The Smallest Unit the AI Can See

A token is the fundamental unit of text that a language model processes. Think of it as the atom of AI language, the smallest piece the model can recognize and work with.

When you type “The detective examined the manuscript” into an AI writing tool, the model doesn’t see five words. It sees a sequence of numbers, each one representing a chunk of text from its vocabulary. Common words like “the” are usually a single token. Longer or less common words get split into pieces: “manuscript” might become “man” + “usc” + “ript,” three tokens for one word.
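To make that concrete, here is a toy sketch of the idea, with a hand-made six-entry vocabulary and a greedy longest-match split. This is illustrative only: real tokenizers use vocabularies of tens of thousands of entries learned from data (and a different matching algorithm), but the output shape is the same: pieces, then integer IDs.

```python
# Toy vocabulary mapping text chunks to integer IDs. Real vocabularies
# are learned from data and hold ~50,000-100,000 entries.
VOCAB = {"the": 1, "detective": 2, "examined": 3, "man": 4, "usc": 5, "ript": 6}

def split_word(word, vocab):
    """Greedily take the longest vocabulary entry that prefixes what's left."""
    pieces = []
    while word:
        for end in range(len(word), 0, -1):   # try the longest candidate first
            if word[:end] in vocab:
                pieces.append(word[:end])
                word = word[end:]
                break
        else:
            raise ValueError(f"no vocabulary entry matches {word!r}")
    return pieces

pieces = split_word("manuscript", VOCAB)
print(pieces)                       # ['man', 'usc', 'ript']
print([VOCAB[p] for p in pieces])   # [4, 5, 6] -- what the model actually sees
```

The model never sees the strings at all, only the final list of integers.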

The rough math in English works out to about 1.3 tokens per word, or (flipping that around) about 75 words per 100 tokens. That ratio matters more than you’d think, because almost everything about how AI tools work, from how much text they can “remember” to how much they cost to run, is measured in tokens.
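The rule of thumb is a one-liner in each direction. The constants below are the article’s rough ratios, not exact figures; the real ratio varies by tokenizer and by text:

```python
# Rough English heuristics: ~1.3 tokens per word, or ~75 words per 100 tokens.
def estimate_tokens(word_count, tokens_per_word=1.3):
    return round(word_count * tokens_per_word)

def estimate_words(token_count, words_per_100_tokens=75):
    return round(token_count * words_per_100_tokens / 100)

print(estimate_tokens(1_000))    # ~1300 tokens for a 1,000-word article
print(estimate_words(200_000))   # ~150,000 words fit in a 200K-token window
```

The second call is the same arithmetic behind the context-window sizes quoted later in this piece.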

A Thousand Years of Meaning “Small Unit of Something”

The word “token” has been doing essentially the same job in English for over a millennium. It descends from Old English tacen, meaning “sign” or “symbol,” which shares an ancestor with the word “teach.” Both trace back to a Proto-Indo-European root, deik, meaning “to show” or “to point.” (That same root also gave us Latin dicere, “to say,” and digitus, “finger,” the thing you point with.)

By the 1590s, “token” had acquired the meaning of a coin-like piece of stamped metal: a stand-in for real currency within a closed system. Subway tokens, arcade tokens, poker chips. Small, standardized units that represent value. That’s almost exactly what AI tokens are, too.

The computing world borrowed the term in the 1960s. Compiler designers needed a name for the smallest meaningful chunks their programs identified when reading source code: variables, operators, keywords. Each one was a “token.” When natural language processing emerged as a field, the word came along naturally. If a compiler tokenizes code into its smallest meaningful parts, an NLP system tokenizes language into its smallest processable parts.

There’s a pleasing circularity here. A word that originally meant “to show” or “to teach” is now the name for what we show AI models so they can learn patterns in language. The etymology, for once, actually fits.

The Problem of How to Chop Up Language

For decades, the question of how to split text into tokens didn’t have a good answer.

Early NLP systems took the obvious approach: split on spaces. Every word becomes a token. Simple, clean, and deeply flawed. English alone has hundreds of thousands of words, with new ones appearing constantly. Misspellings, proper nouns, slang, compound words, words borrowed from other languages: a word-level tokenizer chokes on anything it hasn’t seen before.

The opposite extreme, making every single character a token, handles any text you throw at it. But the sequences become absurdly long, and the model loses the ability to see word-level patterns.

The breakthrough came from an unexpected place. In February 1994, a programmer named Philip Gage published a short article in C Users Journal describing a data compression algorithm called Byte Pair Encoding (BPE). The idea was simple: scan a file, find the most frequently occurring pair of adjacent bytes, replace that pair with a single new symbol, and repeat. Common patterns get merged into single units. Rare patterns stay broken apart.
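A minimal sketch of one round of that procedure, in Python rather than Gage’s original C (function and symbol names here are illustrative, not from his article):

```python
from collections import Counter

def most_frequent_pair(data):
    """Count every adjacent pair of symbols and return the most common one."""
    pairs = Counter(zip(data, data[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(data, pair, new_symbol):
    """Replace each occurrence of `pair` with `new_symbol`, left to right."""
    out, i = [], 0
    while i < len(data):
        if i + 1 < len(data) and (data[i], data[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return out

data = list("abababab")
pair = most_frequent_pair(data)      # ('a', 'b') is the most common pair
data = merge_pair(data, pair, "ab")  # ['ab', 'ab', 'ab', 'ab']
```

Gage repeated these two steps until no pair occurred often enough to be worth merging; that repetition is what turns frequent patterns into single units.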

Twenty years later, three researchers at the University of Edinburgh (Rico Sennrich, Barry Haddow, and Alexandra Birch) realized Gage’s compression trick could solve the tokenization problem for AI translation systems. Instead of compressing bytes for storage, you could compress characters into subword units for language processing. Their 2015 paper adapted BPE for neural machine translation, and the technique worked so well that it became the foundation of virtually every modern tokenizer.

The process has an almost puzzle-like quality. You start with individual characters. Find the most frequent pair (“t” and “h” appear together constantly in English). Merge them into one token: “th.” Now “th” and “e” are the most popular pair. Merge those: “the.” Keep going until you’ve built a vocabulary of tens of thousands of tokens. The algorithm independently rediscovers the most natural building blocks of English: “ing,” “tion,” “er,” “the.” Nobody programs these patterns in. They emerge from frequency alone.
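That training loop can be sketched in a few lines. The five-word “corpus” below is a stand-in for illustration (real tokenizers train on billions of words, and production BPE has refinements this sketch omits), but the merges it discovers follow exactly the sequence described above:

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (a toy stand-in for a corpus)."""
    corpus = [list(w) for w in words]   # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append(a + b)
        for idx, seq in enumerate(corpus):    # apply the merge everywhere
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            corpus[idx] = out
    return merges, corpus

merges, corpus = train_bpe(["the", "then", "there", "that", "this"], 2)
print(merges)   # ['th', 'the'] -- frequency alone finds the building blocks
```

No linguistic knowledge goes in; “th” and “the” fall out of pair counts alone, just as the article describes.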

The Quirks That Surprise Everyone

Tokenization has some genuinely unexpected behaviors once you look under the hood.

The word “ the” (with a leading space) and “the” (without one) are completely different tokens. In running text, most words have a space before them, and the tokenizer learns to treat that space as part of the token rather than a separate unit. This is invisible to you as a user, but it’s one of the reasons AI models are sensitive to formatting in ways that can seem arbitrary.

Numbers are strangely fragmented: “1234567890” becomes four tokens, not one or ten, which partly explains why language models stumble with arithmetic. Emojis are even worse. Over 97% of emojis get split into multiple token fragments, each so rare in the training data that the model barely knows what to do with them.

Non-English languages pay a real penalty. Because most tokenizers are trained primarily on English text, Chinese, Japanese, and Korean characters often require two to three tokens each. A passage in Mandarin can consume nearly twice as many tokens as its English equivalent, which means shorter effective context windows and higher costs for non-English users.

Why Tokens Shape Your Entire AI Experience

If you use any AI writing tool, tokens are silently governing what’s possible.

Context windows are measured in tokens. When Claude advertises a 200,000-token context window, that translates to roughly 150,000 words of English text, about two full novels. That’s the total amount of text (your input plus the AI’s response) the model can hold in its working memory at once. When a chatbot seems to “forget” something you mentioned earlier in a long conversation, you’ve likely bumped up against the token limit.

Pricing runs on tokens. AI companies charge per token through their APIs, with output tokens (what the model generates) costing more than input tokens (what you send). Consumer tools like ChatGPT and Sudowrite bundle these costs into subscriptions, but they’re still there under the surface, quietly determining how many words you can generate before hitting a usage cap. Sudowrite’s story credits, for instance, correspond to token allocations under the hood.
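As a back-of-the-envelope illustration, a per-request cost works out like this. The prices below are assumptions invented for the example, not any provider’s actual rates; real pricing varies by model and changes over time:

```python
# Hypothetical rates for illustration only -- check providers' pricing pages.
PRICE_PER_1M_INPUT = 3.00    # dollars per million input tokens (assumed)
PRICE_PER_1M_OUTPUT = 15.00  # dollars per million output tokens (assumed)

def request_cost(input_tokens, output_tokens):
    """Cost in dollars for one API request at the assumed rates."""
    return (input_tokens * PRICE_PER_1M_INPUT
            + output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000

# A 2,000-token prompt that produces a 1,000-token reply:
print(f"${request_cost(2_000, 1_000):.4f}")   # $0.0210
```

Note the asymmetry: at these assumed rates, the 1,000 generated tokens cost several times more than the 2,000 tokens you sent, which is why verbose outputs drain a budget faster than verbose prompts.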

Your prompts consume tokens, too. Every word of context you provide, every instruction, every chunk of manuscript you paste in, all of it counts against the token budget. Being concise in your prompts isn’t just good practice for getting better results. It’s efficient use of a limited resource. A bloated system prompt eats into the space available for your actual writing.

Understanding tokens won’t change your prose. But the next time an AI tool warns you about length limits, charges more for a longer interaction, or seems to lose the thread of a conversation, you’ll know exactly what’s happening at the mechanical level. And that knowledge, small as it sounds, is the difference between using a tool and understanding it.