Tokenization
Before a model can process text, it needs to break it into pieces — tokens. The choice of how to split text has a surprisingly large impact on what the model can learn.
The Tokenization Trade-off
There are three strategies, each with a fundamental trade-off:
Character-Level
“hello” → ["h", "e", "l", "l", "o"] — 5 tokens
| ✅ Pros | ❌ Cons |
|---|---|
| Tiny vocabulary (~100 characters) | Very long sequences |
| Handles any word, any language | Hard to learn word-level meaning |
| No out-of-vocabulary words | Slow training and inference |
Word-Level
“hello world” → ["hello", "world"] — 2 tokens
| ✅ Pros | ❌ Cons |
|---|---|
| Short sequences | Huge vocabulary (100K+) |
| Natural semantic units | Can’t handle rare/new words |
“cryptocurrency” → [UNK] 😢
Subword: The Sweet Spot
“unhappiness” → ["un", "happiness"] — the model learns that “un” means negation!
| Input | Tokens | Why it works |
|---|---|---|
| “unhappiness” | ["un", "happiness"] | Learns morphemes |
| “playing” | ["play", "ing"] | Shares verb stems |
| “cryptocurrency” | ["crypt", "o", "currency"] | Handles novel words |
Subword tokenization gives us moderate vocabulary (30K–50K), handles any word (no UNK), and captures meaningful units like prefixes and suffixes.
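The trade-off can be seen directly by counting tokens at each granularity. The subword splits below are hand-picked for illustration, not produced by a trained tokenizer:

```python
text = "unhappiness is unbelievable"

# Character-level: tiny vocabulary, long sequence
char_tokens = list(text.replace(" ", "_"))          # 27 tokens

# Word-level: short sequence, but every new word is a vocabulary entry
word_tokens = text.split()                          # 3 tokens

# Subword-level: hand-picked splits for illustration
subword_tokens = ["un", "happiness", "is", "un", "believ", "able"]  # 6 tokens

print(len(char_tokens), len(word_tokens), len(subword_tokens))
```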
Byte Pair Encoding (BPE)
BPE is the most widely used subword algorithm. The idea is simple:
Training: Learn Merge Rules
- Start with individual characters as your vocabulary
- Count every adjacent pair in the training corpus
- Merge the most frequent pair into a new token
- Repeat until you reach the desired vocabulary size
Walkthrough
Given the corpus: “low low lower newest newest widest”
| Step | Most Frequent Pair | New Token | Vocabulary Change |
|---|---|---|---|
| Start | — | — | {l, o, w, e, r, n, s, t, i, d} |
| 1 | (e, s) → 3 times | es | + es |
| 2 | (es, t) → 3 times | est | + est |
| 3 | (l, o) → 3 times | lo | + lo |
| 4 | (lo, w) → 3 times | low | + low |
| 5 | (n, e) → 2 times | ne | + ne |
| … | … | … | … |

(Several pairs tie at 3 occurrences in this tiny corpus, e.g. (o, w) and (s, t); when counts tie, the merge order is arbitrary.)
After training, common words become single tokens (“low”), while rare words get split into known subwords.
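The pair counts from the walkthrough can be verified in a few lines of Python, a sketch of the counting step of the training loop (word frequencies taken from the corpus above, with an end-of-word marker):

```python
from collections import Counter

# Walkthrough corpus: "low low lower newest newest widest"
corpus = {"low": 2, "lower": 1, "newest": 2, "widest": 1}

# Count every adjacent symbol pair, weighted by word frequency
pairs = Counter()
for word, freq in corpus.items():
    symbols = list(word) + ["</w>"]
    for a, b in zip(symbols, symbols[1:]):
        pairs[(a, b)] += freq

print(pairs[("e", "s")], pairs[("l", "o")], pairs[("n", "e")])  # 3 3 2
```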
Applying BPE to New Text
To tokenize a new word, apply the learned merges in the order they were learned:

```
"lowest" → ['l','o','w','e','s','t']   (start with characters)
        → ['l','o','w','es','t']       (apply merge: e+s → es)
        → ['l','o','w','est']          (apply merge: es+t → est)
        → ['lo','w','est']             (apply merge: l+o → lo)
        → ['low','est']                (apply merge: lo+w → low)
```
BPE Tokenization Step by Step
```python
def train_bpe(corpus, num_merges):
    """Train BPE: repeatedly merge the most frequent adjacent pair."""
    # Initialize: each word as character sequence + end marker
    vocab = {}
    for word in corpus:
        chars = tuple(word) + ('</w>',)
        vocab[chars] = vocab.get(chars, 0) + 1
    merges = []
    for i in range(num_merges):
        # Count all adjacent pairs
        pairs = {}
        for word, freq in vocab.items():
            for j in range(len(word) - 1):
                pair = (word[j], word[j+1])
                pairs[pair] = pairs.get(pair, 0) + freq
        if not pairs:
            break
        # Merge most frequent pair
        best = max(pairs, key=pairs.get)
        new_token = best[0] + best[1]
        # Apply merge to all words in vocabulary
        new_vocab = {}
        for word, freq in vocab.items():
            new_word, idx = [], 0
            while idx < len(word):
                if idx < len(word)-1 and (word[idx], word[idx+1]) == best:
                    new_word.append(new_token)
                    idx += 2
                else:
                    new_word.append(word[idx])
                    idx += 1
            new_vocab[tuple(new_word)] = freq
        vocab = new_vocab
        merges.append((best, new_token))
    return merges
```

WordPiece (BERT’s Tokenizer)
WordPiece is similar to BPE but with a key difference:
| Algorithm | Merge criterion |
|---|---|
| BPE | Most frequent pair |
| WordPiece | Pair that maximizes likelihood of the data |
This means WordPiece prefers merges that are more surprising — pairs that co-occur more than you’d expect by chance.
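A common way to write this criterion is score(a, b) = count(ab) / (count(a) × count(b)): pair frequency normalized by the frequencies of its parts. The counts below are made up to illustrate the contrast:

```python
# Illustrative (made-up) token counts from a hypothetical corpus
counts = {"t": 1000, "h": 800, "th": 700,
          "q": 10,   "u": 50,  "qu": 10}

def wordpiece_score(a, b):
    """Likelihood-style score: pair frequency normalized by part frequencies."""
    return counts[a + b] / (counts[a] * counts[b])

# BPE would pick ("t", "h"): 700 occurrences vs 10 for ("q", "u").
# WordPiece picks ("q", "u"): "q" is almost always followed by "u".
print(wordpiece_score("q", "u"))  # 0.02
print(wordpiece_score("t", "h"))  # 0.000875
```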
The ## Convention
BERT uses ## to mark continuation subwords (not the start of a word):
| Word | Tokens | Meaning |
|---|---|---|
| “playing” | ["play", "##ing"] | “play” starts the word, “ing” continues it |
| “unhappy” | ["un", "##happy"] | “un” starts, “happy” continues |
| “unbelievable” | ["un", "##believ", "##able"] | Three pieces |
The ## prefix lets the model distinguish between “play” as a standalone word and “play” as the start of “playing”.
```python
def wordpiece_tokenize(word, vocab, max_chars=200):
    """WordPiece: greedy longest-match from left to right."""
    if len(word) > max_chars:
        return ["[UNK]"]
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        found = None
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr
            if substr in vocab:
                found = substr
                break
            end -= 1
        if found is None:
            return ["[UNK]"]
        tokens.append(found)
        start = end
    return tokens
```

Special Tokens
Models reserve special tokens for structural purposes:
| Token | Purpose | When it’s used |
|---|---|---|
| [PAD] | Padding | Making all sequences in a batch the same length |
| [UNK] | Unknown | Fallback for out-of-vocabulary tokens (rare with BPE) |
| [CLS] | Classification | BERT uses this position for sentence-level predictions |
| [SEP] | Separator | Between sentence pairs (e.g., question + passage) |
| [MASK] | Masking | BERT’s masked language model training |
| <\|endoftext\|> | End of text | GPT’s end-of-document marker |
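As a sketch of how these structural tokens compose a model input, here is a hypothetical helper that assembles a BERT-style sentence pair (the function name and max_len are illustrative, not a real library API):

```python
def build_bert_input(tokens_a, tokens_b, max_len=12):
    """Assemble [CLS] A [SEP] B [SEP], then pad; the mask marks real tokens."""
    seq = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    mask = [1] * len(seq) + [0] * (max_len - len(seq))
    seq = seq + ["[PAD]"] * (max_len - len(seq))
    return seq, mask

seq, mask = build_bert_input(["how", "are", "you"], ["fine"])
print(seq)
# ['[CLS]', 'how', 'are', 'you', '[SEP]', 'fine', '[SEP]',
#  '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
```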
Tokenizer Comparison
| Tokenizer | Used By | Vocab Size | Word Boundary Marker |
|---|---|---|---|
| BPE | GPT-2, RoBERTa | 50,257 | Ġ (space prefix) |
| WordPiece | BERT | 30,522 | ## (continuation) |
| SentencePiece | T5, LLaMA | Variable | ▁ (space prefix) |
| tiktoken (cl100k_base) | GPT-3.5/4 | 100,277 | Byte-level BPE |
Space Handling: Two Approaches
Different tokenizers handle word boundaries differently:
| Approach | Example | Who uses it |
|---|---|---|
| Space prefix Ġ / ▁ | “Hello world” → ["Hello", "Ġworld"] | GPT-2, T5, LLaMA |
| Continuation prefix ## | “playing” → ["play", "##ing"] | BERT |
The space-prefix approach treats spaces as part of the next token. The continuation approach marks non-first subwords. Both are valid — just different conventions.
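The space-prefix convention can be sketched in a few lines. Note this shows only the boundary-marking step: the real GPT-2 tokenizer operates on bytes and applies BPE merges afterwards:

```python
def gpt2_style_pretokenize(text):
    """Illustrative only: prefix each space-preceded word with 'Ġ'."""
    words = text.split(" ")
    return [words[0]] + ["Ġ" + w for w in words[1:]]

def detokenize(tokens):
    """The marker makes detokenization a simple join + replace."""
    return "".join(tokens).replace("Ġ", " ")

toks = gpt2_style_pretokenize("Hello world")
print(toks)              # ['Hello', 'Ġworld']
print(detokenize(toks))  # Hello world
```

Because the space travels with the token, concatenating tokens reconstructs the exact original string, with no separate record of where spaces were.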
Why Tokenization Matters
Semantic Quality
| Tokenization | Tokens | Can model learn? |
|---|---|---|
| “unhappiness” → ["un", "happiness"] | Meaningful splits | ✅ Learns “un” = negation |
| “unhappiness” → ["unha", "ppin", "ess"] | Arbitrary splits | ❌ No meaningful units |
Multilingual Fairness
Tokenizers trained primarily on English are unfair to other languages:
| Language | Text | Tokens | Ratio |
|---|---|---|---|
| English | “hello” | 1 token | 1:1 |
| Chinese | “你好” | 2–3 tokens | 2–3× more |
| Arabic | “مرحبا” | 3–5 tokens | 3–5× more |
This means non-English text uses more of the context window for the same semantic content — a fundamental equity issue in multilingual models.
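Part of the gap is visible even at the byte level: byte-fallback tokenizers ultimately encode unseen text as raw UTF-8 bytes, and non-Latin scripts need more bytes per character before any merges are learned:

```python
# UTF-8 byte counts: a lower bound for byte-level tokenizers
# that have learned few merges for a script
for text in ["hello", "你好", "مرحبا"]:
    print(text, len(text), "chars,", len(text.encode("utf-8")), "bytes")
# hello: 5 chars, 5 bytes | 你好: 2 chars, 6 bytes | مرحبا: 5 chars, 10 bytes
```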
Token Budget
Modern models have fixed context windows measured in tokens, not words:
| Model | Context Window | English Words | Chinese Characters |
|---|---|---|---|
| GPT-4 | 8,192 tokens | ~6,000 words | ~3,000 characters |
| GPT-4-128K | 128,000 tokens | ~96,000 words | ~48,000 characters |
The same document might fit in English but overflow in Chinese — purely because of tokenization.
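The word estimates in the table follow from a rough rule of thumb, sketched here with an assumed ratio of about 1.35 BPE tokens per English word (the exact ratio varies by tokenizer and text):

```python
def approx_english_words(token_budget, tokens_per_word=1.35):
    """Rough heuristic: English runs ~1.3-1.4 BPE tokens per word."""
    return int(token_budget / tokens_per_word)

print(approx_english_words(8192))     # 6068
print(approx_english_words(128_000))  # 94814
```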
Exercises

1. Manual BPE: Train BPE on ["low", "lower", "lowest", "newer", "newest"] with 5 merges. What vocabulary do you get?

   Hint: Start by counting character pairs across all words. The most frequent pairs will be related to the common endings (-er, -est) and beginnings (low, new).

2. Tokenize an unknown word: How would BERT’s WordPiece tokenize “cryptocurrency” if it’s not in the vocabulary?

   Answer: WordPiece would greedily match from left to right: it might produce something like ["cry", "##pt", "##oc", "##ur", "##ren", "##cy"], depending on which subwords are in its 30K vocabulary. The key insight is that it never returns [UNK] for a word that can be decomposed into known character sequences.

3. Token count estimation: Why does the sentence “I love natural language processing” likely produce more tokens than 5 (the word count)?

   Answer: While common words like “I”, “love”, and “natural” are likely single tokens, longer or rarer words may be split: “processing” might become ["process", "##ing"]. The exact count depends on the tokenizer’s vocabulary.
Next Steps
With embeddings and tokenization covered, we now have the complete input pipeline: text → tokens → embeddings → model. Next we move to the attention mechanism that revolutionized NLP: Seq2Seq Attention →.