Tokenization
Before a model can process text, it needs to break it into pieces — tokens. The choice of how to split text has a surprisingly large impact on what the model can learn.
The Tokenization Trade-off
There are three strategies, each with a fundamental trade-off:
Character-Level
“hello” → ["h", "e", "l", "l", "o"] — 5 tokens
| ✅ Pros | ❌ Cons |
|---|---|
| Tiny vocabulary (~100 characters) | Very long sequences |
| Handles any word, any language | Hard to learn word-level meaning |
| No out-of-vocabulary words | Slow training and inference |
Word-Level
“hello world” → ["hello", "world"] — 2 tokens
| ✅ Pros | ❌ Cons |
|---|---|
| Short sequences | Huge vocabulary (100K+) |
| Natural semantic units | Can’t handle rare/new words |
“cryptocurrency” → [UNK] 😢
Subword: The Sweet Spot
“unhappiness” → ["un", "happiness"] — the model learns that “un” means negation!
| Input | Tokens | Why it works |
|---|---|---|
| “unhappiness” | ["un", "happiness"] | Learns morphemes |
| “playing” | ["play", "ing"] | Shares verb stems |
| “cryptocurrency” | ["crypt", "o", "currency"] | Handles novel words |
Subword tokenization gives us moderate vocabulary (30K–50K), handles any word (no UNK), and captures meaningful units like prefixes and suffixes.
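The trade-off can be seen directly by counting tokens at each granularity. The subword splits below are hand-picked for illustration, not produced by a trained tokenizer:

```python
text = "unhappiness is unbelievable"

# Character-level: tiny vocabulary, long sequence
char_tokens = list(text.replace(" ", "_"))          # 27 tokens

# Word-level: short sequence, but every new word is a vocabulary entry
word_tokens = text.split()                          # 3 tokens

# Subword-level: hand-picked splits for illustration
subword_tokens = ["un", "happiness", "is", "un", "believ", "able"]  # 6 tokens

print(len(char_tokens), len(word_tokens), len(subword_tokens))
```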
Byte Pair Encoding (BPE)
BPE is the most widely used subword algorithm. The idea is simple:
Training: Learn Merge Rules
- Start with individual characters as your vocabulary
- Count every adjacent pair in the training corpus
- Merge the most frequent pair into a new token
- Repeat until you reach the desired vocabulary size
Walkthrough
Given the corpus: “low low lower newest newest widest”
| Step | Most Frequent Pair | New Token | Vocabulary Change |
|---|---|---|---|
| Start | — | — | {l, o, w, e, r, n, s, t, i, d} |
| 1 | (e, s) → 3 times | es | + es |
| 2 | (es, t) → 3 times | est | + est |
| 3 | (l, o) → 3 times | lo | + lo |
| 4 | (lo, w) → 3 times | low | + low |
| 5 | (n, e) → 2 times | ne | + ne |
| … | … | … | … |

(Several pairs tie at 3 occurrences in this tiny corpus, e.g. (o, w) and (s, t); when counts tie, the merge order is arbitrary.)
After training, common words become single tokens (“low”), while rare words get split into known subwords.
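The pair counts from the walkthrough can be verified in a few lines of Python, a sketch of the counting step of the training loop (word frequencies taken from the corpus above, with an end-of-word marker):

```python
from collections import Counter

# Walkthrough corpus: "low low lower newest newest widest"
corpus = {"low": 2, "lower": 1, "newest": 2, "widest": 1}

# Count every adjacent symbol pair, weighted by word frequency
pairs = Counter()
for word, freq in corpus.items():
    symbols = list(word) + ["</w>"]
    for a, b in zip(symbols, symbols[1:]):
        pairs[(a, b)] += freq

print(pairs[("e", "s")], pairs[("l", "o")], pairs[("n", "e")])  # 3 3 2
```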
Applying BPE to New Text
To tokenize a new word, apply the learned merges in the order they were learned:

```
"lowest" → ['l','o','w','e','s','t']   (start with characters)
        → ['l','o','w','es','t']       (apply merge: e+s → es)
        → ['l','o','w','est']          (apply merge: es+t → est)
        → ['lo','w','est']             (apply merge: l+o → lo)
        → ['low','est']                (apply merge: lo+w → low)
```
BPE Tokenization Step by Step
```python
def train_bpe(corpus, num_merges):
    """Train BPE: repeatedly merge the most frequent adjacent pair."""
    # Initialize: each word as character sequence + end marker
    vocab = {}
    for word in corpus:
        chars = tuple(word) + ('</w>',)
        vocab[chars] = vocab.get(chars, 0) + 1
    merges = []
    for i in range(num_merges):
        # Count all adjacent pairs
        pairs = {}
        for word, freq in vocab.items():
            for j in range(len(word) - 1):
                pair = (word[j], word[j+1])
                pairs[pair] = pairs.get(pair, 0) + freq
        if not pairs:
            break
        # Merge most frequent pair
        best = max(pairs, key=pairs.get)
        new_token = best[0] + best[1]
        # Apply merge to all words in vocabulary
        new_vocab = {}
        for word, freq in vocab.items():
            new_word, idx = [], 0
            while idx < len(word):
                if idx < len(word)-1 and (word[idx], word[idx+1]) == best:
                    new_word.append(new_token)
                    idx += 2
                else:
                    new_word.append(word[idx])
                    idx += 1
            new_vocab[tuple(new_word)] = freq
        vocab = new_vocab
        merges.append((best, new_token))
    return merges
```

WordPiece (BERT’s Tokenizer)
WordPiece is similar to BPE but with a key difference:
| Algorithm | Merge criterion |
|---|---|
| BPE | Most frequent pair |
| WordPiece | Pair that maximizes likelihood of the data |
This means WordPiece prefers merges that are more surprising — pairs that co-occur more than you’d expect by chance.
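A common way to write this criterion is score(a, b) = count(ab) / (count(a) × count(b)): pair frequency normalized by the frequencies of its parts. The counts below are made up to illustrate the contrast:

```python
# Illustrative (made-up) token counts from a hypothetical corpus
counts = {"t": 1000, "h": 800, "th": 700,
          "q": 10,   "u": 50,  "qu": 10}

def wordpiece_score(a, b):
    """Likelihood-style score: pair frequency normalized by part frequencies."""
    return counts[a + b] / (counts[a] * counts[b])

# BPE would pick ("t", "h"): 700 occurrences vs 10 for ("q", "u").
# WordPiece picks ("q", "u"): "q" is almost always followed by "u".
print(wordpiece_score("q", "u"))  # 0.02
print(wordpiece_score("t", "h"))  # 0.000875
```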
The ## Convention
BERT uses ## to mark continuation subwords (not the start of a word):
| Word | Tokens | Meaning |
|---|---|---|
| “playing” | ["play", "##ing"] | “play” starts the word, “ing” continues it |
| “unhappy” | ["un", "##happy"] | “un” starts, “happy” continues |
| “unbelievable” | ["un", "##believ", "##able"] | Three pieces |
The ## prefix lets the model distinguish between “play” as a standalone word and “play” as the start of “playing”.
```python
def wordpiece_tokenize(word, vocab, max_chars=200):
    """WordPiece: greedy longest-match from left to right."""
    if len(word) > max_chars:
        return ["[UNK]"]
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        found = None
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr
            if substr in vocab:
                found = substr
                break
            end -= 1
        if found is None:
            return ["[UNK]"]
        tokens.append(found)
        start = end
    return tokens
```

Special Tokens
Models reserve special tokens for structural purposes:
| Token | Purpose | When it’s used |
|---|---|---|
| [PAD] | Padding | Making all sequences in a batch the same length |
| [UNK] | Unknown | Fallback for out-of-vocabulary tokens (rare with BPE) |
| [CLS] | Classification | BERT uses this position for sentence-level predictions |
| [SEP] | Separator | Between sentence pairs (e.g., question + passage) |
| [MASK] | Masking | BERT’s masked language model training |
| <\|endoftext\|> | End of text | GPT’s end-of-document marker |
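As a sketch of how these structural tokens compose a model input, here is a hypothetical helper that assembles a BERT-style sentence pair (the function name and max_len are illustrative, not a real library API):

```python
def build_bert_input(tokens_a, tokens_b, max_len=12):
    """Assemble [CLS] A [SEP] B [SEP], then pad; the mask marks real tokens."""
    seq = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    mask = [1] * len(seq) + [0] * (max_len - len(seq))
    seq = seq + ["[PAD]"] * (max_len - len(seq))
    return seq, mask

seq, mask = build_bert_input(["how", "are", "you"], ["fine"])
print(seq)
# ['[CLS]', 'how', 'are', 'you', '[SEP]', 'fine', '[SEP]',
#  '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
```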
Tokenizer Comparison
| Tokenizer | Used By | Vocab Size | Word Boundary Marker |
|---|---|---|---|
| BPE | GPT-2, RoBERTa | 50,257 | Ġ (space prefix) |
| WordPiece | BERT | 30,522 | ## (continuation) |
| SentencePiece | T5, LLaMA | Variable | ▁ (space prefix) |
| tiktoken (cl100k_base) | GPT-3.5/4 | 100,277 | Byte-level BPE |
Space Handling: Two Approaches
Different tokenizers handle word boundaries differently:
| Approach | Example | Who uses it |
|---|---|---|
| Space prefix Ġ / ▁ | “Hello world” → ["Hello", "Ġworld"] | GPT-2, T5, LLaMA |
| Continuation prefix ## | “playing” → ["play", "##ing"] | BERT |
The space-prefix approach treats spaces as part of the next token. The continuation approach marks non-first subwords. Both are valid — just different conventions.
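The space-prefix convention can be sketched in a few lines. Note this shows only the boundary-marking step: the real GPT-2 tokenizer operates on bytes and applies BPE merges afterwards:

```python
def gpt2_style_pretokenize(text):
    """Illustrative only: prefix each space-preceded word with 'Ġ'."""
    words = text.split(" ")
    return [words[0]] + ["Ġ" + w for w in words[1:]]

def detokenize(tokens):
    """The marker makes detokenization a simple join + replace."""
    return "".join(tokens).replace("Ġ", " ")

toks = gpt2_style_pretokenize("Hello world")
print(toks)              # ['Hello', 'Ġworld']
print(detokenize(toks))  # Hello world
```

Because the space travels with the token, concatenating tokens reconstructs the exact original string, with no separate record of where spaces were.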
Why Tokenization Matters
Semantic Quality
| Tokenization | Tokens | Can model learn? |
|---|---|---|
| “unhappiness” → ["un", "happiness"] | Meaningful splits | ✅ Learns “un” = negation |
| “unhappiness” → ["unha", "ppin", "ess"] | Arbitrary splits | ❌ No meaningful units |
Multilingual Fairness
Tokenizers trained primarily on English are unfair to other languages:
| Language | Text | Tokens | Ratio |
|---|---|---|---|
| English | “hello” | 1 token | 1:1 |
| Chinese | “你好” | 2–3 tokens | 2–3× more |
| Arabic | “مرحبا” | 3–5 tokens | 3–5× more |
This means non-English text uses more of the context window for the same semantic content — a fundamental equity issue in multilingual models.
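Part of the gap is visible even at the byte level: byte-fallback tokenizers ultimately encode unseen text as raw UTF-8 bytes, and non-Latin scripts need more bytes per character before any merges are learned:

```python
# UTF-8 byte counts: a lower bound for byte-level tokenizers
# that have learned few merges for a script
for text in ["hello", "你好", "مرحبا"]:
    print(text, len(text), "chars,", len(text.encode("utf-8")), "bytes")
# hello: 5 chars, 5 bytes | 你好: 2 chars, 6 bytes | مرحبا: 5 chars, 10 bytes
```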
Token Budget
Modern models have fixed context windows measured in tokens, not words:
| Model | Context Window | English Words | Chinese Characters |
|---|---|---|---|
| GPT-4 | 8,192 tokens | ~6,000 words | ~3,000 characters |
| GPT-4-128K | 128,000 tokens | ~96,000 words | ~48,000 characters |
The same document might fit in English but overflow in Chinese — purely because of tokenization.
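The word estimates in the table follow from a rough rule of thumb, sketched here with an assumed ratio of about 1.35 BPE tokens per English word (the exact ratio varies by tokenizer and text):

```python
def approx_english_words(token_budget, tokens_per_word=1.35):
    """Rough heuristic: English runs ~1.3-1.4 BPE tokens per word."""
    return int(token_budget / tokens_per_word)

print(approx_english_words(8192))     # 6068
print(approx_english_words(128_000))  # 94814
```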
Exercises

1. Manual BPE: Train BPE on ["low", "lower", "lowest", "newer", "newest"] with 5 merges. What vocabulary do you get?

   Hint: Start by counting character pairs across all words. The most frequent pairs will be related to the common endings (-er, -est) and beginnings (low, new).

2. Tokenize an unknown word: How would BERT’s WordPiece tokenize “cryptocurrency” if it’s not in the vocabulary?

   Answer: WordPiece would greedily match from left to right: it might produce something like ["cry", "##pt", "##oc", "##ur", "##ren", "##cy"], depending on which subwords are in its 30K vocabulary. The key insight is that it never returns [UNK] for a word that can be decomposed into known character sequences.

3. Token count estimation: Why does the sentence “I love natural language processing” likely produce more tokens than 5 (the word count)?

   Answer: While common words like “I”, “love”, and “natural” are likely single tokens, longer or rarer words may be split: “processing” might become ["process", "##ing"]. The exact count depends on the tokenizer’s vocabulary.
Next Steps
With embeddings and tokenization covered, we now have the complete input pipeline: text → tokens → embeddings → model. Next we move to the attention mechanism that revolutionized NLP: Seq2Seq Attention →.