Tokenization

Before a model can process text, it needs to break it into pieces — tokens. The choice of how to split text has a surprisingly large impact on what the model can learn.


The Tokenization Trade-off

There are three strategies, each with a fundamental trade-off:

Character-Level

“hello” → ["h", "e", "l", "l", "o"] — 5 tokens

| ✅ Pros | ❌ Cons |
|---|---|
| Tiny vocabulary (~100 characters) | Very long sequences |
| Handles any word, any language | Hard to learn word-level meaning |
| No out-of-vocabulary words | Slow training and inference |

Word-Level

“hello world” → ["hello", "world"] — 2 tokens

| ✅ Pros | ❌ Cons |
|---|---|
| Short sequences | Huge vocabulary (100K+) |
| Natural semantic units | Can’t handle rare/new words |
| | “cryptocurrency” → [UNK] 😢 |

Subword: The Sweet Spot

“unhappiness” → ["un", "happiness"] — the model learns that “un” means negation!

| Input | Tokens | Why it works |
|---|---|---|
| “unhappiness” | ["un", "happiness"] | Learns morphemes |
| “playing” | ["play", "ing"] | Shares verb stems |
| “cryptocurrency” | ["crypt", "o", "currency"] | Handles novel words |

Subword tokenization gives us moderate vocabulary (30K–50K), handles any word (no UNK), and captures meaningful units like prefixes and suffixes.

🤔 Quick Check
Why is subword tokenization preferred over word-level or character-level?

Byte Pair Encoding (BPE)

BPE is the most widely used subword algorithm. The idea is simple:

Training: Learn Merge Rules

  1. Start with individual characters as your vocabulary
  2. Count every adjacent pair in the training corpus
  3. Merge the most frequent pair into a new token
  4. Repeat until you reach the desired vocabulary size

Walkthrough

Given the corpus: “low low lower newest newest widest”

| Step | Most Frequent Pair | New Token | Vocabulary Change |
|---|---|---|---|
| Start | — | — | {l, o, w, e, r, n, s, t, i, d} |
| 1 | (e, s) → 3 times | es | + es |
| 2 | (es, t) → 3 times | est | + est |
| 3 | (l, o) → 3 times | lo | + lo |
| 4 | (lo, w) → 3 times | low | + low |
| 5 | (n, e) → 2 times | ne | + ne |

(Several pairs tie at 3 occurrences in this tiny corpus — (o, w), (w, e), and (s, t) among them — so the exact merge order depends on how ties are broken.)

After training, common words become single tokens (“low”), while rare words get split into known subwords.
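The pair counts on this corpus can be checked directly (a quick sketch that counts adjacent character pairs within each word, ignoring the end-of-word marker):

```python
from collections import Counter

corpus = "low low lower newest newest widest".split()

pairs = Counter()
for word in corpus:
    for pair in zip(word, word[1:]):  # adjacent character pairs within each word
        pairs[pair] += 1

print(pairs.most_common(6))
```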

Applying BPE to New Text

To tokenize a new word, apply the learned merges in order:

"lowest" → ['l','o','w','e','s','t']     (start with characters)
         → ['lo','w','e','s','t']        (apply merge: l+o → lo)
         → ['low','e','s','t']           (apply merge: lo+w → low)
         → ['low','es','t']              (apply merge: e+s → es)
         → ['low','est']                 (apply merge: es+t → est)
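The trace above can be sketched as a small function that replays the learned merge list in training order (a minimal sketch; the `merges` list below is the one from the walkthrough, without end-of-word markers):

```python
def apply_bpe(word, merges):
    """Tokenize a new word by replaying learned BPE merges in training order."""
    tokens = list(word)  # start from individual characters
    for (a, b), new_token in merges:
        out, i = [], 0
        while i < len(tokens):
            # replace every occurrence of the pair (a, b) with the merged token
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(new_token)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Merges from the walkthrough — order matters
merges = [(('l', 'o'), 'lo'), (('lo', 'w'), 'low'),
          (('e', 's'), 'es'), (('es', 't'), 'est')]

print(apply_bpe("lowest", merges))  # ['low', 'est']
```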

BPE in Action

For a common word like “unhappiness”, BPE starts from individual characters and merges them back into a single known token — compare the three strategies:

| Strategy | Result |
|---|---|
| Character-level | 11 tokens: ["u", "n", "h", "a", "p", "p", "i", "n", "e", "s", "s"] |
| BPE | 1 token: ["unhappiness"] |
| Word-level | 1 token: ["unhappiness"] |
🤔 Quick Check
How does BPE handle a word it has never seen during training (like 'cryptocurrency')?
Training BPE in Code

def train_bpe(corpus, num_merges):
    """Train BPE: repeatedly merge the most frequent adjacent pair."""
    # Initialize: each word as character sequence + end marker
    vocab = {}
    for word in corpus:
        chars = tuple(word) + ('</w>',)
        vocab[chars] = vocab.get(chars, 0) + 1
    
    merges = []
    for i in range(num_merges):
        # Count all adjacent pairs
        pairs = {}
        for word, freq in vocab.items():
            for j in range(len(word) - 1):
                pair = (word[j], word[j+1])
                pairs[pair] = pairs.get(pair, 0) + freq
        
        if not pairs:
            break
        
        # Merge most frequent pair
        best = max(pairs, key=pairs.get)
        new_token = best[0] + best[1]
        
        # Apply merge to all words in vocabulary
        new_vocab = {}
        for word, freq in vocab.items():
            new_word, idx = [], 0
            while idx < len(word):
                if idx < len(word)-1 and (word[idx], word[idx+1]) == best:
                    new_word.append(new_token)
                    idx += 2
                else:
                    new_word.append(word[idx])
                    idx += 1
            new_vocab[tuple(new_word)] = freq
        
        vocab = new_vocab
        merges.append((best, new_token))
    
    return merges
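To sanity-check the trainer, here is a standalone, condensed equivalent run on the walkthrough corpus (a sketch using `collections.Counter`; ties between equally frequent pairs are broken by dictionary order, so the merge sequence may differ from the table above):

```python
from collections import Counter

def train_bpe_compact(corpus, num_merges):
    """Same algorithm as train_bpe above, condensed with collections.Counter."""
    vocab = Counter(tuple(w) + ('</w>',) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if word[i:i + 2] == (a, b):  # apply the merge wherever it occurs
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
        merges.append(((a, b), a + b))
    return merges

corpus = ["low", "low", "lower", "newest", "newest", "widest"]
merges = train_bpe_compact(corpus, 5)
print([token for _, token in merges])
```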

WordPiece (BERT’s Tokenizer)

WordPiece is similar to BPE but with a key difference:

| Algorithm | Merge criterion |
|---|---|
| BPE | Most frequent pair |
| WordPiece | Pair that maximizes likelihood of the data |

$$\text{pair}^* = \arg\max_{\text{pair}} \frac{P(\text{merged})}{P(a) \cdot P(b)}$$

This means WordPiece prefers merges that are more surprising — pairs that co-occur more than you’d expect by chance.
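The scoring rule can be sketched from raw counts (a toy illustration — all numbers below are made up for the example):

```python
def wordpiece_score(pair_count, count_a, count_b, total):
    """Score = P(ab) / (P(a) * P(b)); > 1 means the pair co-occurs more than chance."""
    return (pair_count / total) / ((count_a / total) * (count_b / total))

total = 1000  # total symbol occurrences in a toy corpus

# Pair A is frequent, but its parts appear everywhere -> no association beyond chance
score_a = wordpiece_score(10, 100, 100, total)  # ≈ 1.0

# Pair B is rarer, but its parts almost always appear together -> strong association
score_b = wordpiece_score(6, 10, 10, total)     # ≈ 60.0

# BPE would merge pair A (higher raw count); WordPiece merges pair B (higher score)
```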

The ## Convention

BERT uses ## to mark continuation subwords (not the start of a word):

| Word | Tokens | Meaning |
|---|---|---|
| “playing” | ["play", "##ing"] | “play” starts the word, “ing” continues it |
| “unhappy” | ["un", "##happy"] | “un” starts, “happy” continues |
| “unbelievable” | ["un", "##believ", "##able"] | Three pieces |

The ## prefix lets the model distinguish between “play” as a standalone word and “play” as the start of “playing”.

def wordpiece_tokenize(word, vocab, max_chars=200):
    """WordPiece: greedy longest-match from left to right."""
    if len(word) > max_chars:
        return ["[UNK]"]
    
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        found = None
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr
            if substr in vocab:
                found = substr
                break
            end -= 1
        if found is None:
            return ["[UNK]"]
        tokens.append(found)
        start = end
    return tokens
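With a toy vocabulary, the greedy longest-match is easy to trace by hand (the vocabulary below is hypothetical and far smaller than a real ~30K BERT vocabulary; `wordpiece_tokenize` is repeated so the snippet runs on its own):

```python
def wordpiece_tokenize(word, vocab, max_chars=200):
    """WordPiece: greedy longest-match from left to right (as defined above)."""
    if len(word) > max_chars:
        return ["[UNK]"]
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        found = None
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr
            if substr in vocab:
                found = substr
                break
            end -= 1
        if found is None:
            return ["[UNK]"]
        tokens.append(found)
        start = end
    return tokens

# Toy vocabulary for illustration only
vocab = {"play", "##ing", "un", "##happy"}

print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
print(wordpiece_tokenize("unhappy", vocab))  # ['un', '##happy']
print(wordpiece_tokenize("xyz", vocab))      # ['[UNK]'] — no known pieces
```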

Special Tokens

Models reserve special tokens for structural purposes:

| Token | Purpose | When it’s used |
|---|---|---|
| [PAD] | Padding | Making all sequences in a batch the same length |
| [UNK] | Unknown | Fallback for out-of-vocabulary tokens (rare with BPE) |
| [CLS] | Classification | BERT uses this position for sentence-level predictions |
| [SEP] | Separator | Between sentence pairs (e.g., question + passage) |
| [MASK] | Masking | BERT’s masked language model training |
| <\|endoftext\|> | End of text | GPT’s end-of-document marker |
✍️ Fill in the Blanks
BERT uses the ______ token at the start of every input for classification tasks, and the ______ token to separate sentence pairs.
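These conventions come together when assembling a sentence-pair input. A minimal sketch (the token lists and `max_len` here are illustrative, not a real model’s preprocessing code):

```python
def build_bert_input(tokens_a, tokens_b, max_len=16):
    """Assemble [CLS] A [SEP] B [SEP], then pad the sequence to max_len."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # segment ids tell the model which sentence each token belongs to
    segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    pad = max_len - len(tokens)
    return tokens + ["[PAD]"] * pad, segments + [0] * pad

tokens, segments = build_bert_input(["how", "are", "you"], ["fine", "thanks"])
print(tokens)    # starts with '[CLS]', two '[SEP]'s, padded out to max_len
print(segments)  # 0s for the first sentence, 1s for the second
```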

Tokenizer Comparison

| Tokenizer | Used By | Vocab Size | Word Boundary Marker |
|---|---|---|---|
| BPE | GPT-2, RoBERTa | 50,257 | Ġ (space prefix) |
| WordPiece | BERT | 30,522 | ## (continuation) |
| SentencePiece | T5, LLaMA | Variable | ▁ (space prefix) |
| tiktoken | GPT-3.5/4 | 100,277 | Byte-level BPE |

Space Handling: Two Approaches

Different tokenizers handle word boundaries differently:

| Approach | Example | Who uses it |
|---|---|---|
| Space prefix (Ġ / ▁) | “Hello world” → ["Hello", "Ġworld"] | GPT-2, T5, LLaMA |
| Continuation prefix (##) | “playing” → ["play", "##ing"] | BERT |

The space-prefix approach treats spaces as part of the next token. The continuation approach marks non-first subwords. Both are valid — just different conventions.
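The two conventions invert differently when detokenizing. A conceptual sketch (real GPT-2 decoding works at the byte level, so this is simplified):

```python
def detokenize_gpt2_style(tokens):
    """Invert space-prefix tokenization: Ġ means 'this token starts a new word'."""
    return "".join(tokens).replace("Ġ", " ")

def detokenize_wordpiece_style(tokens):
    """Invert ##-continuation tokenization: glue ##-pieces onto the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

print(detokenize_gpt2_style(["Hello", "Ġworld"]))            # Hello world
print(detokenize_wordpiece_style(["play", "##ing", "tag"]))  # playing tag
```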


Why Tokenization Matters

Semantic Quality

| Input | Tokens | Splits | Can model learn? |
|---|---|---|---|
| “unhappiness” | ["un", "happiness"] | Meaningful | ✅ Learns “un” = negation |
| “unhappiness” | ["unha", "ppin", "ess"] | Arbitrary | ❌ No meaningful units |

Multilingual Fairness

Tokenizers trained primarily on English are unfair to other languages:

| Language | Text | Tokens | Ratio |
|---|---|---|---|
| English | “hello” | 1 token | 1:1 |
| Chinese | “你好” | 2–3 tokens | 2–3× more |
| Arabic | “مرحبا” | 3–5 tokens | 3–5× more |

This means non-English text uses more of the context window for the same semantic content — a fundamental equity issue in multilingual models.

Token Budget

Modern models have fixed context windows measured in tokens, not words:

| Model | Context Window | English Words | Chinese Characters |
|---|---|---|---|
| GPT-4 | 8,192 tokens | ~6,000 words | ~3,000 characters |
| GPT-4-128K | 128,000 tokens | ~96,000 words | ~48,000 characters |

The same document might fit in English but overflow in Chinese — purely because of tokenization.
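For English, a widely used rule of thumb (an approximation only — actual counts depend on the tokenizer and the text) is about 3/4 of a word, or roughly 4 characters, per token:

```python
def words_to_tokens(num_words):
    """Rule-of-thumb English estimate: ~3/4 of a word per token (~4 chars/token)."""
    return round(num_words * 4 / 3)

# ~6,000 English words -> ~8,000 tokens, about one 8K context window
print(words_to_tokens(6000))
```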


Exercises

  1. Manual BPE: Train BPE on ["low", "lower", "lowest", "newer", "newest"] with 5 merges. What vocabulary do you get?

    Hint: Start by counting character pairs across all words. The most frequent pairs will be related to the common endings (-er, -est) and beginnings (low, new).

  2. Tokenize an unknown word: How would BERT’s WordPiece tokenize “cryptocurrency” if it’s not in the vocabulary?

    Answer: WordPiece would greedily match from left to right: it might produce something like ["cry", "##pt", "##oc", "##ur", "##ren", "##cy"] depending on which subwords are in its 30K vocabulary. The key insight is that it never returns [UNK] for a word that can be decomposed into known character sequences.

  3. Token count estimation: Why does the sentence “I love natural language processing” likely produce more tokens than 5 (the word count)?

    Answer: While common words like “I”, “love”, “natural” are likely single tokens, longer/rarer words may be split: “processing” might become ["process", "##ing"]. The exact count depends on the tokenizer’s vocabulary.

Next Steps

With embeddings and tokenization covered, we now have the complete input pipeline: text → tokens → embeddings → model. Next we move to the attention mechanism that revolutionized NLP: Seq2Seq Attention →.