End-to-End Walkthrough
You know the individual components. Now let’s watch them work together.
Below is a simplified decoder-only transformer (like GPT). Type a prompt and expand each stage to see the actual numbers flowing through the model — from raw text to next-token prediction.
Interactive: Follow the Data
Try it yourself:
- Expand each stage (click the headers) to see intermediate values
- Adjust temperature — low (0.1) = confident, high (2.0) = creative
- Sample tokens to extend the text one word at a time
- Change the prompt to see how different inputs produce different predictions
What Just Happened? Stage by Stage
Stage 1: Tokenization
The raw text is split into tokens — words or subwords. Each token is mapped to an integer ID from the vocabulary.
| Step | Example |
|---|---|
| Raw text | "The cat sat on" |
| Tokens | ["the", "cat", "sat", "on"] |
| Token IDs | [0, 1, 2, 3] |
In real models, this uses BPE or WordPiece tokenization (see Module 6). Our toy model uses whole words.
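The whole-word scheme can be sketched in a few lines. The vocabulary and mapping below are made up for illustration; a real tokenizer would use a learned BPE or WordPiece vocabulary instead:

```python
# Toy whole-word tokenizer: lowercase, split on whitespace, look up each word.
# The vocabulary here is a hypothetical stand-in for a learned one.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def tokenize(text):
    """Map each whitespace-separated word to its integer ID."""
    return [vocab[word] for word in text.lower().split()]

print(tokenize("The cat sat on"))  # [0, 1, 2, 3]
```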
Stage 2: Token Embedding
Each integer ID is looked up in an embedding table — a matrix of learned vectors. This converts sparse IDs into dense, meaningful representations.
At this point, “cat” and “sat” have fixed vectors — no context yet!
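The lookup itself is just row indexing into a matrix. A minimal NumPy sketch, using the toy sizes from the comparison table below and a random matrix in place of learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 32, 8                          # toy-model sizes
embedding = rng.normal(size=(vocab_size, d_model))   # learned in a real model

token_ids = [0, 1, 2, 3]        # "the cat sat on"
x = embedding[token_ids]        # (4, 8): one dense vector per token
print(x.shape)
```

Note that the same token ID always retrieves the same row, wherever it appears in the sequence.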
Stage 3: Positional Encoding
Self-attention is permutation-invariant — it can’t tell if “cat” came first or last. Adding sinusoidal positional encodings lets the model know token order.
Each position gets a unique signature of sine and cosine waves at different frequencies (see Module 10).
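A minimal sketch of the sinusoidal scheme, assuming the standard formulation where even dimensions get sines and odd dimensions get cosines at geometrically spaced frequencies:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position signatures: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# The encodings are simply added to the token embeddings: x = x + pe
```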
Stage 4: Self-Attention
This is where the magic happens. Each token queries all previous tokens to gather relevant context:
| Step | Operation | Result |
|---|---|---|
| 1 | Project each token into Q, K, V | Three vectors per token |
| 2 | Score: Q·Kᵀ / √d_k | Similarity between all pairs |
| 3 | Mask: block future tokens | Causal constraint for generation |
| 4 | Softmax: normalize scores | Attention weights (sum to 1) |
| 5 | Aggregate: weighted sum of V | Context-enriched representations |
After this stage, each token’s representation is enriched with context from the tokens that came before it. “on” now knows it follows “sat”, which follows “cat”.
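The five steps above can be sketched in NumPy for a single head with toy dimensions. The weight matrices are random stand-ins for learned parameters:

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a (seq_len, d_model) input."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                  # step 1: project
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # step 2: scaled scores
    mask = np.triu(np.ones_like(scores), k=1)         # step 3: block the future
    scores = np.where(mask == 1, -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # step 4: softmax rows
    return w @ V                                      # step 5: weighted sum of V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                           # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
```

The causal mask is what makes this usable for generation: changing a later token cannot affect any earlier token's output.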
Stage 5: Residual + LayerNorm + FFN
Two critical operations wrap around each sub-layer:
- Residual connection: Add the sub-layer output back to its input (output = x + Sublayer(x)). This keeps gradients flowing through deep networks.
- Layer normalization: Normalize the vector to have mean ≈ 0 and std ≈ 1. This prevents activations from exploding.
The Feed-Forward Network then processes each token independently: expand to 4× width, apply ReLU, contract back.
In a real model, there would be many blocks stacked (GPT-2 has 12, GPT-3 has 96). Each block refines the representations further.
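One block can be sketched as follows. This uses a pre-norm arrangement and omits the learned scale/shift of layer norm for brevity; the weights are random stand-ins:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to mean ~0, std ~1 (learned gain/bias omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to 4x width, ReLU, contract back."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 4 * 8
x = rng.normal(size=(4, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual connection wraps the sub-layer: x + FFN(LayerNorm(x))
out = x + ffn(layer_norm(x), W1, b1, W2, b2)
```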
Stage 6: Prediction
The final hidden state of the last token is projected to vocabulary size, then softmax converts the resulting logits z into probabilities: p_i = exp(z_i / T) / Σ_j exp(z_j / T), where T is the temperature.
The temperature controls randomness:
| Temperature | Behavior | Use Case |
|---|---|---|
| 0.1 | Very confident — picks the most likely token | Factual answers |
| 1.0 | Standard — samples from the learned distribution | General use |
| 2.0 | Creative — flattens distribution, more surprising | Brainstorming |
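Temperature is applied by dividing the logits before the softmax. A minimal sketch with made-up logits over a four-token vocabulary:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Divide logits by T, softmax, then sample a token ID."""
    z = logits / temperature
    p = np.exp(z - z.max())   # subtract max for numerical stability
    p /= p.sum()
    return rng.choice(len(p), p=p)

logits = np.array([2.0, 1.0, 0.5, -1.0])   # hypothetical logits
rng = np.random.default_rng(0)

# Low T sharpens the distribution toward the argmax; high T flattens it.
print(sample_with_temperature(logits, 0.1, rng))
print(sample_with_temperature(logits, 2.0, rng))
```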
Real vs. Toy Model
Our visualization uses a tiny model for clarity. Here’s how it compares to real transformers:
| Parameter | Toy Model | GPT-2 Small | GPT-3 | LLaMA 70B |
|---|---|---|---|---|
| d_model | 8 | 768 | 12,288 | 8,192 |
| Layers | 1 | 12 | 96 | 80 |
| Heads | 2 | 12 | 96 | 64 |
| d_ff | 16 | 3,072 | 49,152 | 28,672 |
| Vocab size | 32 | 50,257 | 50,257 | 32,000 |
| Parameters | ~5K | 117M | 175B | 70B |
The architecture is identical — only the numbers change. Scale is what separates a toy demo from ChatGPT.
The Problem: Redundant Computation
In autoregressive generation, we run the full forward pass for every new token. But most of the computation is redundant — the attention K and V for previous tokens don’t change!
The Solution: KV Cache
Cache the K and V projections from previous steps. For each new token, only compute its new Q, K, V — then use the full cached K and V for attention.
This turns O(n²) per token into O(n), making generation much faster.
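The caching pattern can be sketched for a single head. Each decode step projects only the new token and appends its K and V to the cache; the causal constraint is implicit because the cache holds only past and current positions. Weights are random stand-ins:

```python
import numpy as np

def attend_with_cache(x_new, cache, Wq, Wk, Wv):
    """Decode one token: project only the new token, reuse cached K and V."""
    q = x_new @ Wq                                     # new query only
    cache["K"] = np.vstack([cache["K"], x_new @ Wk])   # append new key
    cache["V"] = np.vstack([cache["V"], x_new @ Wv])   # append new value
    scores = q @ cache["K"].T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ cache["V"]

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}

for _ in range(4):   # each step does O(n) work instead of O(n^2)
    out = attend_with_cache(rng.normal(size=(1, d)), cache, Wq, Wk, Wv)
```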
Memory Trade-off
The cache grows with sequence length:
- GPT-3 with 2048 tokens: ~3.2GB of KV cache per request
- This is why long-context models need more GPU memory
Next: Training & Inference
Now that you’ve seen a forward pass, learn how the model is trained and how inference works in practice. →