End-to-End Walkthrough

You know the individual components. Now let’s watch them work together.

Below is a simplified decoder-only transformer (like GPT). Type a prompt and expand each stage to see the actual numbers flowing through the model — from raw text to next-token prediction.

Interactive: Follow the Data

Split the input text into tokens and map each to an integer ID.

"the" ID: 0
"cat" ID: 1
"sat" ID: 2
"on" ID: 3

Try it yourself:

  1. Expand each stage (click the headers) to see intermediate values
  2. Adjust temperature — low (0.1) = confident, high (2.0) = creative
  3. Sample tokens to extend the text one word at a time
  4. Change the prompt to see how different inputs produce different predictions

What Just Happened? Stage by Stage

Stage 1: Tokenization

The raw text is split into tokens — words or subwords. Each token is mapped to an integer ID from the vocabulary.

| Step | Example |
| --- | --- |
| Raw text | "The cat sat on" |
| Tokens | ["the", "cat", "sat", "on"] |
| Token IDs | [0, 1, 2, 3] |

In real models, this uses BPE or WordPiece tokenization (see Module 6). Our toy model uses whole words.
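The whole-word scheme above can be sketched in a few lines. This is a minimal, hypothetical tokenizer matching the toy vocabulary (the `vocab` dict mirrors the IDs shown earlier), not a real BPE implementation:

```python
# Hypothetical whole-word tokenizer for the toy model's four-word vocabulary.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3}

def tokenize(text):
    """Lowercase, split on whitespace, and map each word to its integer ID."""
    return [vocab[word] for word in text.lower().split()]

print(tokenize("The cat sat on"))  # [0, 1, 2, 3]
```

A real tokenizer would also handle unknown words and subword merges; here any out-of-vocabulary word simply raises a `KeyError`.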

Stage 2: Token Embedding

Each integer ID is looked up in an embedding table — a matrix of learned vectors. This converts sparse IDs into dense, meaningful representations.

At this point, “cat” and “sat” have fixed vectors — no context yet!
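The lookup itself is just row indexing into a matrix. A minimal NumPy sketch, using the toy dimensions (vocab size 4, `d_model` 8) and random weights in place of learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 4, 8                 # toy dimensions
embedding_table = rng.normal(size=(vocab_size, d_model))  # learned in practice

token_ids = [0, 1, 2, 3]                   # "the cat sat on"
x = embedding_table[token_ids]             # row lookup: one dense vector per token
print(x.shape)                             # (4, 8)
```

Note that the same ID always retrieves the same row: at this stage the vector for "cat" is identical wherever the word appears.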

Stage 3: Positional Encoding

Self-attention is permutation-invariant — it can’t tell if “cat” came first or last. Adding sinusoidal positional encodings lets the model know token order.

x = \text{embedding}(\text{token}) + \text{PE}(\text{position})

Each position gets a unique signature of sine and cosine waves at different frequencies (see Module 10).
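The sinusoidal scheme can be sketched directly from its definition: even dimensions get sines, odd dimensions get cosines, at geometrically decreasing frequencies. A minimal NumPy version (the `10000` base follows the standard formulation):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal PE: pe[pos, 2i] = sin(pos / 10000^(2i/d)), pe[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = positional_encoding(4, 8)
# x = embedding_output + pe  # added element-wise before the first block
```

Position 0 always encodes as [0, 1, 0, 1, ...], and no two positions share the same pattern.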

Stage 4: Self-Attention

This is where the magic happens. Each token queries all previous tokens to gather relevant context:

| Step | Operation | Result |
| --- | --- | --- |
| 1 | Project each token into Q, K, V | Three vectors per token |
| 2 | Score: $QK^T / \sqrt{d_k}$ | Similarity between all pairs |
| 3 | Mask: block future tokens | Causal constraint for generation |
| 4 | Softmax: normalize scores | Attention weights (sum to 1) |
| 5 | Aggregate: weighted sum of V | Context-enriched representations |

After this stage, each token’s representation is enriched with context from the tokens that came before it. “on” now knows it follows “sat”, which follows “cat”.
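The five steps in the table map directly onto a few lines of NumPy. This is a single-head sketch with random weights in place of learned projections:

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal attention over a (seq_len, d_model) input."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                   # 1: project to Q, K, V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # 2: scaled pairwise scores
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                             # 3: block future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # 4: softmax rows sum to 1
    return weights @ V                                 # 5: weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                            # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because of the causal mask, the first token can only attend to itself, so its output is exactly its own value vector.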

Stage 5: Residual + LayerNorm + FFN

Two critical operations wrap around each sub-layer:

  1. Residual connection: Add the sub-layer output back to its input ($x + \text{sublayer}(x)$). This ensures gradient flow through deep networks.
  2. Layer normalization: Normalize the vector to have mean ≈ 0 and std ≈ 1. This prevents activations from exploding.

The Feed-Forward Network then processes each token independently: expand to 4× width, apply ReLU, contract back.

In a real model, there would be many blocks stacked (GPT-2 has 12, GPT-3 has 96). Each block refines the representations further.
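The residual-plus-normalize pattern around the FFN can be sketched as follows. This assumes a post-norm arrangement (normalize after the residual add) and omits LayerNorm's learned scale and shift for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to mean ~0, std ~1 (learned gain/bias omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to 4x width, ReLU, contract back."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                     # d_ff = 4 * d_model
x = rng.normal(size=(4, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

y = layer_norm(x + ffn(x, W1, b1, W2, b2))  # residual add, then normalize
print(y.shape)  # (4, 8)
```

Many modern models instead use pre-norm (normalize the input to the sub-layer), but the residual add is the same either way.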

Stage 6: Prediction

The final hidden state of the last token is projected to vocabulary size, then softmax converts to probabilities:

P(\text{next token}) = \text{softmax}(\text{hidden} \cdot W_{\text{output}} / T)

The temperature $T$ controls randomness:

| Temperature | Behavior | Use Case |
| --- | --- | --- |
| 0.1 | Very confident: picks the most likely token | Factual answers |
| 1.0 | Standard: samples from the learned distribution | General use |
| 2.0 | Creative: flattens the distribution, more surprising | Brainstorming |
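The effect of dividing logits by $T$ before the softmax is easy to see numerically. A minimal sketch with a random hidden state and output projection:

```python
import numpy as np

def next_token_probs(hidden, W_out, temperature=1.0):
    """Project the final hidden state to vocab logits and apply temperature softmax."""
    logits = hidden @ W_out / temperature
    exp = np.exp(logits - logits.max())       # subtract max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
hidden = rng.normal(size=8)                   # final hidden state of the last token
W_out = rng.normal(size=(8, 4))               # project to toy vocab size 4

for T in (0.1, 1.0, 2.0):
    p = next_token_probs(hidden, W_out, T)
    print(T, p.round(3))                      # low T -> peaked, high T -> flat
```

At T = 0.1 nearly all probability mass sits on one token; at T = 2.0 the distribution is visibly flatter.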
🤔 Quick Check
After generating one token, what happens next in autoregressive generation?

Real vs. Toy Model

Our visualization uses a tiny model for clarity. Here’s how it compares to real transformers:

| Parameter | Toy Model | GPT-2 Small | GPT-3 | LLaMA 70B |
| --- | --- | --- | --- | --- |
| d_model | 8 | 768 | 12,288 | 8,192 |
| Layers | 1 | 12 | 96 | 80 |
| Heads | 2 | 12 | 96 | 64 |
| d_ff | 16 | 3,072 | 49,152 | 28,672 |
| Vocab size | 32 | 50,257 | 50,257 | 32,000 |
| Parameters | ~5K | 117M | 175B | 70B |

The architecture is identical — only the numbers change. Scale is what separates a toy demo from ChatGPT.

The Problem

In autoregressive generation, we run the full forward pass for every new token. But most of the computation is redundant — the attention K and V for previous tokens don’t change!

The Solution: KV Cache

Cache the K and V projections from previous steps. For each new token, only compute its new Q, K, V — then use the full cached K and V for attention.

This turns O(n²) per token into O(n), making generation much faster.
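The caching idea can be sketched as an append-only store of K and V rows: each generation step computes Q, K, V only for the newest token and attends against everything cached so far. A minimal single-head sketch (class and names are illustrative, not a real library API):

```python
import numpy as np

class KVCache:
    """Append-only cache of one K and V row per generated token (single head)."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, x_new, Wq, Wk, Wv):
        """Attend the newest token against all cached keys/values: O(n) per step."""
        q = x_new @ Wq
        self.K.append(x_new @ Wk)         # only the new token's K/V are computed
        self.V.append(x_new @ Wv)
        K, V = np.stack(self.K), np.stack(self.V)
        scores = K @ q / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max())
        w /= w.sum()                      # softmax over all cached positions
        return w @ V                      # context vector for the new token

rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
cache = KVCache()
for _ in range(4):                        # simulate four generation steps
    out = cache.step(rng.normal(size=8), Wq, Wk, Wv)
print(len(cache.K), out.shape)            # cache holds one K/V row per step
```

Without the cache, step n would recompute K and V for all n tokens; with it, only one new row per step.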

Memory Trade-off

The cache grows with sequence length:

  • GPT-3 with 2048 tokens: ~3.2GB of KV cache per request
  • This is why long-context models need more GPU memory

Next: Training & Inference

Now that you’ve seen a forward pass, learn how the model is trained and how inference works in practice. →