End-to-End Walkthrough
You know the individual components. Now let’s watch them work together.
Below is a simplified decoder-only transformer (like GPT). Type a prompt and expand each stage to see the actual numbers flowing through the model — from raw text to next-token prediction.
Interactive: Follow the Data
Try it yourself:
- Expand each stage (click the headers) to see intermediate values
- Adjust temperature — low (0.1) = confident, high (2.0) = creative
- Sample tokens to extend the text one word at a time
- Change the prompt to see how different inputs produce different predictions
What Just Happened? Stage by Stage
Stage 1: Tokenization
The raw text is split into tokens — words or subwords. Each token is mapped to an integer ID from the vocabulary.
| Step | Example |
|---|---|
| Raw text | "The cat sat on" |
| Tokens | ["the", "cat", "sat", "on"] |
| Token IDs | [0, 1, 2, 3] |
In real models, this uses BPE or WordPiece tokenization (see Module 6). Our toy model uses whole words.
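The whole-word scheme can be sketched in a few lines. The vocabulary and mapping below are made up for illustration; a real tokenizer would use a learned BPE or WordPiece vocabulary instead:

```python
# Toy whole-word tokenizer: lowercase, split on whitespace, look up each word.
# The vocabulary here is a hypothetical stand-in for a learned one.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def tokenize(text):
    """Map each whitespace-separated word to its integer ID."""
    return [vocab[word] for word in text.lower().split()]

print(tokenize("The cat sat on"))  # [0, 1, 2, 3]
```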
Stage 2: Token Embedding
Each integer ID is looked up in an embedding table — a matrix of learned vectors. This converts sparse IDs into dense, meaningful representations.
At this point, “cat” and “sat” have fixed vectors — no context yet!
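The lookup itself is just row indexing into a matrix. A minimal NumPy sketch, using the toy sizes from the comparison table below and a random matrix in place of learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 32, 8                          # toy-model sizes
embedding = rng.normal(size=(vocab_size, d_model))   # learned in a real model

token_ids = [0, 1, 2, 3]        # "the cat sat on"
x = embedding[token_ids]        # (4, 8): one dense vector per token
print(x.shape)
```

Note that the same token ID always retrieves the same row, wherever it appears in the sequence.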
Stage 3: Positional Encoding
Self-attention is permutation-invariant — it can’t tell if “cat” came first or last. Adding sinusoidal positional encodings lets the model know token order.
Each position gets a unique signature of sine and cosine waves at different frequencies (see Module 10).
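A minimal sketch of the sinusoidal scheme, assuming the standard formulation where even dimensions get sines and odd dimensions get cosines at geometrically spaced frequencies:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position signatures: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# The encodings are simply added to the token embeddings: x = x + pe
```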
Stage 4: Self-Attention
This is where the magic happens. Each token queries all previous tokens to gather relevant context:
| Step | Operation | Result |
|---|---|---|
| 1 | Project each token into Q, K, V | Three vectors per token |
| 2 | Score: Q·Kᵀ / √d_k | Similarity between all pairs |
| 3 | Mask: block future tokens | Causal constraint for generation |
| 4 | Softmax: normalize scores | Attention weights (sum to 1) |
| 5 | Aggregate: weighted sum of V | Context-enriched representations |
After this stage, each token’s representation is enriched with context from the tokens that came before it. “on” now knows it follows “sat”, which follows “cat”.
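The five steps above can be sketched in NumPy for a single head with toy dimensions. The weight matrices are random stand-ins for learned parameters:

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a (seq_len, d_model) input."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                  # step 1: project
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # step 2: scaled scores
    mask = np.triu(np.ones_like(scores), k=1)         # step 3: block the future
    scores = np.where(mask == 1, -1e9, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                # step 4: softmax rows
    return w @ V                                      # step 5: weighted sum of V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                           # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
```

The causal mask is what makes this usable for generation: changing a later token cannot affect any earlier token's output.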
Stage 5: Residual + LayerNorm + FFN
Two critical operations wrap around each sub-layer:
- Residual connection: Add the sub-layer output back to its input (output = x + Sublayer(x)). This keeps gradients flowing through deep networks.
- Layer normalization: Normalize the vector to have mean ≈ 0 and std ≈ 1. This prevents activations from exploding.
The Feed-Forward Network then processes each token independently: expand to 4× width, apply ReLU, contract back.
In a real model, there would be many blocks stacked (GPT-2 has 12, GPT-3 has 96). Each block refines the representations further.
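One block can be sketched as follows. This uses a pre-norm arrangement and omits the learned scale/shift of layer norm for brevity; the weights are random stand-ins:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to mean ~0, std ~1 (learned gain/bias omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to 4x width, ReLU, contract back."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 4 * 8
x = rng.normal(size=(4, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual connection wraps the sub-layer: x + FFN(LayerNorm(x))
out = x + ffn(layer_norm(x), W1, b1, W2, b2)
```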
Stage 6: Prediction
The final hidden state of the last token is projected to vocabulary size, then softmax converts the resulting logits z into probabilities: p_i = exp(z_i / T) / Σ_j exp(z_j / T), where T is the temperature.
The temperature controls randomness:
| Temperature | Behavior | Use Case |
|---|---|---|
| 0.1 | Very confident — picks the most likely token | Factual answers |
| 1.0 | Standard — samples from the learned distribution | General use |
| 2.0 | Creative — flattens distribution, more surprising | Brainstorming |
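Temperature is applied by dividing the logits before the softmax. A minimal sketch with made-up logits over a four-token vocabulary:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Divide logits by T, softmax, then sample a token ID."""
    z = logits / temperature
    p = np.exp(z - z.max())   # subtract max for numerical stability
    p /= p.sum()
    return rng.choice(len(p), p=p)

logits = np.array([2.0, 1.0, 0.5, -1.0])   # hypothetical logits
rng = np.random.default_rng(0)

# Low T sharpens the distribution toward the argmax; high T flattens it.
print(sample_with_temperature(logits, 0.1, rng))
print(sample_with_temperature(logits, 2.0, rng))
```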
Real vs. Toy Model
Our visualization uses a tiny model for clarity. Here’s how it compares to real transformers:
| Parameter | Toy Model | GPT-2 Small | GPT-3 | LLaMA 70B |
|---|---|---|---|---|
| d_model | 8 | 768 | 12,288 | 8,192 |
| Layers | 1 | 12 | 96 | 80 |
| Heads | 2 | 12 | 96 | 64 |
| d_ff | 16 | 3,072 | 49,152 | 28,672 |
| Vocab size | 32 | 50,257 | 50,257 | 32,000 |
| Parameters | ~5K | 117M | 175B | 70B |
The architecture is identical — only the numbers change. Scale is what separates a toy demo from ChatGPT.
The Problem: Redundant Computation
In autoregressive generation, we run the full forward pass for every new token. But most of the computation is redundant — the attention K and V for previous tokens don’t change!
The Solution: KV Cache
Cache the K and V projections from previous steps. For each new token, only compute its new Q, K, V — then use the full cached K and V for attention.
This turns O(n²) per token into O(n), making generation much faster.
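The caching pattern can be sketched for a single head. Each decode step projects only the new token and appends its K and V to the cache; the causal constraint is implicit because the cache holds only past and current positions. Weights are random stand-ins:

```python
import numpy as np

def attend_with_cache(x_new, cache, Wq, Wk, Wv):
    """Decode one token: project only the new token, reuse cached K and V."""
    q = x_new @ Wq                                     # new query only
    cache["K"] = np.vstack([cache["K"], x_new @ Wk])   # append new key
    cache["V"] = np.vstack([cache["V"], x_new @ Wv])   # append new value
    scores = q @ cache["K"].T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ cache["V"]

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}

for _ in range(4):   # each step does O(n) work instead of O(n^2)
    out = attend_with_cache(rng.normal(size=(1, d)), cache, Wq, Wk, Wv)
```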
Memory Trade-off
The cache grows with sequence length:
- GPT-3 with 2048 tokens: ~3.2GB of KV cache per request
- This is why long-context models need more GPU memory
Next: Training & Inference
Now that you’ve seen a forward pass, learn how the model is trained and how inference works in practice. →