Architecture Overview
You’ve now learned every individual component of a transformer:
| Component | What It Does | Module |
|---|---|---|
| Embeddings | Convert token IDs to dense vectors | 05 |
| Positional Encoding | Add position information | 10 |
| Self-Attention | Let tokens communicate | 08 |
| Multi-Head Attention | Multiple attention perspectives in parallel | 09 |
| Layer Norm + Residuals | Keep training stable | 11 |
There’s one piece we haven’t covered yet — the Feed-Forward Network — and then we’ll assemble everything.
The Missing Piece: Feed-Forward Network (FFN)
After self-attention lets tokens gather information from each other, each token needs to process that information independently. That’s the FFN’s job.
What It Does
The FFN applies two linear transformations with a non-linearity in between:

FFN(x) = max(0, x·W1 + b1)·W2 + b2
Think of it this way:
- Attention = “gather information from other tokens” (communication)
- FFN = “think about what you’ve gathered” (computation)
The Bottleneck Architecture
The FFN expands the dimension, then compresses it back:
| Stage | Shape | Example (d_model=512, d_ff=2048) |
|---|---|---|
| Input | (seq_len, d_model) | (10, 512) |
| Expand + ReLU | (seq_len, d_ff) | (10, 2048) ← 4× wider! |
| Contract | (seq_len, d_model) | (10, 512) |
Why expand then contract? The wider hidden layer lets the network learn more complex transformations. It’s like brainstorming widely then filtering down to the best ideas.
A Surprising Fact
The FFN often has more parameters than the attention layers!
| Component | Parameters | For d_model=512, d_ff=2048 |
|---|---|---|
| Multi-Head Attention | 4 × d_model² | 4 × 512² = 1.05M |
| Feed-Forward Network | 2 × d_model × d_ff | 2 × 512 × 2048 = 2.10M |
The FFN uses roughly twice the parameters of attention in each block.
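These counts are easy to verify (the factor of 4 covers the Q, K, V, and output projections; biases are ignored):

```python
d_model, d_ff = 512, 2048

# Multi-head attention: Q, K, V, and output projections, each d_model x d_model
attn_params = 4 * d_model**2

# Feed-forward network: expand (d_model x d_ff) plus contract (d_ff x d_model)
ffn_params = 2 * d_model * d_ff

print(f"Attention: {attn_params:,}")  # 1,048,576
print(f"FFN:       {ffn_params:,}")   # 2,097,152
print(f"Ratio:     {ffn_params / attn_params:.1f}x")  # 2.0x
```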
```python
import numpy as np

class FeedForward:
    def __init__(self, d_model, d_ff):
        self.W1 = np.random.randn(d_model, d_ff) * 0.01
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * 0.01
        self.b2 = np.zeros(d_model)

    def forward(self, x):
        # Expand → ReLU → Contract
        hidden = np.maximum(0, x @ self.W1 + self.b1)  # ReLU
        return hidden @ self.W2 + self.b2
```

SwiGLU (Used in LLaMA, Gemma)
Modern transformers replace ReLU with SwiGLU — a gated activation:
SwiGLU(x) = (Swish(x·W) ⊙ x·V)·W2, where Swish(z) = z · σ(z) and ⊙ is element-wise multiplication.
This adds a “gate” (the V projection) that controls which features pass through. It consistently outperforms ReLU with only a small parameter increase.
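A minimal NumPy sketch of a SwiGLU feed-forward layer (the weight names W, V, W2 and the reduced d_ff are illustrative; implementations typically shrink d_ff to about 2/3 of the ReLU version's so total parameters stay comparable despite the extra projection):

```python
import numpy as np

def swish(z):
    # Swish (also called SiLU): z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W, V, W2):
    # Gated feed-forward: the x·V branch gates the Swish(x·W) branch element-wise
    return (swish(x @ W) * (x @ V)) @ W2

# Example shapes: d_model=512, d_ff=1365 (~2/3 of 2048)
d_model, d_ff = 512, 1365
rng = np.random.default_rng(0)
x = rng.standard_normal((10, d_model))
W = rng.standard_normal((d_model, d_ff)) * 0.01
V = rng.standard_normal((d_model, d_ff)) * 0.01
W2 = rng.standard_normal((d_ff, d_model)) * 0.01

out = swiglu_ffn(x, W, V, W2)
print(out.shape)  # (10, 512)
```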
GeGLU (Used in some models)
Same idea but with GELU activation instead of Swish.
Anatomy of a Transformer Block
Every transformer block follows the same pattern — two sub-layers, each wrapped with a residual connection and layer normalization:

x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
This is the Pre-LN variant (used by GPT-2, GPT-3, LLaMA). The original 2017 paper used Post-LN (normalize after the residual add), but Pre-LN trains more stably.
```python
class TransformerBlock:
    def __init__(self, d_model, n_heads, d_ff):
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.ln1 = LayerNorm(d_model)
        self.ln2 = LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Sub-layer 1: Attention (Pre-LN: normalize, then add residual)
        x = x + self.attn(self.ln1(x), mask)
        # Sub-layer 2: FFN
        x = x + self.ffn(self.ln2(x))
        return x
```

Just a few lines of code for the entire block! The simplicity is what makes transformers so elegant.
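One reason stacking works so cleanly: each block is a pure residual update, so the (seq_len, d_model) shape is preserved through any depth. A minimal self-contained demonstration (with a dummy sub-layer standing in for attention and the FFN):

```python
import numpy as np

def dummy_block(x, W):
    # Stand-in for a transformer block: the residual add keeps the shape fixed
    return x + np.tanh(x @ W)

d_model, n_layers = 8, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((10, d_model))
weights = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_layers)]

for W in weights:  # stacking blocks = repeated residual updates
    x = dummy_block(x, W)
print(x.shape)  # (10, 8)
```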
The Three Transformer Variants
The same building blocks assemble into three different architectures:
Encoder-Decoder (Original, T5, BART)
The original “Attention Is All You Need” design:
- Encoder: Processes the full input with bidirectional self-attention
- Decoder: Generates output one token at a time with causal self-attention + cross-attention to encoder
- Best for: Translation, summarization — tasks where input and output are different sequences
Encoder-Only (BERT)
Just the encoder stack:
- Bidirectional attention: Every token sees every other token
- Can’t generate text directly (no autoregressive decoding)
- Best for: Classification, named entity recognition, sentence similarity
- Key insight: Understanding a sentence benefits from seeing the full context
Decoder-Only (GPT, LLaMA, Claude)
Just the decoder stack (no cross-attention):
- Causal attention: Each token can only see previous tokens
- Generates text autoregressively — one token at a time
- Best for: Text generation, chat, code completion — and it turns out, almost everything
- Why it won: Simpler architecture + scales better + can do few-shot learning
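The causal constraint is just a lower-triangular mask applied to the attention scores before the softmax. A minimal NumPy sketch:

```python
import numpy as np

# Causal mask: position i may attend only to positions <= i
seq_len = 5
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]

# Blocked positions get -inf before the softmax, so their weight becomes 0
scores = np.zeros((seq_len, seq_len))
scores[~mask] = -np.inf
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
```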
The Evolution of Transformers
| Year | Model | Architecture | Key Innovation |
|---|---|---|---|
| 2017 | Original Transformer | Encoder-Decoder | Self-attention replaces recurrence |
| 2018 | BERT | Encoder-only | Bidirectional pre-training (MLM) |
| 2018 | GPT-1 | Decoder-only | Autoregressive pre-training |
| 2019 | GPT-2 | Decoder-only | Larger scale, zero-shot learning |
| 2019 | T5 | Encoder-Decoder | "Text-to-text" framework |
| 2020 | GPT-3 | Decoder-only | 175B params, in-context learning |
| 2023 | LLaMA | Decoder-only | RoPE, RMSNorm, SwiGLU |
| 2023 | Mistral 7B | Decoder-only | Sliding window attention, GQA |
| 2024+ | GPT-4, Claude, Gemini | Decoder-only | RLHF, MoE, multimodal |
The trend is clear: decoder-only architectures won. Nearly all modern LLMs use this simpler design.
RMSNorm (Replaces LayerNorm)
Simpler normalization that skips the mean subtraction:

RMSNorm(x) = (x / RMS(x)) · γ, where RMS(x) = √(mean(x²) + ε)
Used by: LLaMA, Gemma. ~10% faster than LayerNorm.
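A NumPy sketch of RMSNorm, normalizing each row by its root-mean-square (no mean subtraction, no bias term, unlike LayerNorm):

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    # Scale each row by its root-mean-square; gamma is a learned per-feature scale
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

d_model = 512
x = np.random.default_rng(0).standard_normal((10, d_model))
out = rmsnorm(x, gamma=np.ones(d_model))
print(out.shape)  # (10, 512)
```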
Grouped Query Attention (GQA)
Instead of each head having its own K, V projections, multiple heads share K, V:
- Standard MHA: h heads × (Q, K, V) = 3h projections
- GQA: h query heads, g groups of shared K, V = h + 2g projections
This saves memory during inference with minimal quality loss.
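A back-of-the-envelope comparison (head and group counts here are illustrative, not any specific model's configuration):

```python
h, g = 32, 8            # 32 query heads sharing 8 K/V groups

mha_projs = 3 * h       # standard MHA: every head has its own Q, K, V
gqa_projs = h + 2 * g   # GQA: h query projections + g shared (K, V) pairs

print(mha_projs, gqa_projs)  # 96 48
# The KV cache shrinks by h/g (here 4x), since only g K/V heads are stored
```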
Sliding Window Attention
Instead of attending to all previous tokens, attend only to a local window. Mistral uses a window of 4096 tokens. Information still propagates globally across layers.
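A sliding-window mask is just the causal mask with an extra lower bound on how far back each token may look. A sketch with a toy window:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # Token i attends to tokens max(0, i - window + 1) .. i
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).astype(int))
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
```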
Mixture of Experts (MoE)
Replace the single FFN with multiple “expert” FFNs. A router picks which 2 experts handle each token. This lets you scale parameters without scaling computation proportionally. Used in Mixtral and (reportedly) GPT-4.
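A toy top-2 router in NumPy (the names, shapes, and linear "experts" are illustrative, not any production implementation):

```python
import numpy as np

def moe_forward(x, router_W, experts):
    # Router scores every expert per token; the top 2 process the token,
    # and their outputs are mixed by softmax weight over the 2 winners
    logits = x @ router_W                       # (seq_len, n_experts)
    top2 = np.argsort(logits, axis=-1)[:, -2:]  # indices of the 2 best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top2[t]]
        w = np.exp(chosen) / np.exp(chosen).sum()
        for weight, e in zip(w, top2[t]):
            out[t] += weight * experts[e](x[t])
    return out

# Toy setup: 4 experts, each a tiny linear stand-in for an FFN
rng = np.random.default_rng(0)
d, n_experts = 8, 4
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_experts)]
experts = [lambda v, W=W: v @ W for W in Ws]

x = rng.standard_normal((5, d))
out = moe_forward(x, rng.standard_normal((d, n_experts)), experts)
print(out.shape)  # (5, 8)
```

Each token still runs through only 2 of the 4 experts, which is why total parameters can grow without a proportional increase in per-token compute.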
Next: End-to-End Walkthrough
Now that you understand the architecture, let’s trace data through a complete transformer — from input text to predicted next token. →