Architecture Overview

You’ve now learned every individual component of a transformer:

| Component | What It Does | Module |
|---|---|---|
| Embeddings | Convert token IDs to dense vectors | 05 |
| Positional Encoding | Add position information | 10 |
| Self-Attention | Let tokens communicate | 08 |
| Multi-Head Attention | Multiple attention perspectives in parallel | 09 |
| Layer Norm + Residuals | Keep training stable | 11 |

There’s one piece we haven’t covered yet — the Feed-Forward Network — and then we’ll assemble everything.

The Missing Piece: Feed-Forward Network (FFN)

After self-attention lets tokens gather information from each other, each token needs to process that information independently. That’s the FFN’s job.

What It Does

The FFN applies two linear transformations with a non-linearity in between:

\text{FFN}(x) = \text{ReLU}(x W_1 + b_1) W_2 + b_2

Think of it this way:

  • Attention = “gather information from other tokens” (communication)
  • FFN = “think about what you’ve gathered” (computation)

The Bottleneck Architecture

The FFN expands the dimension, then compresses it back:

| Stage | Shape | Example (d_model = 512) |
|---|---|---|
| Input | (seq_len, d_model) | (10, 512) |
| Expand, then ReLU | (seq_len, d_ff) | (10, 2048) ← 4× wider! |
| Contract | (seq_len, d_model) | (10, 512) |

Why expand then contract? The wider hidden layer lets the network learn more complex transformations. It’s like brainstorming widely then filtering down to the best ideas.

A Surprising Fact

The FFN often has more parameters than the attention layers!

| Component | Parameters | For d_model = 512, d_ff = 2048 |
|---|---|---|
| Multi-Head Attention | 4 × d_model² | 4 × 512² ≈ 1.05M |
| Feed-Forward Network | 2 × d_model × d_ff | 2 × 512 × 2048 ≈ 2.10M |

The FFN uses roughly twice the parameters of attention in each block.
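The arithmetic behind the table is worth checking once by hand. A quick sketch, using the d_model and d_ff values from the table (bias terms are ignored for simplicity):

```python
# Parameter counts for one transformer block (weight matrices only, biases ignored).
d_model, d_ff = 512, 2048

attn_params = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O projections
ffn_params = 2 * d_model * d_ff       # W_1 (expand) and W_2 (contract)

print(attn_params)               # 1048576 (~1.05M)
print(ffn_params)                # 2097152 (~2.10M)
print(ffn_params / attn_params)  # 2.0
```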

🤔 Quick Check
Why does the FFN expand the dimension to 4× before contracting back?
import numpy as np

class FeedForward:
    def __init__(self, d_model, d_ff):
        # Expand projection (d_model → d_ff) and contract projection (d_ff → d_model)
        self.W1 = np.random.randn(d_model, d_ff) * 0.01
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * 0.01
        self.b2 = np.zeros(d_model)

    def forward(self, x):
        # Expand → ReLU → Contract
        hidden = np.maximum(0, x @ self.W1 + self.b1)  # ReLU
        return hidden @ self.W2 + self.b2

SwiGLU (Used in LLaMA, Gemma)

Modern transformers replace ReLU with SwiGLU — a gated activation:

\text{SwiGLU}(x) = (\text{Swish}(x W_1) \odot x V) W_2

Where \text{Swish}(x) = x \cdot \sigma(x) and \odot denotes element-wise multiplication.

This adds a “gate” (the V projection) that controls which features pass through. It consistently outperforms ReLU with only a small parameter increase.
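A minimal numpy sketch of the gated formulation, using the W1, V, W2 names from the formula above (real implementations typically fuse W1 and V into a single matrix for speed):

```python
import numpy as np

def swiglu(x, W1, V, W2):
    """SwiGLU feed-forward: gate = Swish(xW1), value = xV, output = (gate * value) W2."""
    swish = lambda z: z * (1.0 / (1.0 + np.exp(-z)))  # Swish(z) = z * sigmoid(z)
    return (swish(x @ W1) * (x @ V)) @ W2

# Tiny illustrative sizes
d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff)) * 0.01
V  = rng.standard_normal((d_model, d_ff)) * 0.01
W2 = rng.standard_normal((d_ff, d_model)) * 0.01

x = rng.standard_normal((10, d_model))
out = swiglu(x, W1, V, W2)
print(out.shape)  # (10, 8) — same in/out shape as a ReLU FFN
```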

GeGLU (Used in some models)

\text{GeGLU}(x) = (\text{GELU}(x W_1) \odot x V) W_2

Same idea but with GELU activation instead of Swish.


Anatomy of a Transformer Block

Every transformer block follows the same pattern — two sub-layers, each wrapped with a residual connection and layer normalization:

x' = x + \text{MultiHeadAttn}(\text{LN}(x))
\text{output} = x' + \text{FFN}(\text{LN}(x'))

This is the Pre-LN variant (used by GPT-2, GPT-3, LLaMA). The original 2017 paper used Post-LN (normalize after the residual add), but Pre-LN trains more stably.
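The difference between the two variants is just where the normalization sits. A sketch with stand-in callables (ln, attn, ffn are placeholders for the real sub-layers):

```python
def pre_ln_block(x, attn, ffn, ln1, ln2):
    x = x + attn(ln1(x))   # normalize *before* each sub-layer (GPT-2, GPT-3, LLaMA)
    x = x + ffn(ln2(x))
    return x

def post_ln_block(x, attn, ffn, ln1, ln2):
    x = ln1(x + attn(x))   # normalize *after* the residual add (original 2017 paper)
    x = ln2(x + ffn(x))
    return x
```

In Pre-LN, the residual path from input to output is never normalized away, which keeps gradients well-behaved in deep stacks.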

class TransformerBlock:
    def __init__(self, d_model, n_heads, d_ff):
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.ln1 = LayerNorm(d_model)
        self.ln2 = LayerNorm(d_model)
    
    def forward(self, x, mask=None):
        # Sub-layer 1: Attention (Pre-LN + residual)
        x = x + self.attn.forward(self.ln1.forward(x), mask)

        # Sub-layer 2: FFN (Pre-LN + residual)
        x = x + self.ffn.forward(self.ln2.forward(x))

        return x

The forward pass is just a few lines for the entire block! That simplicity is part of what makes transformers so elegant.
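Stacking blocks is equally simple: because every block maps (seq_len, d_model) to the same shape, a full model is just a loop. A self-contained toy sketch (simple_block stands in for a real attention + FFN block):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def simple_block(x, W):
    # Residual + toy sub-layer; shapes never change, so blocks compose freely
    return x + np.maximum(0, layer_norm(x) @ W)

rng = np.random.default_rng(0)
d_model, n_layers = 16, 6
weights = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_layers)]

x = rng.standard_normal((10, d_model))
for W in weights:   # "depth" is just iterating the same-shaped transformation
    x = simple_block(x, W)
print(x.shape)      # (10, 16) — unchanged after 6 layers
```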


The Three Transformer Variants

The same building blocks assemble into three different architectures. For reference, the original encoder-decoder design wires them together as follows:

  • Encoder (×N): input embedding + positional encoding → multi-head self-attention → add & norm → feed-forward → add & norm
  • Decoder (×N): output embedding (shifted right) + positional encoding → masked multi-head self-attention → add & norm → multi-head cross-attention over the encoder output → add & norm → feed-forward → add & norm, followed by a final linear layer + softmax producing the output probabilities

Encoder-Decoder (Original, T5, BART)

The original “Attention Is All You Need” design:

  • Encoder: Processes the full input with bidirectional self-attention
  • Decoder: Generates output one token at a time with causal self-attention + cross-attention to encoder
  • Best for: Translation, summarization — tasks where input and output are different sequences

Encoder-Only (BERT)

Just the encoder stack:

  • Bidirectional attention: Every token sees every other token
  • Can’t generate text directly (no autoregressive decoding)
  • Best for: Classification, named entity recognition, sentence similarity
  • Key insight: Understanding a sentence benefits from seeing the full context

Decoder-Only (GPT, LLaMA, Claude)

Just the decoder stack (no cross-attention):

  • Causal attention: Each token can only see previous tokens
  • Generates text autoregressively — one token at a time
  • Best for: Text generation, chat, code completion — and it turns out, almost everything
  • Why it won: Simpler architecture + scales better + can do few-shot learning
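The causal attention that defines decoder-only models is implemented with a simple lower-triangular mask. A minimal sketch:

```python
import numpy as np

# Causal mask: position i may attend only to positions <= i
seq_len = 5
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Applied to attention scores: disallowed positions become -inf before softmax,
# so they receive zero attention weight
scores = np.zeros((seq_len, seq_len))
masked_scores = np.where(mask, scores, -np.inf)

print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```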

🤔 Quick Check
ChatGPT, Claude, and LLaMA are all which type of transformer?

The Evolution of Transformers

| Year | Model | Architecture | Key Innovation |
|---|---|---|---|
| 2017 | Original Transformer | Encoder-Decoder | Self-attention replaces recurrence |
| 2018 | BERT | Encoder-only | Bidirectional pre-training (MLM) |
| 2018 | GPT-1 | Decoder-only | Autoregressive pre-training |
| 2019 | GPT-2 | Decoder-only | Larger scale, zero-shot learning |
| 2019 | T5 | Encoder-Decoder | "Text-to-text" framework |
| 2020 | GPT-3 | Decoder-only | 175B params, in-context learning |
| 2023 | LLaMA | Decoder-only | RoPE, RMSNorm, SwiGLU |
| 2023 | Mistral 7B | Decoder-only | Sliding window attention, GQA |
| 2024+ | GPT-4, Claude, Gemini | Decoder-only | RLHF, MoE, multimodal |

The trend is clear: decoder-only architectures won. Nearly all modern LLMs use this simpler design.

RMSNorm (Replaces LayerNorm)

Simpler normalization that skips the mean subtraction:

\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2}} \cdot \gamma

Used by: LLaMA, Gemma. ~10% faster than LayerNorm.
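A minimal numpy sketch of the formula (eps is the usual small constant for numerical stability; real implementations learn gamma per dimension):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: scale by root-mean-square over the last axis; no mean subtraction."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma

x = np.array([[3.0, -4.0]])           # RMS = sqrt((9 + 16) / 2) = sqrt(12.5)
out = rms_norm(x, gamma=np.ones(2))
print(out)                            # approximately [[0.8485, -1.1314]]
```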

Grouped Query Attention (GQA)

Instead of each head having its own K, V projections, multiple heads share K, V:

  • Standard MHA: h heads × (Q, K, V) = 3h projections
  • GQA: h query heads, g groups of shared K, V = h + 2g projections

This saves memory during inference with minimal quality loss.
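The projection-count arithmetic from the bullets above, with illustrative head counts (h = 32 query heads, g = 8 K/V groups are assumptions for the example, not a specific model's config):

```python
h = 32   # query heads
g = 8    # groups of shared K, V

mha_projections = 3 * h       # standard MHA: every head has its own Q, K, V
gqa_projections = h + 2 * g   # GQA: h query heads + g shared (K, V) pairs

print(mha_projections)  # 96
print(gqa_projections)  # 48 — half the projections, far less KV cache at inference
```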

Sliding Window Attention

Instead of attending to all previous tokens, attend only to a local window. Mistral uses a window of 4096 tokens. Information still propagates globally across layers.
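The mask for this is the causal mask restricted to a band. A toy sketch with a window of 3 (Mistral's real window is 4096):

```python
import numpy as np

# Sliding-window causal mask: token i attends to tokens in [i - window + 1, i]
seq_len, window = 6, 3
i = np.arange(seq_len)[:, None]
j = np.arange(seq_len)[None, :]
mask = (j <= i) & (j > i - window)

print(mask.astype(int))
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
```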

Mixture of Experts (MoE)

Replace the single FFN with multiple “expert” FFNs. A router picks which 2 experts handle each token. This lets you scale parameters without scaling computation proportionally. Used in Mixtral and (reportedly) GPT-4.
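A minimal sketch of top-2 routing. The expert FFNs are stood in by single matrices, and all sizes are illustrative; real MoE layers also add load-balancing losses and batched expert dispatch:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, seq_len = 16, 8, 4

router_W = rng.standard_normal((d_model, n_experts)) * 0.1
experts = [rng.standard_normal((d_model, d_model)) * 0.02
           for _ in range(n_experts)]            # stand-ins for full FFNs

x = rng.standard_normal((seq_len, d_model))
logits = x @ router_W                            # (seq_len, n_experts) router scores
top2 = np.argsort(logits, axis=-1)[:, -2:]       # indices of the 2 best experts per token

out = np.zeros_like(x)
for t in range(seq_len):
    chosen = logits[t, top2[t]]
    w = np.exp(chosen) / np.exp(chosen).sum()    # softmax over the 2 winners only
    for weight, e in zip(w, top2[t]):
        out[t] += weight * (x[t] @ experts[e])   # only 2 of 8 experts actually run

print(out.shape)  # (4, 16) — all parameters exist, but most sit idle per token
```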


Next: End-to-End Walkthrough

Now that you understand the architecture, let’s trace data through a complete transformer — from input text to predicted next token. →