Architecture Overview

You’ve now learned every individual component of a transformer:

| Component | What It Does | Module |
|---|---|---|
| Embeddings | Convert token IDs to dense vectors | 05 |
| Positional Encoding | Add position information | 10 |
| Self-Attention | Let tokens communicate | 08 |
| Multi-Head Attention | Multiple attention perspectives in parallel | 09 |
| Layer Norm + Residuals | Keep training stable | 11 |

There’s one piece we haven’t covered yet — the Feed-Forward Network — and then we’ll assemble everything.

The Missing Piece: Feed-Forward Network (FFN)

After self-attention lets tokens gather information from each other, each token needs to process that information independently. That’s the FFN’s job.

What It Does

The FFN applies two linear transformations with a non-linearity in between:

\text{FFN}(x) = \text{ReLU}(x W_1 + b_1) W_2 + b_2

Think of it this way:

  • Attention = “gather information from other tokens” (communication)
  • FFN = “think about what you’ve gathered” (computation)

The Bottleneck Architecture

The FFN expands the dimension, then compresses it back:

| Stage | Shape | Example (d_model = 512) |
|---|---|---|
| Input | (seq_len, d_model) | (10, 512) |
| Expand, then ReLU | (seq_len, d_ff) | (10, 2048) ← 4× wider! |
| Contract | (seq_len, d_model) | (10, 512) |

Why expand then contract? The wider hidden layer lets the network learn more complex transformations. It’s like brainstorming widely then filtering down to the best ideas.

A Surprising Fact

The FFN often has more parameters than the attention layers!

| Component | Parameters | For d_model = 512, d_ff = 2048 |
|---|---|---|
| Multi-Head Attention | 4 × d_model² | 4 × 512² ≈ 1.05M |
| Feed-Forward Network | 2 × d_model × d_ff | 2 × 512 × 2048 ≈ 2.10M |

The FFN uses roughly twice the parameters of attention in each block.
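The arithmetic behind the table is worth checking once by hand. A quick sketch, using the d_model and d_ff values from the table (bias terms are ignored for simplicity):

```python
# Parameter counts for one transformer block (weight matrices only, biases ignored).
d_model, d_ff = 512, 2048

attn_params = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O projections
ffn_params = 2 * d_model * d_ff       # W_1 (expand) and W_2 (contract)

print(attn_params)               # 1048576 (~1.05M)
print(ffn_params)                # 2097152 (~2.10M)
print(ffn_params / attn_params)  # 2.0
```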

🤔 Quick Check
Why does the FFN expand the dimension to 4× before contracting back?
import numpy as np

class FeedForward:
    def __init__(self, d_model, d_ff):
        # Expand projection (d_model → d_ff) and contract projection (d_ff → d_model)
        self.W1 = np.random.randn(d_model, d_ff) * 0.01
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * 0.01
        self.b2 = np.zeros(d_model)

    def forward(self, x):
        # Expand → ReLU → Contract
        hidden = np.maximum(0, x @ self.W1 + self.b1)  # ReLU
        return hidden @ self.W2 + self.b2

SwiGLU (Used in LLaMA, Gemma)

Modern transformers replace ReLU with SwiGLU — a gated activation:

\text{SwiGLU}(x) = (\text{Swish}(x W_1) \odot x V) W_2

Where \text{Swish}(x) = x \cdot \sigma(x) and \odot denotes element-wise multiplication.

This adds a “gate” (the V projection) that controls which features pass through. It consistently outperforms ReLU with only a small parameter increase.
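A minimal numpy sketch of the gated formulation, using the W1, V, W2 names from the formula above (real implementations typically fuse W1 and V into a single matrix for speed):

```python
import numpy as np

def swiglu(x, W1, V, W2):
    """SwiGLU feed-forward: gate = Swish(xW1), value = xV, output = (gate * value) W2."""
    swish = lambda z: z * (1.0 / (1.0 + np.exp(-z)))  # Swish(z) = z * sigmoid(z)
    return (swish(x @ W1) * (x @ V)) @ W2

# Tiny illustrative sizes
d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff)) * 0.01
V  = rng.standard_normal((d_model, d_ff)) * 0.01
W2 = rng.standard_normal((d_ff, d_model)) * 0.01

x = rng.standard_normal((10, d_model))
out = swiglu(x, W1, V, W2)
print(out.shape)  # (10, 8) — same in/out shape as a ReLU FFN
```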

GeGLU (Used in some models)

\text{GeGLU}(x) = (\text{GELU}(x W_1) \odot x V) W_2

Same idea but with GELU activation instead of Swish.


Anatomy of a Transformer Block

Every transformer block follows the same pattern — two sub-layers, each wrapped with a residual connection and layer normalization:

x' = x + \text{MultiHeadAttn}(\text{LN}(x))
\text{output} = x' + \text{FFN}(\text{LN}(x'))

This is the Pre-LN variant (used by GPT-2, GPT-3, LLaMA). The original 2017 paper used Post-LN (normalize after the residual add), but Pre-LN trains more stably.
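The difference between the two variants is just where the normalization sits. A sketch with stand-in callables (ln, attn, ffn are placeholders for the real sub-layers):

```python
def pre_ln_block(x, attn, ffn, ln1, ln2):
    x = x + attn(ln1(x))   # normalize *before* each sub-layer (GPT-2, GPT-3, LLaMA)
    x = x + ffn(ln2(x))
    return x

def post_ln_block(x, attn, ffn, ln1, ln2):
    x = ln1(x + attn(x))   # normalize *after* the residual add (original 2017 paper)
    x = ln2(x + ffn(x))
    return x
```

In Pre-LN, the residual path from input to output is never normalized away, which keeps gradients well-behaved in deep stacks.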

class TransformerBlock:
    def __init__(self, d_model, n_heads, d_ff):
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.ln1 = LayerNorm(d_model)
        self.ln2 = LayerNorm(d_model)
    
    def forward(self, x, mask=None):
        # Sub-layer 1: Attention (Pre-LN + residual)
        x = x + self.attn.forward(self.ln1.forward(x), mask)

        # Sub-layer 2: FFN (Pre-LN + residual)
        x = x + self.ffn.forward(self.ln2.forward(x))

        return x

The forward pass is just a few lines for the entire block! That simplicity is part of what makes transformers so elegant.
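Stacking blocks is equally simple: because every block maps (seq_len, d_model) to the same shape, a full model is just a loop. A self-contained toy sketch (simple_block stands in for a real attention + FFN block):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def simple_block(x, W):
    # Residual + toy sub-layer; shapes never change, so blocks compose freely
    return x + np.maximum(0, layer_norm(x) @ W)

rng = np.random.default_rng(0)
d_model, n_layers = 16, 6
weights = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_layers)]

x = rng.standard_normal((10, d_model))
for W in weights:   # "depth" is just iterating the same-shaped transformation
    x = simple_block(x, W)
print(x.shape)      # (10, 16) — unchanged after 6 layers
```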


The Three Transformer Variants

The same building blocks assemble into three different architectures. For reference, the original encoder-decoder design wires them together as follows:

  • Encoder (×N): input embedding + positional encoding → multi-head self-attention → add & norm → feed-forward → add & norm
  • Decoder (×N): output embedding (shifted right) + positional encoding → masked multi-head self-attention → add & norm → multi-head cross-attention over the encoder output → add & norm → feed-forward → add & norm, followed by a final linear layer + softmax producing the output probabilities

Encoder-Decoder (Original, T5, BART)

The original “Attention Is All You Need” design:

  • Encoder: Processes the full input with bidirectional self-attention
  • Decoder: Generates output one token at a time with causal self-attention + cross-attention to encoder
  • Best for: Translation, summarization — tasks where input and output are different sequences

Encoder-Only (BERT)

Just the encoder stack:

  • Bidirectional attention: Every token sees every other token
  • Can’t generate text directly (no autoregressive decoding)
  • Best for: Classification, named entity recognition, sentence similarity
  • Key insight: Understanding a sentence benefits from seeing the full context

Decoder-Only (GPT, LLaMA, Claude)

Just the decoder stack (no cross-attention):

  • Causal attention: Each token can only see previous tokens
  • Generates text autoregressively — one token at a time
  • Best for: Text generation, chat, code completion — and it turns out, almost everything
  • Why it won: Simpler architecture + scales better + can do few-shot learning
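The causal attention that defines decoder-only models is implemented with a simple lower-triangular mask. A minimal sketch:

```python
import numpy as np

# Causal mask: position i may attend only to positions <= i
seq_len = 5
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Applied to attention scores: disallowed positions become -inf before softmax,
# so they receive zero attention weight
scores = np.zeros((seq_len, seq_len))
masked_scores = np.where(mask, scores, -np.inf)

print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```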

🤔 Quick Check
ChatGPT, Claude, and LLaMA are all which type of transformer?

The Evolution of Transformers

| Year | Model | Architecture | Key Innovation |
|---|---|---|---|
| 2017 | Original Transformer | Encoder-Decoder | Self-attention replaces recurrence |
| 2018 | BERT | Encoder-only | Bidirectional pre-training (MLM) |
| 2018 | GPT-1 | Decoder-only | Autoregressive pre-training |
| 2019 | GPT-2 | Decoder-only | Larger scale, zero-shot learning |
| 2019 | T5 | Encoder-Decoder | "Text-to-text" framework |
| 2020 | GPT-3 | Decoder-only | 175B params, in-context learning |
| 2023 | LLaMA | Decoder-only | RoPE, RMSNorm, SwiGLU |
| 2023 | Mistral 7B | Decoder-only | Sliding window attention, GQA |
| 2024+ | GPT-4, Claude, Gemini | Decoder-only | RLHF, MoE, multimodal |

The trend is clear: decoder-only architectures won. Nearly all modern LLMs use this simpler design.

RMSNorm (Replaces LayerNorm)

Simpler normalization that skips the mean subtraction:

\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2}} \cdot \gamma

Used by: LLaMA, Gemma. ~10% faster than LayerNorm.
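A minimal numpy sketch of the formula (eps is the usual small constant for numerical stability; real implementations learn gamma per dimension):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: scale by root-mean-square over the last axis; no mean subtraction."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma

x = np.array([[3.0, -4.0]])           # RMS = sqrt((9 + 16) / 2) = sqrt(12.5)
out = rms_norm(x, gamma=np.ones(2))
print(out)                            # approximately [[0.8485, -1.1314]]
```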

Grouped Query Attention (GQA)

Instead of each head having its own K, V projections, multiple heads share K, V:

  • Standard MHA: h heads × (Q, K, V) = 3h projections
  • GQA: h query heads, g groups of shared K, V = h + 2g projections

This saves memory during inference with minimal quality loss.
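The projection-count arithmetic from the bullets above, with illustrative head counts (h = 32 query heads, g = 8 K/V groups are assumptions for the example, not a specific model's config):

```python
h = 32   # query heads
g = 8    # groups of shared K, V

mha_projections = 3 * h       # standard MHA: every head has its own Q, K, V
gqa_projections = h + 2 * g   # GQA: h query heads + g shared (K, V) pairs

print(mha_projections)  # 96
print(gqa_projections)  # 48 — half the projections, far less KV cache at inference
```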

Sliding Window Attention

Instead of attending to all previous tokens, attend only to a local window. Mistral uses a window of 4096 tokens. Information still propagates globally across layers.
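The mask for this is the causal mask restricted to a band. A toy sketch with a window of 3 (Mistral's real window is 4096):

```python
import numpy as np

# Sliding-window causal mask: token i attends to tokens in [i - window + 1, i]
seq_len, window = 6, 3
i = np.arange(seq_len)[:, None]
j = np.arange(seq_len)[None, :]
mask = (j <= i) & (j > i - window)

print(mask.astype(int))
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
```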

Mixture of Experts (MoE)

Replace the single FFN with multiple “expert” FFNs. A router picks which 2 experts handle each token. This lets you scale parameters without scaling computation proportionally. Used in Mixtral and (reportedly) GPT-4.
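A minimal sketch of top-2 routing. The expert FFNs are stood in by single matrices, and all sizes are illustrative; real MoE layers also add load-balancing losses and batched expert dispatch:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, seq_len = 16, 8, 4

router_W = rng.standard_normal((d_model, n_experts)) * 0.1
experts = [rng.standard_normal((d_model, d_model)) * 0.02
           for _ in range(n_experts)]            # stand-ins for full FFNs

x = rng.standard_normal((seq_len, d_model))
logits = x @ router_W                            # (seq_len, n_experts) router scores
top2 = np.argsort(logits, axis=-1)[:, -2:]       # indices of the 2 best experts per token

out = np.zeros_like(x)
for t in range(seq_len):
    chosen = logits[t, top2[t]]
    w = np.exp(chosen) / np.exp(chosen).sum()    # softmax over the 2 winners only
    for weight, e in zip(w, top2[t]):
        out[t] += weight * (x[t] @ experts[e])   # only 2 of 8 experts actually run

print(out.shape)  # (4, 16) — all parameters exist, but most sit idle per token
```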


Next: End-to-End Walkthrough

Now that you understand the architecture, let’s trace data through a complete transformer — from input text to predicted next token. →