Layer Norm & Residuals

Why Do We Need These?

Imagine building GPT-3 with 96 layers. Without two critical tricks, your model would be completely untrainable:

  1. Without normalization: Activations explode to infinity or collapse to zero as data flows through dozens of layers
  2. Without residual connections: Gradients vanish during backpropagation — early layers receive near-zero gradients and can’t learn

These aren’t fancy optimizations — they’re essential prerequisites for any deep transformer.


Residual Connections: The Gradient Highway

The Problem

In a deep network without skip connections, the gradient must flow backward through every layer’s transformation:

$$\frac{\partial \text{output}}{\partial \text{input}} = f'_N \cdot f'_{N-1} \cdot \ldots \cdot f'_1$$

If each $f'_i$ is slightly less than 1, the product vanishes exponentially. Sound familiar? It's the same vanishing gradient problem we saw with RNNs — but across layers instead of across time!

The Solution: Skip Connections

$$\text{output} = x + f(x)$$

Instead of replacing $x$ with $f(x)$, we add $f(x)$ to the original $x$. The gradient through this operation is:

$$\frac{\partial (x + f(x))}{\partial x} = 1 + \frac{\partial f}{\partial x}$$

Even if $\frac{\partial f}{\partial x}$ is tiny, the gradient is still approximately 1! The identity connection creates a "highway" for gradients to flow back unimpeded.
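A tiny numeric sketch makes the contrast concrete. Assume (purely for illustration) that every layer's local derivative $f'_i$ is a small constant, 0.1. Without residuals, backprop multiplies the gradient by $f'_i$ at each layer; with residuals, by $1 + f'_i$:

```python
# Toy illustration of gradient flow (assumed constant local derivative,
# not a real network): without residuals each backward step multiplies
# the gradient by f'_i; with residuals, by (1 + f'_i).
n_layers = 50
f_prime = 0.1  # assumed small per-layer derivative

grad_plain = f_prime ** n_layers           # 0.1^50: vanishes to nothing
grad_residual = (1 + f_prime) ** n_layers  # 1.1^50: stays well above 1

print(f"without residuals: {grad_plain:.1e}")
print(f"with residuals:    {grad_residual:.1f}")
```

The first gradient is around $10^{-50}$, far below float precision for any meaningful update; the second never drops below 1, so even the earliest layers receive a usable signal.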

See It In Action

Toggle between “With Residuals” and “Without Residuals” to see how gradient magnitude changes across layers. Try increasing the layer count to see how the problem gets worse:

Gradient Flow Through Layers

[Interactive chart: gradient magnitude per layer with residuals enabled, rising from 0.82 at layer L1 to 1.00 at the output.]
✅ With residual connections: Gradients flow freely through the "+" shortcut. Even with 10 layers, the gradient at layer 1 is 0.82 — the model can learn! Since ∂(x + f(x))/∂x = 1 + ∂f/∂x, each backward factor stays close to 1 even when ∂f/∂x is tiny.
🤔 Quick Check
What would happen if you trained a 50-layer transformer without residual connections?

Layer Normalization

The Problem: Internal Covariate Shift

As activations flow through layers, their distribution shifts:

| Layer | Mean | Std Dev | Status |
|-------|------|---------|--------|
| 1 | 0.0 | 1.0 | ✅ Healthy |
| 5 | 3.2 | 15.7 | ⚠️ Drifting |
| 10 | 874 | 312 | ❌ Exploding! |

Later layers must constantly adapt to a moving target. Training becomes slow and unstable.
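You can reproduce this drift with a quick simulation. A minimal sketch, assuming random linear layers with a gain slightly above 1 (the layer count, dimension, and gain are illustrative choices, not values from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal(d)

# Push one vector through 10 random linear layers with no normalization.
# Weight std of 1.3/sqrt(d) gives each layer an effective gain of ~1.3,
# so the activation std grows roughly geometrically.
stds = []
for _ in range(10):
    W = rng.standard_normal((d, d)) * (1.3 / np.sqrt(d))
    x = W @ x
    stds.append(float(x.std()))

print([round(s, 2) for s in stds])
```

With a gain even modestly above 1, the standard deviation compounds layer by layer; with a gain below 1, it collapses toward zero instead. Normalization resets the scale at every layer so neither happens.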

The Solution: Normalize Each Vector

Layer normalization normalizes each vector independently across its features:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where:

  • $\mu = \frac{1}{d}\sum_i x_i$ (mean of the vector)
  • $\sigma^2 = \frac{1}{d}\sum_i (x_i - \mu)^2$ (variance)
  • $\gamma, \beta$ are learnable parameters (scale and shift)
  • $\epsilon$ is a small constant (e.g., $10^{-6}$) for numerical stability

Interactive: Watch Normalization Step by Step

Click through the tabs to see how layer normalization transforms a vector. Adjust γ and β to see the effect of the learned scale and shift:

Layer Normalization Step by Step

[Interactive widget: raw activations (range −3.0 to 2.2) across 8 dimensions: d0 = −2.95, d1 = −1.15, d2 = 1.31, d3 = −1.40, d4 = −1.97, d5 = 1.31, d6 = 2.24, d7 = −1.57, giving μ = −0.522 and σ = 1.753.]
LayerNorm(x) = γ · (x − μ) / σ + β
import numpy as np

class LayerNorm:
    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)   # Learnable scale
        self.beta = np.zeros(d_model)   # Learnable shift
        self.eps = eps
    
    def forward(self, x):
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        
        x_norm = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta
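A quick sanity check of the normalization math (written here as a standalone function with the default γ = 1, β = 0, so it matches the class above before any training): no matter how wildly scaled the input is, every row comes out with mean ≈ 0 and std ≈ 1.

```python
import numpy as np

# Functional form of layer normalization with gamma=1, beta=0.
def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Wildly scaled input: std ~50, mean ~10.
x = np.random.default_rng(1).standard_normal((4, 8)) * 50 + 10
out = layer_norm(x)

print(out.mean(axis=-1))  # ~0 for every row
print(out.std(axis=-1))   # ~1 for every row
```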

Why Not Batch Normalization?

You might know Batch Normalization from CNNs. It normalizes across the batch dimension. But for transformers:

| Aspect | BatchNorm | LayerNorm |
|--------|-----------|-----------|
| Normalizes across | Batch (examples) | Features (dimensions) |
| Requires | Large batch size | Works with any batch size |
| At inference | Needs running statistics | No extra state needed |
| For sequences | Problematic (variable length) | Works naturally |

LayerNorm is simpler and works better for transformers.
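The key difference is easy to verify: because LayerNorm only looks at one vector's own features, the result for an example is identical whether it is processed alone or inside a batch. A small check (the functional `layer_norm` below is a sketch, not a library API):

```python
import numpy as np

# LayerNorm normalizes each example independently of the batch.
def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.standard_normal((8, 16))

alone = layer_norm(batch[:1])     # the first example, batch size 1
in_batch = layer_norm(batch)[:1]  # the same example inside a batch of 8

print(np.allclose(alone, in_batch))  # True: batch contents don't matter
```

BatchNorm would fail this test, since its statistics depend on the other seven examples.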


Putting It Together: The Transformer Block

Every transformer block uses both tricks. The standard pattern (Pre-LN):

$$x' = x + \text{Attention}(\text{LN}(x))$$
$$\text{output} = x' + \text{FFN}(\text{LN}(x'))$$

Pre-LN vs Post-LN

| | Pre-LN (modern) | Post-LN (original) |
|---|---|---|
| Order | LN → Sublayer → Add | Sublayer → Add → LN |
| Stability | More stable | Less stable |
| Warmup needed | Optional | Critical |
| Used by | GPT-2, GPT-3, LLaMA | Original Transformer, BERT |
class TransformerBlock:
    def __init__(self, d_model, n_heads, d_ff):
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.ln1 = LayerNorm(d_model)
        self.ln2 = LayerNorm(d_model)
    
    def forward(self, x, mask=None):
        # Sub-layer 1: Attention with residual + norm
        x = x + self.attn(self.ln1(x), mask)
        
        # Sub-layer 2: FFN with residual + norm
        x = x + self.ffn(self.ln2(x))
        
        return x
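To see the Pre-LN wiring in isolation, here is a minimal sketch that swaps the real attention and FFN sub-layers for stand-in linear maps (`W1`, `W2` are placeholders, not actual model components). It shows the two properties the block guarantees: the shape is preserved, and each sub-layer only *adds* to the residual stream.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
d_model = 16
W1 = rng.standard_normal((d_model, d_model)) * 0.02  # stand-in "attention"
W2 = rng.standard_normal((d_model, d_model)) * 0.02  # stand-in "FFN"

x = rng.standard_normal((4, d_model))  # (seq_len, d_model)
x = x + layer_norm(x) @ W1             # sub-layer 1: pre-norm, then residual add
x = x + layer_norm(x) @ W2             # sub-layer 2: pre-norm, then residual add

print(x.shape)  # (4, 16): shape preserved through the block
```

Because normalization happens *inside* each branch, the residual stream itself is never normalized, which is exactly what keeps the gradient highway open in deep Pre-LN stacks.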

A Modern Variant: RMSNorm

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2 + \epsilon}} \cdot \gamma$$

RMSNorm skips the mean subtraction — it only normalizes by the root-mean-square. No bias parameter either.

Why use it? About 10% faster than LayerNorm (one fewer operation), and empirically works just as well.

Used by: LLaMA, LLaMA 2, Gemma, Mistral

import numpy as np

class RMSNorm:
    def __init__(self, d_model, eps=1e-6):
        self.gamma = np.ones(d_model)
        self.eps = eps
    
    def forward(self, x):
        rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + self.eps)
        return self.gamma * x / rms
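A sanity check on the difference from LayerNorm (functional form with γ = 1, matching the class above at initialization): RMSNorm forces each vector's root-mean-square to ≈ 1, but since it never subtracts μ, the output mean is generally *not* zero.

```python
import numpy as np

# Functional RMSNorm with gamma=1: rescale by the root-mean-square only.
def rms_norm(x, eps=1e-6):
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms

# Input with a deliberately nonzero mean.
x = np.random.default_rng(2).standard_normal((3, 8)) * 7 + 4
out = rms_norm(x)

print(np.sqrt((out**2).mean(axis=-1)))  # ~1 per row: RMS is normalized
print(out.mean(axis=-1))                # nonzero: mean is NOT removed
```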

Summary

| Technique | What It Does | Why It Matters |
|-----------|--------------|----------------|
| Residual Connection | $x + f(x)$ | Gradient highway — prevents vanishing gradients |
| Layer Normalization | Normalize each vector to mean 0, std 1 | Prevents activation explosion, stabilizes training |
| Together | Used in every transformer block | Makes 96+ layer networks trainable |

Key Equations

Residual Connection

$$\text{output} = x + \text{sublayer}(x)$$

Layer Normalization

$$\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Pre-LN Transformer Block

$$x' = x + \text{Attention}(\text{LN}(x))$$
$$\text{output} = x' + \text{FFN}(\text{LN}(x'))$$


Next: The Full Transformer

You now know every component. It’s time to put them all together into the complete transformer architecture →