Vanishing Gradients & Gated RNNs

In the backward pass, we saw how gradients flow through time via repeated multiplication by $W_{hh}^T$. Now we'll see why this causes a fundamental problem, and the elegant architectural solutions that fix it.

The Problem

Train an RNN on long sequences and plot gradient magnitude at each timestep:

Gradient Magnitude Through Time

[Figure: $\|dh_t\|$ plotted for t = 0 … 19, decaying smoothly from 1.0 at t = 19 down to about 0.006 at t = 0.]

Gradients decaying: with $|\lambda| = 0.90$, gradients shrink to roughly 0.006 by t = 0. Short sequences work, but long-range dependencies are hard to learn. Effective per-step factor: $|\lambda| \times \tanh'(h) \approx 0.77$; over 19 steps, $0.77^{19} \approx 0.007$.


Consequence: early tokens don’t learn. The network can’t capture long-range dependencies.
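The per-step arithmetic behind the figure is worth checking by hand; a few lines of plain Python (no framework assumed) reproduce it:

```python
# Effective per-step gradient factor: |lambda| * tanh'(h), roughly 0.77
factor = 0.77
T = 19
grad_scale = factor ** T
print(f"gradient scale after {T} steps: {grad_scale:.4f}")  # ~ 0.007
```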


Mathematical Analysis

Gradient Flow Through Time

From BPTT, each backward step multiplies by:

$$dh_{t-1} = W_{hh}^T \cdot \text{diag}(1 - h_t^2) \cdot dh_t$$

Over the full sequence:

$$dh_1 = \prod_{t=2}^{T} \left[ W_{hh}^T \cdot \text{diag}(1 - h_t^2) \right] \cdot dh_T$$

The gradient’s fate depends on two factors:

1. Eigenvalues of $W_{hh}$

Let $\lambda$ be the largest-magnitude eigenvalue of $W_{hh}$:

| $|\lambda|$ | Gradient behavior | After 100 steps |
|---|---|---|
| 0.9 | Vanishes exponentially | $0.9^{100} \approx 0.00003$ |
| 1.0 | Preserved (ideal but unstable) | $1.0^{100} = 1$ |
| 1.1 | Explodes exponentially | $1.1^{100} \approx 13{,}780$ |

With typical random initialization ($W \sim \mathcal{N}(0, 0.01)$), eigenvalues are almost always $< 1$, so vanishing is the default.
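To see this empirically, here is a small NumPy sketch; the hidden size and seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 64  # illustrative hidden size
# N(0, 0.01) means variance 0.01, i.e. standard deviation 0.1
W_hh = rng.normal(0.0, 0.1, size=(H, H))
# Spectral radius: largest eigenvalue magnitude
radius = np.max(np.abs(np.linalg.eigvals(W_hh)))
print(f"spectral radius: {radius:.3f}")  # roughly 0.8 here, below 1
```

For an $H \times H$ matrix with i.i.d. entries of standard deviation $s$, the spectral radius concentrates near $s\sqrt{H}$ (here $0.1 \times 8 = 0.8$), which is why vanishing is the default at this scale.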

2. Tanh Saturation

The tanh derivative $(1 - h_t^2)$ further attenuates gradients at every step:

| $h_t$ | $(1 - h_t^2)$ | Effect |
|---|---|---|
| 0.0 | 1.0 | Full gradient flow |
| ±0.5 | 0.75 | Mild attenuation |
| ±0.9 | 0.19 | Heavy attenuation |
| ±0.99 | 0.02 | 98% gradient loss per step |

These two effects compound: even if eigenvalues are near 1, saturated tanh neurons still kill the gradient.
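A two-line sketch shows how quickly the compounding bites, even with an eigenvalue deliberately close to 1 (the numbers here are illustrative):

```python
lam = 0.95         # largest eigenvalue magnitude, close to 1
tanh_prime = 0.75  # tanh'(h) for mildly saturated units (h around +/- 0.5)
T = 50
scale = (lam * tanh_prime) ** T
print(f"gradient scale after {T} steps: {scale:.2e}")  # on the order of 1e-8
```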


Why This Matters

Consider: “The cat, which was sitting on the mat in the corner of the room, was sleeping.”

The subject “cat” determines “was” (not “were”), but they’re 14 words apart. With vanishing gradients, the signal from “was” barely reaches “cat” — the model can’t learn this dependency.

Practical Limits of Vanilla RNNs

| Sequence length | Can RNN learn long-range dependencies? |
|---|---|
| < 20 tokens | Usually works ✅ |
| 20–50 tokens | Struggles ⚠️ |
| > 50 tokens | Fails ❌ |
🤔 Quick Check
Gradient clipping solves the vanishing gradient problem. True or false?

Attempted Fixes (Partial)

| Fix | Helps with | Limitation |
|---|---|---|
| Gradient clipping | Explosion ✅ | Can't fix vanishing ❌ |
| Orthogonal initialization | Initial gradient flow ✅ | Training changes $W_{hh}$; orthogonality is lost ❌ |
| Careful learning rate | Stability ✅ | Doesn't address the root cause ❌ |

The real fix requires a change in architecture — replacing multiplicative state updates with additive ones.


The Solution: LSTM

The Key Insight: Addition vs Multiplication

In vanilla RNN, the hidden state is overwritten at each step via matrix multiplication — gradients flow through tanh and decay.

LSTM adds a cell state updated via addition:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

The $+$ creates a gradient highway: if $f_t \approx 1$, gradients pass through unchanged.
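The highway claim can be verified with a finite-difference check on the cell update (the gate values below are made up for illustration):

```python
import numpy as np

f_t = np.array([0.99, 0.5, 0.1])  # forget gate values per unit
new_info = 0.3 * 0.8              # input gate * candidate, held fixed
c_prev = np.array([1.0, -0.5, 2.0])

def cell_update(c):
    # Additive update: c_t = f_t * c_{t-1} + i_t * c_tilde
    return f_t * c + new_info

eps = 1e-6
grad = (cell_update(c_prev + eps) - cell_update(c_prev)) / eps
print(grad)  # matches f_t: the gradient through the + is just the forget gate
```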

The Four Gates

| Gate | Symbol | Formula | Purpose |
|---|---|---|---|
| Forget | $f_t$ | $\sigma(W_f [h_{t-1}, x_t] + b_f)$ | How much old memory to keep |
| Input | $i_t$ | $\sigma(W_i [h_{t-1}, x_t] + b_i)$ | How much new info to write |
| Candidate | $\tilde{c}_t$ | $\tanh(W_c [h_{t-1}, x_t] + b_c)$ | What new info to write |
| Output | $o_t$ | $\sigma(W_o [h_{t-1}, x_t] + b_o)$ | What to expose as output |

Data Flow

c_{t-1} ──(×)────────────(+)────── c_t
           ↑               ↑
          f_t             i_t × c̃_t
           ↑               ↑
        ┌──────────────────────┐
        │   σ   σ  tanh   σ   │
        │   f   i    c̃   o   │
        └──────────────────────┘
              ↑         ↑
           h_{t-1}     x_t

h_t = o_t × tanh(c_t)

The cell state ctc_t flows along the top like a conveyor belt — information can be added or removed via gates, but the default path is to simply pass through unchanged.

✍️ Fill in the Blanks
The LSTM cell state is updated via ________ (not multiplication), which creates a gradient ________ that prevents vanishing.

Why LSTM Prevents Vanishing Gradients

In the backward pass, the cell state gradient is simply:

$$\frac{\partial c_t}{\partial c_{t-1}} = f_t$$

Compare the two architectures over 100 timesteps:

| Architecture | Gradient formula | After 100 steps |
|---|---|---|
| Vanilla RNN | $\prod W_{hh}^T \cdot \text{diag}(1 - h^2)$ | $\approx 0.00003$ (vanished) |
| LSTM | $\prod f_t$ | $0.99^{100} \approx 0.37$ (10,000× better) |

The forget gate $f_t$ is a learned gate value near 1, not a full matrix multiplication through tanh. This is the fundamental difference.
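The table's 10,000× claim is a one-liner to reproduce:

```python
vanilla = 0.9 ** 100  # effective per-step factor below 1, compounded
lstm = 0.99 ** 100    # forget gate learned to sit near 1
print(f"vanilla: {vanilla:.1e}  lstm: {lstm:.2f}  ratio: {lstm / vanilla:,.0f}x")
```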


GRU: A Simpler Alternative

GRU achieves similar results with only 2 gates and 1 state (no separate cell state):

GRU Equations

$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r) \quad \text{(reset gate)}$$
$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z) \quad \text{(update gate)}$$
$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h) \quad \text{(candidate)}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad \text{(interpolate)}$$

The update gate $z_t$ plays the role of both the forget and input gates: when $z_t \approx 0$, the old state passes through unchanged (like $f_t \approx 1$ in LSTM).
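For symmetry with the LSTMCell implementation later in this section, here is a minimal NumPy sketch of a GRU cell; the stacked-weight layout is a compactness choice, not a canonical API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size
        scale = 0.1
        # Reset and update gates share one matmul: [W_r, W_z] stacked
        self.W_rz = np.random.randn(2 * hidden_size, input_size + hidden_size) * scale
        self.b_rz = np.zeros(2 * hidden_size)
        # Candidate needs its own matmul because it sees r_t * h_prev
        self.W_h = np.random.randn(hidden_size, input_size + hidden_size) * scale
        self.b_h = np.zeros(hidden_size)

    def forward(self, x_t, h_prev):
        H = self.hidden_size
        rz = sigmoid(self.W_rz @ np.concatenate([h_prev, x_t]) + self.b_rz)
        r_t, z_t = rz[:H], rz[H:]
        h_tilde = np.tanh(self.W_h @ np.concatenate([r_t * h_prev, x_t]) + self.b_h)
        # Interpolate: z_t near 0 keeps the old state unchanged
        return (1 - z_t) * h_prev + z_t * h_tilde
```

Note the single state: where LSTM's forward returns both $h_t$ and $c_t$, GRU returns only $h_t$.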

LSTM vs GRU

| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 4 (forget, input, candidate, output) | 2 (reset, update) |
| States | 2 ($h$ and $c$) | 1 ($h$ only) |
| Parameters | More | ~25% fewer |
| Performance | Similar | Similar |
| Best for | Maximum capacity | Speed / simplicity |
🤔 Quick Check
What happens in an LSTM if the forget gate f_t = 1 for all timesteps?

Key Equations

LSTM

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$

GRU

$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$$
$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$$
$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size
        scale = 0.1
        # Combined weights: [W_f, W_i, W_c, W_o] stacked for efficiency
        self.W = np.random.randn(4 * hidden_size, input_size + hidden_size) * scale
        self.b = np.zeros(4 * hidden_size)
        self.b[:hidden_size] = 1.0  # Forget gate bias = 1 (remember by default)

    def forward(self, x_t, h_prev, c_prev):
        H = self.hidden_size
        concat = np.concatenate([h_prev, x_t])
        gates = self.W @ concat + self.b  # All four gate pre-activations in one matmul

        f_t = sigmoid(gates[0*H:1*H])      # Forget
        i_t = sigmoid(gates[1*H:2*H])      # Input
        c_tilde = np.tanh(gates[2*H:3*H])  # Candidate
        o_t = sigmoid(gates[3*H:4*H])      # Output

        c_t = f_t * c_prev + i_t * c_tilde  # Cell state (ADDITION!)
        h_t = o_t * np.tanh(c_t)
        return h_t, c_t
```

Connection to Transformers

In self-attention, the gradient from token jj to token ii is:

$$\frac{\partial\, \text{output}_j}{\partial\, \text{input}_i} = \alpha_{j,i} \cdot W_V$$

This is a single multiplication, not a product over $|j-i|$ steps:

| Architecture | Gradient path for distance $d$ | Multiplications |
|---|---|---|
| Vanilla RNN | $\prod_{k=1}^{d} W_{hh}^T \cdot \text{diag}(1 - h_k^2)$ | $d$ (vanishes) |
| LSTM | $\prod_{k=1}^{d} f_k$ | $d$ (much better) |
| Transformer | $\alpha_{j,i} \cdot W_V$ | 1 (constant!) |

This is a fundamental reason why Transformers replaced RNNs for most tasks.


When to Use What

Need to model sequences?
├── Sequence length < 50? → Vanilla RNN might work
├── Can parallelize?
│   ├── Yes → Transformer (preferred)
│   └── No → LSTM/GRU
└── Training speed critical?
    ├── Yes → GRU
    └── No → LSTM

Modern Perspective

Since 2017, Transformers have largely replaced LSTMs for most NLP tasks. LSTMs remain useful for:

  • Streaming/online processing (one token at a time)
  • Memory-constrained deployments
  • Time series with strict ordering

Exercises

  1. Eigenvalue experiment: If $W_{hh}$ has max eigenvalue 0.95, what fraction of the gradient remains after 50 timesteps?

    Answer: $0.95^{50} \approx 0.077$ — only 7.7% of the gradient survives. After 100 steps: $0.95^{100} \approx 0.006$ — essentially gone.

  2. Forget gate analysis: What happens if $f_t = 0$ for all timesteps? What if $f_t = 1$?

    Answer: $f_t = 0$: cell state resets each step (like a vanilla RNN). $f_t = 1$: cell state accumulates forever (perfect memory).

  3. GRU as LSTM: Show that GRU is roughly equivalent to LSTM with: forget and input gates tied ($f_t = 1 - i_t$), no output gate ($o_t = 1$), and cell state = hidden state.

  4. Parameter count: For input_size=256 and hidden_size=512, how many parameters does an LSTM cell have vs a GRU cell?

    Answer: LSTM: $4 \times 512 \times (256 + 512) + 4 \times 512 = 1{,}574{,}912$. GRU: $3 \times 512 \times (256 + 512) + 3 \times 512 = 1{,}181{,}184$. GRU is ~25% smaller.
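Exercise 4's arithmetic can be checked with two hypothetical helper functions (assuming, as in this section's equations, that each block has one weight matrix over the concatenated $[h, x]$ plus a bias vector):

```python
def lstm_params(input_size, hidden_size):
    # 4 blocks (forget, input, candidate, output), each W over [h, x] plus bias
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def gru_params(input_size, hidden_size):
    # 3 blocks (reset, update, candidate), each W over [h, x] plus bias
    return 3 * (hidden_size * (input_size + hidden_size) + hidden_size)

print(lstm_params(256, 512))  # 1574912
print(gru_params(256, 512))   # 1181184
```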

Next Steps

With RNN architectures complete, we turn to how tokens enter the network. One-hot vectors are wasteful — learned embeddings are better. Next: Embeddings →