Vanishing Gradients & Gated RNNs

In the backward pass, we saw how gradients flow through time via repeated multiplication by $W_{hh}^T$. Now we'll see why this causes a fundamental problem, and the elegant architectural solutions that fix it.

The Problem

Train an RNN on long sequences and plot gradient magnitude at each timestep:

Gradient Magnitude Through Time

[Figure: $\|dh_t\|$ plotted for t = 0 … 19, decaying smoothly from 1.0 at t = 19 down to about 0.006 at t = 0.]

Gradients decaying: with $|\lambda| = 0.90$, gradients shrink to roughly 0.006 by t = 0. Short sequences work, but long-range dependencies are hard to learn. Effective per-step factor: $|\lambda| \times \tanh'(h) \approx 0.77$; over 19 steps, $0.77^{19} \approx 0.007$.


Consequence: early tokens don’t learn. The network can’t capture long-range dependencies.
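The per-step arithmetic behind the figure is worth checking by hand; a few lines of plain Python (no framework assumed) reproduce it:

```python
# Effective per-step gradient factor: |lambda| * tanh'(h), roughly 0.77
factor = 0.77
T = 19
grad_scale = factor ** T
print(f"gradient scale after {T} steps: {grad_scale:.4f}")  # ~ 0.007
```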


Mathematical Analysis

Gradient Flow Through Time

From BPTT, each backward step multiplies by:

$$dh_{t-1} = W_{hh}^T \cdot \text{diag}(1 - h_t^2) \cdot dh_t$$

Over the full sequence:

$$dh_1 = \prod_{t=2}^{T} \left[ W_{hh}^T \cdot \text{diag}(1 - h_t^2) \right] \cdot dh_T$$

The gradient’s fate depends on two factors:

1. Eigenvalues of $W_{hh}$

Let $\lambda$ be the largest-magnitude eigenvalue of $W_{hh}$:

| $|\lambda|$ | Gradient behavior | After 100 steps |
|---|---|---|
| 0.9 | Vanishes exponentially | $0.9^{100} \approx 0.00003$ |
| 1.0 | Preserved (ideal but unstable) | $1.0^{100} = 1$ |
| 1.1 | Explodes exponentially | $1.1^{100} \approx 13{,}780$ |

With typical random initialization ($W \sim \mathcal{N}(0, 0.01)$), eigenvalues are almost always $< 1$, so vanishing is the default.
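To see this empirically, here is a small NumPy sketch; the hidden size and seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 64  # illustrative hidden size
# N(0, 0.01) means variance 0.01, i.e. standard deviation 0.1
W_hh = rng.normal(0.0, 0.1, size=(H, H))
# Spectral radius: largest eigenvalue magnitude
radius = np.max(np.abs(np.linalg.eigvals(W_hh)))
print(f"spectral radius: {radius:.3f}")  # roughly 0.8 here, below 1
```

For an $H \times H$ matrix with i.i.d. entries of standard deviation $s$, the spectral radius concentrates near $s\sqrt{H}$ (here $0.1 \times 8 = 0.8$), which is why vanishing is the default at this scale.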

2. Tanh Saturation

The tanh derivative $(1 - h_t^2)$ further attenuates gradients at every step:

| $h_t$ | $(1 - h_t^2)$ | Effect |
|---|---|---|
| 0.0 | 1.0 | Full gradient flow |
| ±0.5 | 0.75 | Mild attenuation |
| ±0.9 | 0.19 | Heavy attenuation |
| ±0.99 | 0.02 | 98% gradient loss per step |

These two effects compound: even if eigenvalues are near 1, saturated tanh neurons still kill the gradient.
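A two-line sketch shows how quickly the compounding bites, even with an eigenvalue deliberately close to 1 (the numbers here are illustrative):

```python
lam = 0.95         # largest eigenvalue magnitude, close to 1
tanh_prime = 0.75  # tanh'(h) for mildly saturated units (h around +/- 0.5)
T = 50
scale = (lam * tanh_prime) ** T
print(f"gradient scale after {T} steps: {scale:.2e}")  # on the order of 1e-8
```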


Why This Matters

Consider: “The cat, which was sitting on the mat in the corner of the room, was sleeping.”

The subject “cat” determines “was” (not “were”), but they’re 14 words apart. With vanishing gradients, the signal from “was” barely reaches “cat” — the model can’t learn this dependency.

Practical Limits of Vanilla RNNs

| Sequence length | Can RNN learn long-range dependencies? |
|---|---|
| < 20 tokens | Usually works ✅ |
| 20–50 tokens | Struggles ⚠️ |
| > 50 tokens | Fails ❌ |
🤔 Quick Check
Gradient clipping solves the vanishing gradient problem. True or false?

Attempted Fixes (Partial)

| Fix | Helps with | Limitation |
|---|---|---|
| Gradient clipping | Explosion ✅ | Can't fix vanishing ❌ |
| Orthogonal initialization | Initial gradient flow ✅ | Training changes $W_{hh}$; orthogonality is lost ❌ |
| Careful learning rate | Stability ✅ | Doesn't address the root cause ❌ |

The real fix requires a change in architecture — replacing multiplicative state updates with additive ones.


The Solution: LSTM

The Key Insight: Addition vs Multiplication

In vanilla RNN, the hidden state is overwritten at each step via matrix multiplication — gradients flow through tanh and decay.

LSTM adds a cell state updated via addition:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

The $+$ creates a gradient highway: if $f_t \approx 1$, gradients pass through unchanged.
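The highway claim can be verified with a finite-difference check on the cell update (the gate values below are made up for illustration):

```python
import numpy as np

f_t = np.array([0.99, 0.5, 0.1])  # forget gate values per unit
new_info = 0.3 * 0.8              # input gate * candidate, held fixed
c_prev = np.array([1.0, -0.5, 2.0])

def cell_update(c):
    # Additive update: c_t = f_t * c_{t-1} + i_t * c_tilde
    return f_t * c + new_info

eps = 1e-6
grad = (cell_update(c_prev + eps) - cell_update(c_prev)) / eps
print(grad)  # matches f_t: the gradient through the + is just the forget gate
```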

The Four Gates

| Gate | Symbol | Formula | Purpose |
|---|---|---|---|
| Forget | $f_t$ | $\sigma(W_f [h_{t-1}, x_t] + b_f)$ | How much old memory to keep |
| Input | $i_t$ | $\sigma(W_i [h_{t-1}, x_t] + b_i)$ | How much new info to write |
| Candidate | $\tilde{c}_t$ | $\tanh(W_c [h_{t-1}, x_t] + b_c)$ | What new info to write |
| Output | $o_t$ | $\sigma(W_o [h_{t-1}, x_t] + b_o)$ | What to expose as output |

Data Flow

c_{t-1} ──(×)────────────(+)────── c_t
           ↑               ↑
          f_t             i_t × c̃_t
           ↑               ↑
        ┌──────────────────────┐
        │   σ   σ  tanh   σ   │
        │   f   i    c̃   o   │
        └──────────────────────┘
              ↑         ↑
           h_{t-1}     x_t

h_t = o_t × tanh(c_t)

The cell state ctc_t flows along the top like a conveyor belt — information can be added or removed via gates, but the default path is to simply pass through unchanged.

✍️ Fill in the Blanks
The LSTM cell state is updated via ________ (not multiplication), which creates a gradient ________ that prevents vanishing.

Why LSTM Prevents Vanishing Gradients

In the backward pass, the cell state gradient is simply:

$$\frac{\partial c_t}{\partial c_{t-1}} = f_t$$

Compare the two architectures over 100 timesteps:

| Architecture | Gradient formula | After 100 steps |
|---|---|---|
| Vanilla RNN | $\prod W_{hh}^T \cdot \text{diag}(1 - h^2)$ | $\approx 0.00003$ (vanished) |
| LSTM | $\prod f_t$ | $0.99^{100} \approx 0.37$ (10,000× better) |

The forget gate $f_t$ is a learned gate value near 1, not a full matrix multiplication through tanh. This is the fundamental difference.
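The table's 10,000× claim is a one-liner to reproduce:

```python
vanilla = 0.9 ** 100  # effective per-step factor below 1, compounded
lstm = 0.99 ** 100    # forget gate learned to sit near 1
print(f"vanilla: {vanilla:.1e}  lstm: {lstm:.2f}  ratio: {lstm / vanilla:,.0f}x")
```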


GRU: A Simpler Alternative

GRU achieves similar results with only 2 gates and 1 state (no separate cell state):

GRU Equations

$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r) \quad \text{(reset gate)}$$
$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z) \quad \text{(update gate)}$$
$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h) \quad \text{(candidate)}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \quad \text{(interpolate)}$$

The update gate $z_t$ plays the role of both the forget and input gates: when $z_t \approx 0$, the old state passes through unchanged (like $f_t \approx 1$ in LSTM).
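For symmetry with the LSTMCell implementation later in this section, here is a minimal NumPy sketch of a GRU cell; the stacked-weight layout is a compactness choice, not a canonical API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size
        scale = 0.1
        # Reset and update gates share one matmul: [W_r, W_z] stacked
        self.W_rz = np.random.randn(2 * hidden_size, input_size + hidden_size) * scale
        self.b_rz = np.zeros(2 * hidden_size)
        # Candidate needs its own matmul because it sees r_t * h_prev
        self.W_h = np.random.randn(hidden_size, input_size + hidden_size) * scale
        self.b_h = np.zeros(hidden_size)

    def forward(self, x_t, h_prev):
        H = self.hidden_size
        rz = sigmoid(self.W_rz @ np.concatenate([h_prev, x_t]) + self.b_rz)
        r_t, z_t = rz[:H], rz[H:]
        h_tilde = np.tanh(self.W_h @ np.concatenate([r_t * h_prev, x_t]) + self.b_h)
        # Interpolate: z_t near 0 keeps the old state unchanged
        return (1 - z_t) * h_prev + z_t * h_tilde
```

Note the single state: where LSTM's forward returns both $h_t$ and $c_t$, GRU returns only $h_t$.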

LSTM vs GRU

| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 4 (forget, input, candidate, output) | 2 (reset, update) |
| States | 2 ($h$ and $c$) | 1 ($h$ only) |
| Parameters | More | ~25% fewer |
| Performance | Similar | Similar |
| Best for | Maximum capacity | Speed / simplicity |
🤔 Quick Check
What happens in an LSTM if the forget gate f_t = 1 for all timesteps?

Key Equations

LSTM

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$

GRU

$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$$
$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$$
$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size
        scale = 0.1
        # Combined weights: [W_f, W_i, W_c, W_o] stacked for efficiency
        self.W = np.random.randn(4 * hidden_size, input_size + hidden_size) * scale
        self.b = np.zeros(4 * hidden_size)
        self.b[:hidden_size] = 1.0  # Forget gate bias = 1 (remember by default)

    def forward(self, x_t, h_prev, c_prev):
        H = self.hidden_size
        concat = np.concatenate([h_prev, x_t])
        gates = self.W @ concat + self.b  # All four gate pre-activations in one matmul

        f_t = sigmoid(gates[0*H:1*H])      # Forget
        i_t = sigmoid(gates[1*H:2*H])      # Input
        c_tilde = np.tanh(gates[2*H:3*H])  # Candidate
        o_t = sigmoid(gates[3*H:4*H])      # Output

        c_t = f_t * c_prev + i_t * c_tilde  # Cell state (ADDITION!)
        h_t = o_t * np.tanh(c_t)
        return h_t, c_t
```

Connection to Transformers

In self-attention, the gradient from token jj to token ii is:

$$\frac{\partial\, \text{output}_j}{\partial\, \text{input}_i} = \alpha_{j,i} \cdot W_V$$

This is a single multiplication, not a product over $|j-i|$ steps:

| Architecture | Gradient path for distance $d$ | Multiplications |
|---|---|---|
| Vanilla RNN | $\prod_{k=1}^{d} W_{hh}^T \cdot \text{diag}(1 - h_k^2)$ | $d$ (vanishes) |
| LSTM | $\prod_{k=1}^{d} f_k$ | $d$ (much better) |
| Transformer | $\alpha_{j,i} \cdot W_V$ | 1 (constant!) |

This is a fundamental reason why Transformers replaced RNNs for most tasks.


When to Use What

Need to model sequences?
├── Sequence length < 50? → Vanilla RNN might work
├── Can parallelize?
│   ├── Yes → Transformer (preferred)
│   └── No → LSTM/GRU
└── Training speed critical?
    ├── Yes → GRU
    └── No → LSTM

Modern Perspective

Since 2017, Transformers have largely replaced LSTMs for most NLP tasks. LSTMs remain useful for:

  • Streaming/online processing (one token at a time)
  • Memory-constrained deployments
  • Time series with strict ordering

Exercises

  1. Eigenvalue experiment: If $W_{hh}$ has max eigenvalue 0.95, what fraction of the gradient remains after 50 timesteps?

    Answer: $0.95^{50} \approx 0.077$ — only 7.7% of the gradient survives. After 100 steps: $0.95^{100} \approx 0.006$ — essentially gone.

  2. Forget gate analysis: What happens if $f_t = 0$ for all timesteps? What if $f_t = 1$?

    Answer: $f_t = 0$: cell state resets each step (like a vanilla RNN). $f_t = 1$: cell state accumulates forever (perfect memory).

  3. GRU as LSTM: Show that GRU is roughly equivalent to LSTM with: forget and input gates tied ($f_t = 1 - i_t$), no output gate ($o_t = 1$), and cell state = hidden state.

  4. Parameter count: For input_size=256 and hidden_size=512, how many parameters does an LSTM cell have vs a GRU cell?

    Answer: LSTM: $4 \times 512 \times (256 + 512) + 4 \times 512 = 1{,}574{,}912$. GRU: $3 \times 512 \times (256 + 512) + 3 \times 512 = 1{,}181{,}184$. GRU is ~25% smaller.
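Exercise 4's arithmetic can be checked with two hypothetical helper functions (assuming, as in this section's equations, that each block has one weight matrix over the concatenated $[h, x]$ plus a bias vector):

```python
def lstm_params(input_size, hidden_size):
    # 4 blocks (forget, input, candidate, output), each W over [h, x] plus bias
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def gru_params(input_size, hidden_size):
    # 3 blocks (reset, update, candidate), each W over [h, x] plus bias
    return 3 * (hidden_size * (input_size + hidden_size) + hidden_size)

print(lstm_params(256, 512))  # 1574912
print(gru_params(256, 512))   # 1181184
```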

Next Steps

With RNN architectures complete, we turn to how tokens enter the network. One-hot vectors are wasteful — learned embeddings are better. Next: Embeddings →