Vanishing Gradients & Gated RNNs
In the backward pass, we saw how gradients flow through time via repeated multiplication by the recurrent Jacobian $\frac{\partial h_t}{\partial h_{t-1}}$. Now we’ll see why this causes a fundamental problem — and the elegant architectural solutions that fix it.
The Problem
Train an RNN on long sequences and plot gradient magnitude at each timestep:
*[Interactive figure: Gradient Magnitude Through Time. Dragging $|\lambda|$ below 0.9 makes gradients vanish; above 1.2 they explode. Increasing the sequence length $T$ worsens the problem.]*
Consequence: early tokens don’t learn. The network can’t capture long-range dependencies.
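This decay is easy to reproduce numerically. The sketch below is a toy model, with a scalar gain $\lambda$ standing in for the recurrent Jacobian applied once per backward step:

```python
import numpy as np

def gradient_magnitude_at_start(lam, T):
    """Magnitude of a unit gradient after backpropagating T steps,
    with per-step gain |lam| (a scalar stand-in for the Jacobian)."""
    return np.abs(lam) ** T

for lam in (0.9, 1.0, 1.1):
    g = gradient_magnitude_at_start(lam, T=50)
    print(f"|lambda| = {lam}: gradient at t=0 is {g:.2e}")
```

At $|\lambda| = 0.9$ the gradient reaching the first token is already below $10^{-2}$ after 50 steps; at $|\lambda| = 1.1$ it has grown past $10^{2}$.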
Mathematical Analysis
Gradient Flow Through Time
From BPTT, each backward step multiplies by:

$$\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}\left(1 - h_t^2\right) W_{hh}^\top$$

Over the full sequence:

$$\frac{\partial \mathcal{L}}{\partial h_1} = \frac{\partial \mathcal{L}}{\partial h_T} \prod_{t=2}^{T} \text{diag}\left(1 - h_t^2\right) W_{hh}^\top$$
The gradient’s fate depends on two factors:
1. Eigenvalues of $W_{hh}$
Let $\lambda_{\max}$ be the largest eigenvalue of $W_{hh}$:
| $|\lambda_{\max}|$ | Gradient behavior | After 100 steps |
|---|---|---|
| 0.9 | Vanishes exponentially | $0.9^{100} \approx 2.7 \times 10^{-5}$ |
| 1.0 | Preserved (ideal but unstable) | $1.0^{100} = 1$ |
| 1.1 | Explodes exponentially | $1.1^{100} \approx 1.4 \times 10^{4}$ |
With typical random initialization (small-scale Gaussian weights), eigenvalues are almost always $|\lambda| < 1$ → vanishing is the default.
2. Tanh Saturation
The tanh derivative, $\tanh'(z) = 1 - \tanh^2(z) \le 1$, further attenuates gradients at every step:
| $\tanh(z)$ | $1 - \tanh^2(z)$ | Effect |
|---|---|---|
| 0.0 | 1.0 | Full gradient flow |
| ±0.5 | 0.75 | Mild attenuation |
| ±0.9 | 0.19 | Heavy attenuation |
| ±0.99 | 0.02 | 98% gradient loss per step |
These two effects compound: even if eigenvalues are near 1, saturated tanh neurons still kill the gradient.
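The compounding is visible if we multiply the actual per-step Jacobians $\text{diag}(1 - h_t^2)\,W_{hh}^\top$ of a randomly initialized RNN. The hidden size, init scale, and random inputs below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 64, 100
W_hh = rng.normal(0, 0.1, (H, H))        # small random init (illustrative)
xs = rng.normal(0, 1.0, (T, H))          # random input drive

h = np.zeros(H)
prod = np.eye(H)                          # running product of Jacobians
for t in range(T):
    h = np.tanh(W_hh @ h + xs[t])
    J = np.diag(1.0 - h**2) @ W_hh.T      # per-step Jacobian dh_t/dh_{t-1}
    prod = J @ prod
    if (t + 1) % 25 == 0:
        print(f"after {t+1:3d} steps: ||Jacobian product|| = {np.linalg.norm(prod):.2e}")
```

The product's norm collapses toward zero: the eigenvalues of this small-scale $W_{hh}$ sit below 1 in magnitude, and saturated tanh units shrink it further at every step.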
Why This Matters
Consider: “The cat, which was sitting on the mat in the corner of the room, was sleeping.”
The subject “cat” determines “was” (not “were”), but they’re 14 words apart. With vanishing gradients, the signal from “was” barely reaches “cat” — the model can’t learn this dependency.
Practical Limits of Vanilla RNNs
| Sequence Length | Can RNN Learn Long-Range Dependencies? |
|---|---|
| < 20 tokens | Usually works ✅ |
| 20–50 tokens | Struggles ⚠️ |
| > 50 tokens | Fails ❌ |
Attempted Fixes (Partial)
| Fix | Helps with | Limitation |
|---|---|---|
| Gradient clipping | Explosion ✅ | Can’t fix vanishing ❌ |
| Orthogonal initialization | Initial gradient flow ✅ | Training changes $W_{hh}$, orthogonality lost ❌ |
| Careful learning rate | Stability ✅ | Doesn’t address root cause ❌ |
The real fix requires a change in architecture — replacing multiplicative state updates with additive ones.
The Solution: LSTM
The Key Insight: Addition vs Multiplication
In vanilla RNN, the hidden state is overwritten at each step via matrix multiplication — gradients flow through tanh and decay.
LSTM adds a cell state $c_t$, updated via addition:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

The additive update creates a gradient highway: if $f_t \approx 1$, gradients pass through unchanged.
The Four Gates
| Gate | Symbol | Formula | Purpose |
|---|---|---|---|
| Forget | $f_t$ | $\sigma(W_f\,[h_{t-1}, x_t] + b_f)$ | How much old memory to keep |
| Input | $i_t$ | $\sigma(W_i\,[h_{t-1}, x_t] + b_i)$ | How much new info to write |
| Candidate | $\tilde{c}_t$ | $\tanh(W_c\,[h_{t-1}, x_t] + b_c)$ | What new info to write |
| Output | $o_t$ | $\sigma(W_o\,[h_{t-1}, x_t] + b_o)$ | What to expose as output |
Data Flow
```
c_{t-1} ──(×)────────────(+)────── c_t
           ↑               ↑
          f_t        i_t × c̃_t
           ↑               ↑
        ┌──────────────────────┐
        │   σ    σ   tanh   σ  │
        │   f    i    c̃     o  │
        └──────────────────────┘
                 ↑    ↑
             h_{t-1}  x_t

h_t = o_t × tanh(c_t)
```
The cell state flows along the top like a conveyor belt — information can be added or removed via gates, but the default path is to simply pass through unchanged.
Why LSTM Prevents Vanishing Gradients
In the backward pass, the cell state gradient is simply:
Compare the two architectures over 100 timesteps:
| Architecture | Gradient formula | After 100 steps |
|---|---|---|
| Vanilla RNN | $\prod_t \text{diag}(1 - h_t^2)\, W_{hh}^\top$ | $0.9^{100} \approx 2.7 \times 10^{-5}$ (vanished) |
| LSTM | $\prod_t f_t$ | $0.99^{100} \approx 0.37$ (10,000× better) |
The forget gate $f_t$ is a learned, per-element factor that can sit near 1 — not a full matrix multiplication through tanh. This is the fundamental difference.
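A back-of-the-envelope comparison of the two products over 100 steps (the eigenvalue 0.9 and forget-gate value 0.99 are illustrative choices):

```python
T = 100
rnn_grad = 0.9 ** T        # vanilla RNN: |lambda|^T with |lambda| = 0.9
lstm_grad = 0.99 ** T      # LSTM: product of forget gates, each f_t = 0.99

print(f"RNN:   {rnn_grad:.2e}")     # ~2.7e-05
print(f"LSTM:  {lstm_grad:.2f}")    # ~0.37
print(f"ratio: {lstm_grad / rnn_grad:,.0f}x")
```

Both decay exponentially, but a base of 0.99 instead of 0.9 leaves roughly four orders of magnitude more gradient after 100 steps.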
GRU: A Simpler Alternative
GRU achieves similar results with only 2 gates and 1 state (no separate cell state):
GRU Equations

$$z_t = \sigma(W_z\,[h_{t-1}, x_t] + b_z) \quad \text{(update gate)}$$
$$r_t = \sigma(W_r\,[h_{t-1}, x_t] + b_r) \quad \text{(reset gate)}$$
$$\tilde{h}_t = \tanh(W_h\,[r_t \odot h_{t-1}, x_t] + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

The update gate $z_t$ plays the role of both the forget and input gates — when $z_t = 0$, the old state passes through unchanged (like $f_t = 1$ in LSTM).
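As a sketch in the same NumPy style as the LSTM cell later in this section (weight shapes and init scale are illustrative, not a reference implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GRUCell:
    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size
        scale = 0.1
        # Stacked weights for the reset and update gates: [W_r, W_z]
        self.W_gates = np.random.randn(2 * hidden_size, input_size + hidden_size) * scale
        self.b_gates = np.zeros(2 * hidden_size)
        # Separate weights for the candidate, since it sees r_t * h_prev
        self.W_h = np.random.randn(hidden_size, input_size + hidden_size) * scale
        self.b_h = np.zeros(hidden_size)

    def forward(self, x_t, h_prev):
        H = self.hidden_size
        gates = self.W_gates @ np.concatenate([h_prev, x_t]) + self.b_gates
        r_t = sigmoid(gates[:H])       # Reset gate
        z_t = sigmoid(gates[H:2*H])    # Update gate
        h_tilde = np.tanh(self.W_h @ np.concatenate([r_t * h_prev, x_t]) + self.b_h)
        h_t = (1 - z_t) * h_prev + z_t * h_tilde  # Interpolation: z_t = 0 keeps old state
        return h_t
```

Note there is only one state to carry between steps, and the "keep the old state" path is again additive interpolation rather than a matrix multiply.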
LSTM vs GRU
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 4 (forget, input, candidate, output) | 2 (reset, update) |
| States | 2 ($h_t$ and $c_t$) | 1 ($h_t$ only) |
| Parameters | More | ~25% fewer |
| Performance | Similar | Similar |
| Best for | Maximum capacity | Speed / simplicity |
Key Equations
LSTM

$$f_t = \sigma(W_f\,[h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i\,[h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c\,[h_{t-1}, x_t] + b_c)$$
$$o_t = \sigma(W_o\,[h_{t-1}, x_t] + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

GRU

$$z_t = \sigma(W_z\,[h_{t-1}, x_t] + b_z)$$
$$r_t = \sigma(W_r\,[h_{t-1}, x_t] + b_r)$$
$$\tilde{h}_t = \tanh(W_h\,[r_t \odot h_{t-1}, x_t] + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LSTMCell:
    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size
        scale = 0.1
        # Combined weights: [W_f, W_i, W_c, W_o] stacked for efficiency
        self.W = np.random.randn(4 * hidden_size, input_size + hidden_size) * scale
        self.b = np.zeros(4 * hidden_size)
        self.b[:hidden_size] = 1.0  # Forget gate bias = 1 (remember by default)

    def forward(self, x_t, h_prev, c_prev):
        H = self.hidden_size
        concat = np.concatenate([h_prev, x_t])
        gates = self.W @ concat + self.b     # All gates in one matmul
        f_t = sigmoid(gates[0*H:1*H])        # Forget
        i_t = sigmoid(gates[1*H:2*H])        # Input
        c_tilde = np.tanh(gates[2*H:3*H])    # Candidate
        o_t = sigmoid(gates[3*H:4*H])        # Output
        c_t = f_t * c_prev + i_t * c_tilde   # Cell state (ADDITION!)
        h_t = o_t * np.tanh(c_t)
        return h_t, c_t
```

Connection to Transformers
In self-attention, the gradient from token $i$ back to token $j$ flows through a single attention weight:

$$\frac{\partial\, \text{output}_i}{\partial\, \text{value}_j} = a_{ij}$$

This is a single multiplication — not a product over $t$ steps:
| Architecture | Gradient path for distance $t$ | Multiplications |
|---|---|---|
| Vanilla RNN | $\lambda^t$ (vanishes) | $t$ |
| LSTM | $\prod_t f_t$ (much better) | $t$ |
| Transformer | direct attention edge | 1 (constant!) |
This is a fundamental reason why Transformers replaced RNNs for most tasks.
When to Use What
```
Need to model sequences?
├── Sequence length < 50? → Vanilla RNN might work
├── Can parallelize?
│   ├── Yes → Transformer (preferred)
│   └── No  → LSTM/GRU
└── Training speed critical?
    ├── Yes → GRU
    └── No  → LSTM
```
Modern Perspective
Since 2017, Transformers have largely replaced LSTMs for most NLP tasks. LSTMs remain useful for:
- Streaming/online processing (one token at a time)
- Memory-constrained deployments
- Time series with strict ordering
Exercises
1. **Eigenvalue experiment:** If $W_{hh}$ has max eigenvalue 0.95, what fraction of the gradient remains after 50 timesteps?

   **Answer:** $0.95^{50} \approx 0.077$ — only 7.7% of the gradient survives. After 100 steps: $0.95^{100} \approx 0.006$ — essentially gone.

2. **Forget gate analysis:** What happens if $f_t = 0$ for all timesteps? What if $f_t = 1$?

   **Answer:** $f_t = 0$: the cell state resets each step (like a vanilla RNN). $f_t = 1$: the cell state accumulates forever (perfect memory).

3. **GRU as LSTM:** Show that GRU is roughly equivalent to LSTM with: forget and input gates tied ($i_t = 1 - f_t$), no output gate ($o_t = 1$), and cell state = hidden state.

4. **Parameter count:** For input_size=256 and hidden_size=512, how many parameters does an LSTM cell have vs a GRU cell?

   **Answer:** LSTM: $4 \times (512 \times (256 + 512) + 512) = 1{,}574{,}912$. GRU: $3 \times (512 \times (256 + 512) + 512) = 1{,}181{,}184$. GRU is ~25% smaller.
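The parameter-count exercise can be checked mechanically, assuming the standard parameterization (one weight matrix over $[h_{t-1}, x_t]$ plus one bias vector per gate or candidate):

```python
def gated_rnn_params(input_size, hidden_size, n_gates):
    # Each gate/candidate: a weight matrix over [h_prev, x_t] plus a bias
    per_gate = hidden_size * (input_size + hidden_size) + hidden_size
    return n_gates * per_gate

lstm = gated_rnn_params(256, 512, n_gates=4)   # f, i, c-tilde, o
gru = gated_rnn_params(256, 512, n_gates=3)    # r, z, h-tilde
print(lstm, gru, f"{1 - gru / lstm:.0%}")      # 1574912 1181184 25%
```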
Next Steps
With RNN architectures complete, we turn to how tokens enter the network. One-hot vectors are wasteful — learned embeddings are better. Next: Embeddings →