Backward Pass & Backpropagation Through Time
In the forward pass, we computed predictions and measured our error. Now we need to answer: how should we adjust each weight to reduce that error? This is where the backward pass comes in — computing gradients that tell us the direction and magnitude of change for every parameter.
Why Gradients Matter
A gradient tells us: “If I nudge a weight $w$ by a tiny amount, how much does the loss change?” The sign tells us the direction; the magnitude tells us the sensitivity.
Training is just repeated gradient descent:
$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$$

where $\eta$ is the learning rate.
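The update rule can be seen in action on a toy one-dimensional problem. A minimal sketch, assuming an illustrative loss $L(w) = (w - 3)^2$ and a learning rate of 0.1 (both are illustration choices, not from the text):

```python
# Gradient descent on the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3.
eta = 0.1   # learning rate
w = 0.0     # initial weight

for _ in range(100):
    grad = 2 * (w - 3)   # dL/dw for this toy loss
    w -= eta * grad      # the update rule: w <- w - eta * dL/dw

print(w)  # approaches 3.0
```

Each step moves `w` against the gradient; after 100 steps it has essentially converged to the minimum.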
The Challenge: Shared Weights
In a feedforward network, each layer has independent weights. In an RNN, the same weights are reused at every timestep:
```
Feedforward: x → W₁ → W₂ → W₃ → y   (3 separate weight matrices)
RNN:         x₀ → W → h₁ → W → h₂ → W → h₃   (same W reused)
```
Consequence: When computing $\frac{\partial L}{\partial W}$, we must account for all uses of $W$ across time — the total gradient sums contributions from every timestep:

$$\frac{\partial L}{\partial W} = \sum_{t=0}^{T-1} \left.\frac{\partial L}{\partial W}\right|_{t}$$
This is what makes RNN backpropagation special — and why it’s called Backpropagation Through Time (BPTT).
The Chain Rule: Our Main Tool
The backward pass is built entirely on the chain rule. If $L$ depends on $y$, which depends on $h$, which depends on $W$:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial W}$$
We compute these pieces one at a time, starting from the loss and working backward — hence “backward pass.”
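The chain rule can be sanity-checked numerically on a toy composition. A small sketch (the functions and values here are illustrative, not part of the RNN):

```python
import math

# Toy composition: L = (y - 1)^2, y = tanh(h), h = w * x.
w, x = 0.5, 2.0
h = w * x
y = math.tanh(h)

# Chain rule: dL/dw = dL/dy * dy/dh * dh/dw
dL_dy = 2 * (y - 1)
dy_dh = 1 - y ** 2        # tanh'(h) = 1 - tanh(h)^2
dh_dw = x
analytic = dL_dy * dy_dh * dh_dw

# Finite-difference approximation of the same derivative
eps = 1e-6
loss = lambda w_: (math.tanh(w_ * x) - 1) ** 2
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(abs(analytic - numeric))  # tiny: the two agree
```

The analytic product of the three local derivatives matches the finite-difference estimate, which is exactly the check we will apply to the full RNN later.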
Step-by-Step Gradient Derivation
Recap: Forward Equations

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = W_{hy} h_t + b_y$$
$$p_t = \operatorname{softmax}(y_t), \qquad L_t = -\log p_t[\text{target}_t]$$
Step 1: Gradient of Loss w.r.t. Output (Softmax + Cross-Entropy)
The combined gradient of cross-entropy loss through softmax has a beautifully simple form:

$$\frac{\partial L_t}{\partial y_t} = p_t - \mathbf{1}_{\text{target}}$$

where $\mathbf{1}_{\text{target}}$ is a one-hot vector with 1 at the target index.
Example: If the target is index 2 and $p = [0.2, 0.3, 0.4, 0.1]$ (illustrative values):

$$\frac{\partial L}{\partial y} = p - \mathbf{1}_2 = [0.2, \; 0.3, \; -0.6, \; 0.1]$$

The negative value at index 2 says: “increase this logit to make the correct answer more likely.”
This follows from differentiating $L = -\log p[k]$ where $p = \operatorname{softmax}(y)$ and $k$ is the target index:

$$\frac{\partial L}{\partial y_j} = p_j - \mathbb{1}[j = k]$$
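This simple form is easy to verify numerically. A small sketch comparing $p - \mathbf{1}_{\text{target}}$ against finite differences of the loss (the logit values are arbitrary illustration):

```python
import numpy as np

y = np.array([1.0, 2.0, 0.5, -1.0])   # illustrative logits
target = 2

def softmax(z):
    e = np.exp(z - z.max())           # shift for numerical stability
    return e / e.sum()

def loss(z):
    return -np.log(softmax(z)[target])

# Analytic gradient: p - one_hot(target)
p = softmax(y)
analytic = p.copy()
analytic[target] -= 1.0

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(y)
for i in range(len(y)):
    yp, ym = y.copy(), y.copy()
    yp[i] += eps
    ym[i] -= eps
    numeric[i] = (loss(yp) - loss(ym)) / (2 * eps)

print(np.abs(analytic - numeric).max())  # agreement to numerical precision
```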
Step 2: Output Layer Gradients
Since $y_t = W_{hy} h_t + b_y$, applying the chain rule:

$$\frac{\partial L}{\partial W_{hy}} \mathrel{+}= \frac{\partial L_t}{\partial y_t} \, h_t^\top, \qquad \frac{\partial L}{\partial b_y} \mathrel{+}= \frac{\partial L_t}{\partial y_t}$$
Step 3: Gradient Flowing into the Hidden State
The gradient reaches $h_t$ from the output layer: $W_{hy}^\top \frac{\partial L_t}{\partial y_t}$.
But here’s the crucial part — in an RNN, $h_t$ also influences $h_{t+1}$, which influences $h_{t+2}$, and so on. So the total gradient at $h_t$ includes contributions from all future timesteps:

$$\frac{\partial L}{\partial h_t} = W_{hy}^\top \frac{\partial L_t}{\partial y_t} + dh_{\text{next}}$$

where $dh_{\text{next}}$ is the gradient flowing back from timestep $t+1$. This is the “through time” in BPTT.
Step 4: Backpropagating Through tanh
Recall that $h_t = \tanh(h_t^{\text{raw}})$ where $h_t^{\text{raw}} = W_{xh} x_t + W_{hh} h_{t-1} + b_h$.

The derivative of tanh is:

$$\frac{d}{dz} \tanh(z) = 1 - \tanh^2(z)$$

So the gradient through tanh is:

$$\frac{\partial L}{\partial h_t^{\text{raw}}} = (1 - h_t^2) \odot \frac{\partial L}{\partial h_t}$$

Key insight: When $|h_t| \approx 1$ (saturated), $1 - h_t^2 \approx 0$ and the gradient vanishes at that neuron. When $h_t \approx 0$, the gradient passes through freely.
| $h_t$ value | $1 - h_t^2$ | Gradient effect |
|---|---|---|
| 0.0 | 1.0 | Passes through fully |
| ±0.5 | 0.75 | Mild reduction |
| ±0.9 | 0.19 | 81% lost |
| ±0.99 | 0.02 | 98% lost (nearly killed) |
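The multipliers in the table above fall directly out of the derivative formula. A tiny check using the same activation values:

```python
# Gradient multiplier 1 - h^2 for the activation values in the table.
multipliers = {h: 1 - h ** 2 for h in (0.0, 0.5, 0.9, 0.99)}
for h, m in multipliers.items():
    print(f"h = {h:5}: gradient scaled by {m:.4f}")
```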
Step 5: Weight Gradients for the Hidden Layer
Now that we have $\frac{\partial L}{\partial h_t^{\text{raw}}}$, we can compute gradients for all hidden-layer parameters:

$$\frac{\partial L}{\partial W_{xh}} \mathrel{+}= \frac{\partial L}{\partial h_t^{\text{raw}}} \, x_t^\top, \qquad \frac{\partial L}{\partial W_{hh}} \mathrel{+}= \frac{\partial L}{\partial h_t^{\text{raw}}} \, h_{t-1}^\top, \qquad \frac{\partial L}{\partial b_h} \mathrel{+}= \frac{\partial L}{\partial h_t^{\text{raw}}}$$
Critical: These are accumulated with += across all timesteps because the same weights are shared. The total gradient is the sum of contributions from every step.
Step 6: Pass Gradient to Previous Timestep
Finally, we send the gradient backward to $h_{t-1}$ so the chain can continue:

$$dh_{\text{next}} = W_{hh}^\top \frac{\partial L}{\partial h_t^{\text{raw}}}$$

This becomes the $dh_{\text{next}}$ term in Step 3 for the previous timestep, completing the loop.
Visualizing the Gradient Flow
Here’s how gradients flow through a 3-step RNN:
```
   L₀         L₁         L₂         (losses)
   ↑          ↑          ↑
  dy₀        dy₁        dy₂        (Step 1: softmax gradient)
   ↑          ↑          ↑
dWhy,dby   dWhy,dby   dWhy,dby     (Step 2: output weights — accumulated)
   ↑          ↑          ↑
  dh₀   ←    dh₁   ←    dh₂        (Step 3: hidden gradient — two sources)
   ↑          ↑          ↑
dh_raw₀    dh_raw₁    dh_raw₂      (Step 4: through tanh)
   ↑          ↑          ↑
   dWxh,dWhh,dbh (all accumulated) (Step 5: hidden weights)
   ↑          ↑
dh_next₀   dh_next₁                (Step 6: backward chain)
```
The arrows pointing left (←) show the “through time” gradient chain — this is what makes BPTT special and what causes the vanishing/exploding gradient problem for long sequences.
Quick Summary
| Step | What | Math |
|---|---|---|
| 1 | Output gradient | $dy_t = p_t - \mathbf{1}_{\text{target}}$ |
| 2 | Output weights | $dW_{hy} \mathrel{+}= dy_t \, h_t^\top$, $\; db_y \mathrel{+}= dy_t$ |
| 3 | Hidden gradient | $dh_t = W_{hy}^\top dy_t + dh_{\text{next}}$ |
| 4 | Through tanh | $dh_t^{\text{raw}} = (1 - h_t^2) \odot dh_t$ |
| 5 | Hidden weights | $dW_{xh} \mathrel{+}= dh_t^{\text{raw}} x_t^\top$, $\; dW_{hh} \mathrel{+}= dh_t^{\text{raw}} h_{t-1}^\top$, $\; db_h \mathrel{+}= dh_t^{\text{raw}}$ |
| 6 | Time chain | $dh_{\text{next}} = W_{hh}^\top dh_t^{\text{raw}}$ |
Gradient Clipping: Taming Explosions
When sequences are long, the repeated multiplication by $W_{hh}^\top$ can cause gradients to grow exponentially. If the largest eigenvalue of $W_{hh}$ is $\lambda$, the gradient norm scales roughly as:

$$\|\text{gradient}\| \sim \lambda^T$$

For example, with $\lambda = 1.2$ and $T = 50$: $1.2^{50} \approx 9{,}100$ — the gradient explodes!
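The exponential growth (and the mirror-image decay when $\lambda < 1$) is easy to see numerically; $\lambda = 0.8$ and $\lambda = 1.2$ here are illustrative values:

```python
# Approximate gradient scale after T backward steps: lambda ** T.
for lam in (0.8, 1.2):
    scales = {T: lam ** T for T in (10, 50)}
    print(f"lambda = {lam}: {scales}")
```

Even a modest eigenvalue like 1.2 blows up over 50 steps, while 0.8 shrinks the gradient to nearly nothing — the two failure modes BPTT must contend with.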
Norm clipping scales the entire gradient vector if its norm exceeds a threshold $\theta$. This preserves the gradient’s direction while bounding its magnitude:

$$g \leftarrow g \cdot \frac{\theta}{\|g\|} \quad \text{if } \|g\| > \theta$$
An alternative is value clipping (clamp each element to $[-\theta, \theta]$) — simpler but doesn’t preserve gradient direction.
Verifying with Numerical Gradients
Backprop bugs are notoriously hard to spot — the code runs, loss decreases (slowly), but gradients are subtly wrong. Always verify with finite differences. Compare your analytical gradient against this numerical approximation:

$$\frac{\partial L}{\partial w} \approx \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon}$$
| Relative Error | Diagnosis |
|---|---|
| < 1e-7 | Perfect ✅ |
| 1e-7 to 1e-5 | Good (numerical precision limit) |
| 1e-5 to 1e-3 | Suspicious — check implementation ⚠️ |
| > 1e-3 | Bug in backprop ❌ |
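A helper for the relative error in the table might look like this. The exact formula isn't given in the text; the symmetric form below is one common convention:

```python
import numpy as np

def relative_error(analytic, numeric, tiny=1e-12):
    """Max elementwise relative error between two gradient arrays."""
    num = np.abs(analytic - numeric)
    den = np.maximum(np.abs(analytic) + np.abs(numeric), tiny)
    return (num / den).max()

a = np.array([0.5, -1.0, 2.0])
print(relative_error(a, a))            # identical gradients -> 0.0
print(relative_error(a, a * 1.001))    # small mismatch -> small error
```

Dividing by the sum of magnitudes (floored by `tiny`) keeps the metric stable when both gradients are near zero.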
Common Mistakes
| Mistake | Wrong | Correct |
|---|---|---|
| Overwriting gradients | `dWxh = np.outer(dh_raw, xs[t])` | `dWxh += np.outer(dh_raw, xs[t])` |
| Wrong transpose | `dh = Why @ dy` | `dh = Why.T @ dy` |
| Wrong hidden state | `np.outer(dh_raw, hs[t])` for `dWhh` | `np.outer(dh_raw, hs[t-1])` |
| Forgetting dh_next | `dh = Why.T @ dy` | `dh = Why.T @ dy + dh_next` |
| Not clipping | Direct update | Clip norm before update |
A powerful debugging technique — if the matrix shapes don’t match, the gradient is wrong:
| Parameter | Shape | Gradient Computed As |
|---|---|---|
| $W_{xh}$ | $(H, V)$ | `np.outer(dh_raw, xs[t])` — outer product of $(H,)$ and $(V,)$ |
| $W_{hh}$ | $(H, H)$ | `np.outer(dh_raw, h_prev)` — outer product of $(H,)$ and $(H,)$ |
| $W_{hy}$ | $(V, H)$ | `np.outer(dy, hs[t])` — outer product of $(V,)$ and $(H,)$ |
| $b_h$ | $(H,)$ | `dh_raw` directly |
| $b_y$ | $(V,)$ | `dy` directly |
Where $H$ = hidden size, $V$ = vocabulary size.
Rule of thumb: The gradient of a weight matrix used as $y = Wx$ is the outer product of the upstream gradient and the input: $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \, x^\top$.
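This rule makes shape checking mechanical. A quick sketch with illustrative sizes (`H = 4`, `V = 6` are assumptions for the example):

```python
import numpy as np

H, V = 4, 6                       # hidden and vocab sizes (illustrative)
rng = np.random.default_rng(0)

Wxh = rng.normal(size=(H, V))     # used as: contribution = Wxh @ x
x = rng.normal(size=V)
dh_raw = rng.normal(size=H)       # upstream gradient arriving at Wxh @ x

dWxh = np.outer(dh_raw, x)        # upstream gradient (outer) input
assert dWxh.shape == Wxh.shape    # gradient shape must equal parameter shape
print(dWxh.shape)
```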
```python
def backward_pass(targets, cache):
    """Compute gradients via Backpropagation Through Time."""
    xs, hs, ps = cache['xs'], cache['hs'], cache['ps']
    T = len(targets)
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dh_next = np.zeros_like(hs[0])
    for t in reversed(range(T)):
        dy = ps[t].copy()
        dy[targets[t]] -= 1                       # Step 1
        dWhy += np.outer(dy, hs[t]); dby += dy    # Step 2
        dh = Why.T @ dy + dh_next                 # Step 3
        dh_raw = (1 - hs[t] ** 2) * dh            # Step 4
        h_prev = hs[t-1] if t > 0 else np.zeros_like(hs[0])
        dWxh += np.outer(dh_raw, xs[t])           # Step 5
        dWhh += np.outer(dh_raw, h_prev)
        dbh += dh_raw
        dh_next = Whh.T @ dh_raw                  # Step 6
    return dWxh, dWhh, dWhy, dbh, dby
```
```python
def clip_gradients(grads, max_norm=5.0):
    """Scale gradients if their total norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return grads
```
```python
def numerical_gradient(param, idx, loss_fn, eps=1e-5):
    """Verify analytical gradients with finite differences."""
    old_val = param.flat[idx]
    param.flat[idx] = old_val + eps; loss_plus = loss_fn()
    param.flat[idx] = old_val - eps; loss_minus = loss_fn()
    param.flat[idx] = old_val
    return (loss_plus - loss_minus) / (2 * eps)
```

Computational Complexity
For sequence length $T$, hidden size $H$, vocabulary size $V$:
| | Time | Memory |
|---|---|---|
| Forward pass | $O(T(H^2 + HV))$ | $O(T(H + V))$ |
| Backward pass | $O(T(H^2 + HV))$ | Same (reuses forward cache) |
BPTT has the same time complexity as the forward pass. Memory scales linearly with sequence length because we store all hidden states for backprop.
Exercises

- Gradient accumulation: In a 3-step sequence with $h_{-1} = \mathbf{0}$ (zero initial state), why does timestep 0 contribute nothing to $dW_{hh}$?

  Answer: Because $h_{\text{prev}} = \mathbf{0}$ at $t = 0$, so the outer product $dh_0^{\text{raw}} \, h_{-1}^\top$ is all zeros.

- Gradient depth: In a 5-step sequence, how many times does $W_{hh}^\top$ appear in the gradient path from $L_4$ to $h_0$?

  Answer: 4 times — one multiplication by $W_{hh}^\top$ per step backward. This is why gradients vanish/explode over long sequences.

- Implement and verify: Write BPTT and verify with numerical gradient checking (error < 1e-5 for all parameters).
Next Steps
You now understand how gradients flow backward through an RNN. But there’s a fundamental problem: for long sequences, the repeated multiplication by $W_{hh}^\top$ causes gradients to either vanish (most common) or explode. Next up: Vanishing Gradients & Gated RNNs →, where we’ll see why this happens and the architectures that fix it.