Backward Pass & Backpropagation Through Time

In the forward pass, we computed predictions and measured our error. Now we need to answer: how should we adjust each weight to reduce that error? This is where the backward pass comes in — computing gradients that tell us the direction and magnitude of change for every parameter.

Why Gradients Matter

A gradient \frac{\partial L}{\partial W} tells us: “If I nudge W by a tiny amount, how much does the loss change?” The sign tells us the direction, the magnitude tells us the sensitivity.

Training is just repeated gradient descent:

W \leftarrow W - \eta \cdot \frac{\partial L}{\partial W}

where \eta is the learning rate.
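As a toy sketch of this update rule (a hypothetical scalar loss, not the RNN itself), repeated gradient-descent steps walk W toward the minimum:

```python
# Toy loss L(W) = (W - 3)^2 with known gradient dL/dW = 2(W - 3)
W = 0.0
eta = 0.1  # learning rate
for _ in range(100):
    grad = 2 * (W - 3.0)
    W = W - eta * grad  # W <- W - eta * dL/dW
# W approaches the minimum at W = 3
```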

The Challenge: Shared Weights

In a feedforward network, each layer has independent weights. In an RNN, the same weights are reused at every timestep:

Feedforward:  x → W₁ → W₂ → W₃ → y    (3 separate weight matrices)
RNN:          x₀ → W → h₁ → W → h₂ → W → h₃    (same W reused)

Consequence: When computing \frac{\partial L}{\partial W}, we must account for all uses of W across time — the total gradient sums contributions from every timestep:

\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W_{hh}}

This is what makes RNN backpropagation special — and why it’s called Backpropagation Through Time (BPTT).

The Chain Rule: Our Main Tool

The backward pass is built entirely on the chain rule. If L depends on y, which depends on h, which depends on W:

\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial W}

We compute these pieces one at a time, starting from the loss and working backward — hence “backward pass.”
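A scalar sanity check of this factorization (hypothetical numbers, verified against a finite difference):

```python
# L = y^2, y = 3*h, h = 2*w  =>  L(w) = (6w)^2 = 36 w^2
w = 0.5
h = 2.0 * w
y = 3.0 * h

# Chain rule: dL/dw = (dL/dy) * (dy/dh) * (dh/dw) = (2y) * 3 * 2
dL_dw = (2 * y) * 3.0 * 2.0

# Numerical check with a central difference
eps = 1e-6
loss = lambda w: (3.0 * 2.0 * w) ** 2
numerical = (loss(w + eps) - loss(w - eps)) / (2 * eps)
# dL_dw and numerical agree (both ~36 at w = 0.5)
```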

Step-by-Step Gradient Derivation

Recap: Forward Equations

h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h)

y_t = W_{hy} \cdot h_t + b_y

p_t = \text{softmax}(y_t)

L_t = -\log(p_t[\text{target}_t])

Step 1: Gradient of Loss w.r.t. Output (Softmax + Cross-Entropy)

The combined gradient of cross-entropy loss through softmax has a beautifully simple form:

\frac{\partial L_t}{\partial y_t} = p_t - \mathbf{1}_{\text{target}}

where \mathbf{1}_{\text{target}} is a one-hot vector with 1 at the target index.

Example: If the target is index 2 and p_t = [0.1, 0.2, 0.6, 0.1]:

dy_t = [0.1, 0.2, 0.6 - 1.0, 0.1] = [0.1, 0.2, -0.4, 0.1]

The negative value at index 2 says: “increase this logit to make the correct answer more likely.”

For L = -\log(p_{\text{target}}) where p_i = \frac{e^{y_i}}{\sum_j e^{y_j}}:

\frac{\partial L}{\partial y_i} = -\mathbb{1}_{i=\text{target}} + \frac{e^{y_i}}{\sum_j e^{y_j}} = p_i - \mathbb{1}_{i=\text{target}}

🤔 Quick Check
If the model predicts the correct character with probability 0.95, what can you say about the gradient dy at the target index?
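The text's example in code, a minimal sketch of the p_t minus one-hot gradient:

```python
import numpy as np

p_t = np.array([0.1, 0.2, 0.6, 0.1])  # softmax probabilities from the example
target = 2

dy = p_t.copy()
dy[target] -= 1.0  # dy = p_t - one_hot(target)
# dy = [0.1, 0.2, -0.4, 0.1]; the entries always sum to zero,
# since the probabilities sum to 1 and we subtract exactly 1
```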

Step 2: Output Layer Gradients

Since y_t = W_{hy} \cdot h_t + b_y, applying the chain rule:

\frac{\partial L_t}{\partial W_{hy}} = dy_t \cdot h_t^T, \quad \frac{\partial L_t}{\partial b_y} = dy_t
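In NumPy these two gradients are one outer product and a pass-through (a sketch with hypothetical sizes H = 4, V = 6):

```python
import numpy as np

H, V = 4, 6                  # hypothetical hidden and vocab sizes
rng = np.random.default_rng(0)
dy = rng.standard_normal(V)  # gradient at the logits y_t
h_t = rng.standard_normal(H)

dWhy = np.outer(dy, h_t)  # dL_t/dWhy = dy_t * h_t^T, shape (V, H)
dby = dy                  # dL_t/dby  = dy_t, shape (V,)
```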

Step 3: Gradient Flowing into the Hidden State

The gradient reaches h_t from the output layer: dh_t = W_{hy}^T \cdot dy_t.

But here’s the crucial part — in an RNN, h_t also influences h_{t+1}, which influences h_{t+2}, and so on. So the total gradient at h_t includes contributions from all future timesteps:

dh_t^{\text{total}} = W_{hy}^T \cdot dy_t + dh_{\text{next}}

where dh_{\text{next}} is the gradient flowing back from timestep t+1. This is the “through time” in BPTT.
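A small sketch of the two-source sum (hypothetical shapes; the point is that the two gradients simply add):

```python
import numpy as np

H, V = 4, 6
rng = np.random.default_rng(0)
Why = rng.standard_normal((V, H))
dy_t = rng.standard_normal(V)     # gradient from this step's output
dh_next = rng.standard_normal(H)  # gradient arriving from timestep t+1

dh_total = Why.T @ dy_t + dh_next  # both sources summed elementwise
```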

✍️ Fill in the Blanks
The hidden state gradient has two sources: the output layer at the current timestep, and the gradient from the next timestep flowing backward through time.

Step 4: Backpropagating Through tanh

Recall that h_t = \tanh(z_t) where z_t = W_{xh} x_t + W_{hh} h_{t-1} + b_h.

The derivative of tanh is:

\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z) = 1 - h_t^2

So the gradient through tanh is:

dz_t = (1 - h_t^2) \odot dh_t

Key insight: When h_t \approx \pm 1 (saturated), (1 - h_t^2) \approx 0 and the gradient vanishes at that neuron. When h_t \approx 0, the gradient passes through freely.

| h_t value | 1 - h_t^2 | Gradient effect |
|---|---|---|
| 0.0 | 1.0 | Passes through fully |
| ±0.5 | 0.75 | Mild reduction |
| ±0.9 | 0.19 | 81% lost |
| ±0.99 | 0.02 | 98% lost (nearly killed) |
🤔 Quick Check
If a hidden unit has value h_t = 0.99, what happens to the gradient flowing through it?
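The table's attenuation factors can be reproduced directly from the derivative formula:

```python
# Local gradient of tanh, expressed through its own output h: 1 - h^2
factors = {h: 1 - h ** 2 for h in [0.0, 0.5, 0.9, 0.99]}
# {0.0: 1.0, 0.5: 0.75, 0.9: ~0.19, 0.99: ~0.02}
```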

Step 5: Weight Gradients for the Hidden Layer

Now that we have dz_t, we can compute gradients for all hidden-layer parameters:

\frac{\partial L_t}{\partial W_{xh}} = dz_t \cdot x_t^T, \quad \frac{\partial L_t}{\partial W_{hh}} = dz_t \cdot h_{t-1}^T, \quad \frac{\partial L_t}{\partial b_h} = dz_t

Critical: These are accumulated with += across all timesteps because the same weights are shared. The total gradient is the sum of contributions from every step.

Step 6: Pass Gradient to Previous Timestep

Finally, we send the gradient backward to h_{t-1} so the chain can continue:

dh_{\text{next}} = W_{hh}^T \cdot dz_t

This becomes the dh_{\text{next}} term in Step 3 for the previous timestep, completing the loop.

Visualizing the Gradient Flow

Here’s how gradients flow through a 3-step RNN:

        L₀          L₁          L₂        (losses)
        ↑           ↑           ↑
       dy₀         dy₁         dy₂        (Step 1: softmax gradient)
        ↑           ↑           ↑
  dWhy,dby    dWhy,dby    dWhy,dby        (Step 2: output weights — accumulated)
        ↑           ↑           ↑
       dh₀    ←    dh₁    ←    dh₂        (Step 3: hidden gradient — two sources)
        ↑           ↑           ↑
     dh_raw₀     dh_raw₁     dh_raw₂     (Step 4: through tanh)
        ↑           ↑           ↑
  dWxh,dWhh,dbh  (all accumulated)         (Step 5: hidden weights)
        ↑           ↑
    dh_next₀    dh_next₁                  (Step 6: backward chain)

The arrows pointing left (←) show the “through time” gradient chain — this is what makes BPTT special and what causes the vanishing/exploding gradient problem for long sequences.

Quick Summary

| Step | What | Math |
|---|---|---|
| 1 | Output gradient | dy = p_t - \mathbf{1}_{\text{target}} |
| 2 | Output weights | dW_{hy} \mathrel{+}= dy \cdot h_t^T |
| 3 | Hidden gradient | dh = W_{hy}^T dy + dh_{\text{next}} |
| 4 | Through tanh | dz = (1 - h_t^2) \odot dh |
| 5 | Hidden weights | dW_{xh} \mathrel{+}= dz \cdot x_t^T |
| 6 | Time chain | dh_{\text{next}} = W_{hh}^T \cdot dz |

Gradient Clipping: Taming Explosions

When sequences are long, the repeated multiplication by W_{hh}^T can cause gradients to grow exponentially. If the largest eigenvalue of W_{hh} is \lambda > 1:

\|dh_1\| \approx \lambda^T \cdot \|dh_T\|

For \lambda = 1.1 and T = 100: 1.1^{100} \approx 13{,}780 — the gradient explodes!
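This is easy to simulate with a hypothetical diagonal W_{hh}: an eigenvalue above 1 explodes the gradient, one below 1 vanishes it:

```python
import numpy as np

Whh = np.diag([1.1, 0.5])  # eigenvalues straddling 1
dh = np.ones(2)
for _ in range(100):       # 100 backward steps: dh <- Whh^T @ dh
    dh = Whh.T @ dh
# dh[0] has exploded (~1.1**100, about 13,780); dh[1] has vanished (~0.5**100)
```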

Norm clipping scales the entire gradient vector if its norm exceeds a threshold. This preserves the gradient’s direction while bounding its magnitude:

gclipped={gif gmax_normgmax_normgotherwiseg_{\text{clipped}} = \begin{cases} g & \text{if } \|g\| \leq \text{max\_norm} \\ g \cdot \frac{\text{max\_norm}}{\|g\|} & \text{otherwise} \end{cases}

An alternative is value clipping (clamp each element to [-5, 5]) — simpler but doesn’t preserve gradient direction.
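A sketch contrasting the two (the key difference is whether the direction survives):

```python
import numpy as np

g = np.array([8.0, 0.5])
max_norm = 5.0

# Norm clipping: rescale the whole vector; direction is preserved
norm = np.linalg.norm(g)
g_norm_clipped = g * (max_norm / norm) if norm > max_norm else g

# Value clipping: clamp each element independently; direction can change
g_value_clipped = np.clip(g, -5.0, 5.0)

unit = lambda v: v / np.linalg.norm(v)
# unit(g_norm_clipped) matches unit(g); unit(g_value_clipped) does not
```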

Verifying with Numerical Gradients

Backprop bugs are notoriously hard to spot — the code runs, loss decreases (slowly), but gradients are subtly wrong. Always verify with finite differences:

LwL(w+ϵ)L(wϵ)2ϵ\frac{\partial L}{\partial w} \approx \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon}

Compare your analytical gradient against this numerical approximation:

| Relative Error | Diagnosis |
|---|---|
| < 1e-7 | Perfect ✅ |
| 1e-7 to 1e-5 | Good (numerical precision limit) |
| 1e-5 to 1e-3 | Suspicious — check implementation ⚠️ |
| > 1e-3 | Bug in backprop ❌ |
🤔 Quick Check
Why do we use (L(w+ε) - L(w-ε)) / 2ε instead of (L(w+ε) - L(w)) / ε for gradient checking?
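One way to see the answer empirically: compare both formulas on a function with a known derivative (sin, whose derivative is cos). The central difference has O(ε²) error versus O(ε) for the one-sided version:

```python
import numpy as np

f, x, eps = np.sin, 1.0, 1e-4
true_grad = np.cos(x)

forward_diff = (f(x + eps) - f(x)) / eps              # error ~ O(eps)
central_diff = (f(x + eps) - f(x - eps)) / (2 * eps)  # error ~ O(eps^2)
# central_diff is orders of magnitude closer to true_grad
```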

Common Mistakes

| Mistake | Wrong | Correct |
|---|---|---|
| Overwriting gradients | dW_{xh} = \ldots | dW_{xh} \mathrel{+}= \ldots |
| Wrong transpose | W_{hy} \cdot dy | W_{hy}^T \cdot dy |
| Wrong hidden state | dz \cdot h_t^T | dz \cdot h_{t-1}^T |
| Forgetting dh_next | dh = W_{hy}^T dy | dh = W_{hy}^T dy + dh_{\text{next}} |
| Not clipping | Direct update | Clip norm before update |

A powerful debugging technique — if the matrix shapes don’t match, the gradient is wrong:

| Parameter | Shape | Gradient Computed As |
|---|---|---|
| W_{xh} | (H, V) | dz_t \cdot x_t^T — outer product of (H,) and (V,) |
| W_{hh} | (H, H) | dz_t \cdot h_{t-1}^T — outer product of (H,) and (H,) |
| W_{hy} | (V, H) | dy_t \cdot h_t^T — outer product of (V,) and (H,) |
| b_h | (H,) | dz_t directly |
| b_y | (V,) | dy_t directly |

Where H = hidden size, V = vocabulary size.

Rule of thumb: The gradient of a weight matrix W used as W \cdot a is the outer product of the upstream gradient and a.
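The shape table doubles as executable assertions (hypothetical sizes H = 5, V = 8; only the shapes matter here):

```python
import numpy as np

H, V = 5, 8  # hypothetical hidden and vocab sizes
Wxh, Whh, Why = np.zeros((H, V)), np.zeros((H, H)), np.zeros((V, H))
dz = np.ones(H)  # stands in for dz_t
dy = np.ones(V)  # stands in for dy_t
x = np.ones(V)   # one-hot input
h_prev, h = np.ones(H), np.ones(H)

assert np.outer(dz, x).shape == Wxh.shape       # (H, V)
assert np.outer(dz, h_prev).shape == Whh.shape  # (H, H)
assert np.outer(dy, h).shape == Why.shape       # (V, H)
```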

def backward_pass(targets, cache):
    """Compute gradients via Backpropagation Through Time.

    Assumes `import numpy as np` and module-level parameters
    Wxh, Whh, Why, bh, by (the same arrays used in the forward pass).
    """
    xs, hs, ps = cache['xs'], cache['hs'], cache['ps']
    T = len(targets)

    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dh_next = np.zeros_like(hs[0])

    for t in reversed(range(T)):
        dy = ps[t].copy()
        dy[targets[t]] -= 1                          # Step 1

        dWhy += np.outer(dy, hs[t]); dby += dy       # Step 2
        dh = Why.T @ dy + dh_next                     # Step 3
        dh_raw = (1 - hs[t] ** 2) * dh                # Step 4

        h_prev = hs[t-1] if t > 0 else np.zeros_like(hs[0])
        dWxh += np.outer(dh_raw, xs[t])               # Step 5
        dWhh += np.outer(dh_raw, h_prev)
        dbh  += dh_raw

        dh_next = Whh.T @ dh_raw                      # Step 6

    return dWxh, dWhh, dWhy, dbh, dby

def clip_gradients(grads, max_norm=5.0):
    """Scale gradients if their total norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return grads

def numerical_gradient(param, idx, loss_fn, eps=1e-5):
    """Verify analytical gradients with finite differences."""
    old_val = param.flat[idx]
    param.flat[idx] = old_val + eps; loss_plus = loss_fn()
    param.flat[idx] = old_val - eps; loss_minus = loss_fn()
    param.flat[idx] = old_val
    return (loss_plus - loss_minus) / (2 * eps)
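Putting the pieces together, exercise 3 below can be sketched end to end. This is a self-contained version with tiny hypothetical sizes, its own forward pass, and an inlined BPTT (mirroring Steps 1–6) so the block runs on its own:

```python
import numpy as np

np.random.seed(0)
H, V = 4, 5  # tiny hypothetical sizes so the check is fast
Wxh = np.random.randn(H, V) * 0.1
Whh = np.random.randn(H, H) * 0.1
Why = np.random.randn(V, H) * 0.1
bh, by = np.zeros(H), np.zeros(V)
inputs, targets = [0, 2, 4], [2, 4, 1]

def forward(inputs, targets):
    """Forward pass; returns loss and the cache needed by BPTT."""
    xs, hs, ps = {}, {-1: np.zeros(H)}, {}
    loss = 0.0
    for t in range(len(inputs)):
        xs[t] = np.zeros(V); xs[t][inputs[t]] = 1.0
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)
        y = Why @ hs[t] + by
        e = np.exp(y - y.max())
        ps[t] = e / e.sum()
        loss += -np.log(ps[t][targets[t]])
    return loss, {'xs': xs, 'hs': hs, 'ps': ps}

def backward(targets, cache):
    """BPTT, exactly Steps 1-6 from the derivation."""
    xs, hs, ps = cache['xs'], cache['hs'], cache['ps']
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dh_next = np.zeros(H)
    for t in reversed(range(len(targets))):
        dy = ps[t].copy(); dy[targets[t]] -= 1  # Step 1
        dWhy += np.outer(dy, hs[t]); dby += dy  # Step 2
        dh = Why.T @ dy + dh_next               # Step 3
        dz = (1 - hs[t] ** 2) * dh              # Step 4
        dWxh += np.outer(dz, xs[t])             # Step 5
        dWhh += np.outer(dz, hs[t - 1]); dbh += dz
        dh_next = Whh.T @ dz                    # Step 6
    return dWxh, dWhh, dWhy, dbh, dby

loss, cache = forward(inputs, targets)
grads = backward(targets, cache)

# Check a few entries of every gradient against central differences
eps = 1e-5
for param, grad in zip([Wxh, Whh, Why, bh, by], grads):
    for idx in range(min(3, param.size)):
        old = param.flat[idx]
        param.flat[idx] = old + eps; loss_plus = forward(inputs, targets)[0]
        param.flat[idx] = old - eps; loss_minus = forward(inputs, targets)[0]
        param.flat[idx] = old
        numerical = (loss_plus - loss_minus) / (2 * eps)
        assert abs(grad.flat[idx] - numerical) < 1e-6
```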

Computational Complexity

For sequence length T, hidden size H, vocabulary size V:

| | Time | Memory |
|---|---|---|
| Forward pass | O(T \cdot H \cdot (H + V)) | O(T \cdot (H + V)) |
| Backward pass | O(T \cdot H \cdot (H + V)) | Same (reuses forward cache) |

The H \cdot H term comes from W_{hh} \cdot h_{t-1} and the H \cdot V terms from the input and output projections.

BPTT has the same time complexity as the forward pass. Memory scales linearly with sequence length because we store all hidden states for backprop.

Exercises

  1. Gradient accumulation: In a 3-step sequence with h_{-1} = 0, why does t=0 contribute nothing to \frac{\partial L}{\partial W_{hh}}?

    Answer: Because dW_{hh,0} = dz_0 \cdot h_{-1}^T = dz_0 \cdot \mathbf{0}^T = \mathbf{0}.

  2. Gradient depth: In a 5-step sequence, how many times does W_{hh} appear in the gradient path from L_4 to h_0?

    Answer: 4 times — one multiplication by W_{hh}^T per step backward. This is why gradients vanish/explode over long sequences.

  3. Implement and verify: Write BPTT and verify with numerical gradient checking (error < 1e-5 for all parameters).

Next Steps

You now understand how gradients flow backward through an RNN. But there’s a fundamental problem: for long sequences, the repeated multiplication by W_{hh}^T causes gradients to either vanish (most common) or explode. Next up: Vanishing Gradients & Gated RNNs →, where we’ll see why this happens and the architectures that fix it.