Backward Pass & Backpropagation Through Time

In the forward pass, we computed predictions and measured our error. Now we need to answer: how should we adjust each weight to reduce that error? This is where the backward pass comes in — computing gradients that tell us the direction and magnitude of change for every parameter.

Why Gradients Matter

A gradient \frac{\partial L}{\partial W} tells us: “If I nudge W by a tiny amount, how much does the loss change?” The sign tells us the direction, the magnitude tells us the sensitivity.

Training is just repeated gradient descent:

W \leftarrow W - \eta \cdot \frac{\partial L}{\partial W}

where \eta is the learning rate.
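As a toy sketch of this update rule (a hypothetical scalar loss, not the RNN itself), repeated gradient-descent steps walk W toward the minimum:

```python
# Toy loss L(W) = (W - 3)^2 with known gradient dL/dW = 2(W - 3)
W = 0.0
eta = 0.1  # learning rate
for _ in range(100):
    grad = 2 * (W - 3.0)
    W = W - eta * grad  # W <- W - eta * dL/dW
# W approaches the minimum at W = 3
```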

The Challenge: Shared Weights

In a feedforward network, each layer has independent weights. In an RNN, the same weights are reused at every timestep:

Feedforward:  x → W₁ → W₂ → W₃ → y    (3 separate weight matrices)
RNN:          x₀ → W → h₁ → W → h₂ → W → h₃    (same W reused)

Consequence: When computing \frac{\partial L}{\partial W}, we must account for all uses of W across time — the total gradient sums contributions from every timestep:

\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W_{hh}}

This is what makes RNN backpropagation special — and why it’s called Backpropagation Through Time (BPTT).

The Chain Rule: Our Main Tool

The backward pass is built entirely on the chain rule. If L depends on y, which depends on h, which depends on W:

\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial h} \cdot \frac{\partial h}{\partial W}

We compute these pieces one at a time, starting from the loss and working backward — hence “backward pass.”
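A scalar sanity check of this factorization (hypothetical numbers, verified against a finite difference):

```python
# L = y^2, y = 3*h, h = 2*w  =>  L(w) = (6w)^2 = 36 w^2
w = 0.5
h = 2.0 * w
y = 3.0 * h

# Chain rule: dL/dw = (dL/dy) * (dy/dh) * (dh/dw) = (2y) * 3 * 2
dL_dw = (2 * y) * 3.0 * 2.0

# Numerical check with a central difference
eps = 1e-6
loss = lambda w: (3.0 * 2.0 * w) ** 2
numerical = (loss(w + eps) - loss(w - eps)) / (2 * eps)
# dL_dw and numerical agree (both ~36 at w = 0.5)
```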

Step-by-Step Gradient Derivation

Recap: Forward Equations

h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h)

y_t = W_{hy} \cdot h_t + b_y

p_t = \text{softmax}(y_t)

L_t = -\log(p_t[\text{target}_t])

Step 1: Gradient of Loss w.r.t. Output (Softmax + Cross-Entropy)

The combined gradient of cross-entropy loss through softmax has a beautifully simple form:

\frac{\partial L_t}{\partial y_t} = p_t - \mathbf{1}_{\text{target}}

where \mathbf{1}_{\text{target}} is a one-hot vector with 1 at the target index.

Example: If the target is index 2 and p_t = [0.1, 0.2, 0.6, 0.1]:

dy_t = [0.1, 0.2, 0.6 - 1.0, 0.1] = [0.1, 0.2, -0.4, 0.1]

The negative value at index 2 says: “increase this logit to make the correct answer more likely.”

For L = -\log(p_{\text{target}}) where p_i = \frac{e^{y_i}}{\sum_j e^{y_j}}:

\frac{\partial L}{\partial y_i} = -\mathbb{1}_{i=\text{target}} + \frac{e^{y_i}}{\sum_j e^{y_j}} = p_i - \mathbb{1}_{i=\text{target}}

🤔 Quick Check
If the model predicts the correct character with probability 0.95, what can you say about the gradient dy at the target index?
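The text's example in code, a minimal sketch of the p_t minus one-hot gradient:

```python
import numpy as np

p_t = np.array([0.1, 0.2, 0.6, 0.1])  # softmax probabilities from the example
target = 2

dy = p_t.copy()
dy[target] -= 1.0  # dy = p_t - one_hot(target)
# dy = [0.1, 0.2, -0.4, 0.1]; the entries always sum to zero,
# since the probabilities sum to 1 and we subtract exactly 1
```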

Step 2: Output Layer Gradients

Since y_t = W_{hy} \cdot h_t + b_y, applying the chain rule:

\frac{\partial L_t}{\partial W_{hy}} = dy_t \cdot h_t^T, \quad \frac{\partial L_t}{\partial b_y} = dy_t
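In NumPy these two gradients are one outer product and a pass-through (a sketch with hypothetical sizes H = 4, V = 6):

```python
import numpy as np

H, V = 4, 6                  # hypothetical hidden and vocab sizes
rng = np.random.default_rng(0)
dy = rng.standard_normal(V)  # gradient at the logits y_t
h_t = rng.standard_normal(H)

dWhy = np.outer(dy, h_t)  # dL_t/dWhy = dy_t * h_t^T, shape (V, H)
dby = dy                  # dL_t/dby  = dy_t, shape (V,)
```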

Step 3: Gradient Flowing into the Hidden State

The gradient reaches h_t from the output layer: dh_t = W_{hy}^T \cdot dy_t.

But here’s the crucial part — in an RNN, h_t also influences h_{t+1}, which influences h_{t+2}, and so on. So the total gradient at h_t includes contributions from all future timesteps:

dh_t^{\text{total}} = W_{hy}^T \cdot dy_t + dh_{\text{next}}

where dh_{\text{next}} is the gradient flowing back from timestep t+1. This is the “through time” in BPTT.
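A small sketch of the two-source sum (hypothetical shapes; the point is that the two gradients simply add):

```python
import numpy as np

H, V = 4, 6
rng = np.random.default_rng(0)
Why = rng.standard_normal((V, H))
dy_t = rng.standard_normal(V)     # gradient from this step's output
dh_next = rng.standard_normal(H)  # gradient arriving from timestep t+1

dh_total = Why.T @ dy_t + dh_next  # both sources summed elementwise
```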

✍️ Fill in the Blanks
The hidden state gradient has two sources: the output layer at the current timestep, and the gradient from the next timestep flowing backward through time.

Step 4: Backpropagating Through tanh

Recall that h_t = \tanh(z_t) where z_t = W_{xh} x_t + W_{hh} h_{t-1} + b_h.

The derivative of tanh is:

\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z) = 1 - h_t^2

So the gradient through tanh is:

dz_t = (1 - h_t^2) \odot dh_t

Key insight: When h_t \approx \pm 1 (saturated), (1 - h_t^2) \approx 0 and the gradient vanishes at that neuron. When h_t \approx 0, the gradient passes through freely.

| h_t value | 1 - h_t^2 | Gradient effect |
|---|---|---|
| 0.0 | 1.0 | Passes through fully |
| ±0.5 | 0.75 | Mild reduction |
| ±0.9 | 0.19 | 81% lost |
| ±0.99 | 0.02 | 98% lost (nearly killed) |
🤔 Quick Check
If a hidden unit has value h_t = 0.99, what happens to the gradient flowing through it?
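The table's attenuation factors can be reproduced directly from the derivative formula:

```python
# Local gradient of tanh, expressed through its own output h: 1 - h^2
factors = {h: 1 - h ** 2 for h in [0.0, 0.5, 0.9, 0.99]}
# {0.0: 1.0, 0.5: 0.75, 0.9: ~0.19, 0.99: ~0.02}
```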

Step 5: Weight Gradients for the Hidden Layer

Now that we have dz_t, we can compute gradients for all hidden-layer parameters:

\frac{\partial L_t}{\partial W_{xh}} = dz_t \cdot x_t^T, \quad \frac{\partial L_t}{\partial W_{hh}} = dz_t \cdot h_{t-1}^T, \quad \frac{\partial L_t}{\partial b_h} = dz_t

Critical: These are accumulated with += across all timesteps because the same weights are shared. The total gradient is the sum of contributions from every step.

Step 6: Pass Gradient to Previous Timestep

Finally, we send the gradient backward to h_{t-1} so the chain can continue:

dh_{\text{next}} = W_{hh}^T \cdot dz_t

This becomes the dh_{\text{next}} term in Step 3 for the previous timestep, completing the loop.

Visualizing the Gradient Flow

Here’s how gradients flow through a 3-step RNN:

        L₀          L₁          L₂        (losses)
        ↑           ↑           ↑
       dy₀         dy₁         dy₂        (Step 1: softmax gradient)
        ↑           ↑           ↑
  dWhy,dby    dWhy,dby    dWhy,dby        (Step 2: output weights — accumulated)
        ↑           ↑           ↑
       dh₀    ←    dh₁    ←    dh₂        (Step 3: hidden gradient — two sources)
        ↑           ↑           ↑
     dh_raw₀     dh_raw₁     dh_raw₂     (Step 4: through tanh)
        ↑           ↑           ↑
  dWxh,dWhh,dbh  (all accumulated)         (Step 5: hidden weights)
        ↑           ↑
    dh_next₀    dh_next₁                  (Step 6: backward chain)

The arrows pointing left (←) show the “through time” gradient chain — this is what makes BPTT special and what causes the vanishing/exploding gradient problem for long sequences.

Quick Summary

| Step | What | Math |
|---|---|---|
| 1 | Output gradient | dy = p_t - \mathbf{1}_{\text{target}} |
| 2 | Output weights | dW_{hy} \mathrel{+}= dy \cdot h_t^T |
| 3 | Hidden gradient | dh = W_{hy}^T dy + dh_{\text{next}} |
| 4 | Through tanh | dz = (1 - h_t^2) \odot dh |
| 5 | Hidden weights | dW_{xh} \mathrel{+}= dz \cdot x_t^T |
| 6 | Time chain | dh_{\text{next}} = W_{hh}^T \cdot dz |

Gradient Clipping: Taming Explosions

When sequences are long, the repeated multiplication by W_{hh}^T can cause gradients to grow exponentially. If the largest eigenvalue of W_{hh} is \lambda > 1:

\|dh_1\| \approx \lambda^T \cdot \|dh_T\|

For \lambda = 1.1 and T = 100: 1.1^{100} \approx 13{,}780 — the gradient explodes!
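This is easy to simulate with a hypothetical diagonal W_{hh}: an eigenvalue above 1 explodes the gradient, one below 1 vanishes it:

```python
import numpy as np

Whh = np.diag([1.1, 0.5])  # eigenvalues straddling 1
dh = np.ones(2)
for _ in range(100):       # 100 backward steps: dh <- Whh^T @ dh
    dh = Whh.T @ dh
# dh[0] has exploded (~1.1**100, about 13,780); dh[1] has vanished (~0.5**100)
```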

Norm clipping scales the entire gradient vector if its norm exceeds a threshold. This preserves the gradient’s direction while bounding its magnitude:

gclipped={gif gmax_normgmax_normgotherwiseg_{\text{clipped}} = \begin{cases} g & \text{if } \|g\| \leq \text{max\_norm} \\ g \cdot \frac{\text{max\_norm}}{\|g\|} & \text{otherwise} \end{cases}

An alternative is value clipping (clamp each element to [-5, 5]) — simpler but doesn’t preserve gradient direction.
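A sketch contrasting the two (the key difference is whether the direction survives):

```python
import numpy as np

g = np.array([8.0, 0.5])
max_norm = 5.0

# Norm clipping: rescale the whole vector; direction is preserved
norm = np.linalg.norm(g)
g_norm_clipped = g * (max_norm / norm) if norm > max_norm else g

# Value clipping: clamp each element independently; direction can change
g_value_clipped = np.clip(g, -5.0, 5.0)

unit = lambda v: v / np.linalg.norm(v)
# unit(g_norm_clipped) matches unit(g); unit(g_value_clipped) does not
```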

Verifying with Numerical Gradients

Backprop bugs are notoriously hard to spot — the code runs, loss decreases (slowly), but gradients are subtly wrong. Always verify with finite differences:

LwL(w+ϵ)L(wϵ)2ϵ\frac{\partial L}{\partial w} \approx \frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon}

Compare your analytical gradient against this numerical approximation:

| Relative Error | Diagnosis |
|---|---|
| < 1e-7 | Perfect ✅ |
| 1e-7 to 1e-5 | Good (numerical precision limit) |
| 1e-5 to 1e-3 | Suspicious — check implementation ⚠️ |
| > 1e-3 | Bug in backprop ❌ |
🤔 Quick Check
Why do we use (L(w+ε) - L(w-ε)) / 2ε instead of (L(w+ε) - L(w)) / ε for gradient checking?
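One way to see the answer empirically: compare both formulas on a function with a known derivative (sin, whose derivative is cos). The central difference has O(ε²) error versus O(ε) for the one-sided version:

```python
import numpy as np

f, x, eps = np.sin, 1.0, 1e-4
true_grad = np.cos(x)

forward_diff = (f(x + eps) - f(x)) / eps              # error ~ O(eps)
central_diff = (f(x + eps) - f(x - eps)) / (2 * eps)  # error ~ O(eps^2)
# central_diff is orders of magnitude closer to true_grad
```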

Common Mistakes

| Mistake | Wrong | Correct |
|---|---|---|
| Overwriting gradients | dW_{xh} = \ldots | dW_{xh} \mathrel{+}= \ldots |
| Wrong transpose | W_{hy} \cdot dy | W_{hy}^T \cdot dy |
| Wrong hidden state | dz \cdot h_t^T | dz \cdot h_{t-1}^T |
| Forgetting dh_next | dh = W_{hy}^T dy | dh = W_{hy}^T dy + dh_{\text{next}} |
| Not clipping | Direct update | Clip norm before update |

A powerful debugging technique — if the matrix shapes don’t match, the gradient is wrong:

| Parameter | Shape | Gradient Computed As |
|---|---|---|
| W_{xh} | (H, V) | dz_t \cdot x_t^T — outer product of (H,) and (V,) |
| W_{hh} | (H, H) | dz_t \cdot h_{t-1}^T — outer product of (H,) and (H,) |
| W_{hy} | (V, H) | dy_t \cdot h_t^T — outer product of (V,) and (H,) |
| b_h | (H,) | dz_t directly |
| b_y | (V,) | dy_t directly |

Where H = hidden size, V = vocabulary size.

Rule of thumb: The gradient of a weight matrix W used as W \cdot a is the outer product of the upstream gradient and a.
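The shape table doubles as executable assertions (hypothetical sizes H = 5, V = 8; only the shapes matter here):

```python
import numpy as np

H, V = 5, 8  # hypothetical hidden and vocab sizes
Wxh, Whh, Why = np.zeros((H, V)), np.zeros((H, H)), np.zeros((V, H))
dz = np.ones(H)  # stands in for dz_t
dy = np.ones(V)  # stands in for dy_t
x = np.ones(V)   # one-hot input
h_prev, h = np.ones(H), np.ones(H)

assert np.outer(dz, x).shape == Wxh.shape       # (H, V)
assert np.outer(dz, h_prev).shape == Whh.shape  # (H, H)
assert np.outer(dy, h).shape == Why.shape       # (V, H)
```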

def backward_pass(targets, cache):
    """Compute gradients via Backpropagation Through Time.

    Assumes `import numpy as np` and module-level parameters
    Wxh, Whh, Why, bh, by (the same arrays used in the forward pass).
    """
    xs, hs, ps = cache['xs'], cache['hs'], cache['ps']
    T = len(targets)

    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dh_next = np.zeros_like(hs[0])

    for t in reversed(range(T)):
        dy = ps[t].copy()
        dy[targets[t]] -= 1                          # Step 1

        dWhy += np.outer(dy, hs[t]); dby += dy       # Step 2
        dh = Why.T @ dy + dh_next                     # Step 3
        dh_raw = (1 - hs[t] ** 2) * dh                # Step 4

        h_prev = hs[t-1] if t > 0 else np.zeros_like(hs[0])
        dWxh += np.outer(dh_raw, xs[t])               # Step 5
        dWhh += np.outer(dh_raw, h_prev)
        dbh  += dh_raw

        dh_next = Whh.T @ dh_raw                      # Step 6

    return dWxh, dWhh, dWhy, dbh, dby

def clip_gradients(grads, max_norm=5.0):
    """Scale gradients if their total norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return grads

def numerical_gradient(param, idx, loss_fn, eps=1e-5):
    """Verify analytical gradients with finite differences."""
    old_val = param.flat[idx]
    param.flat[idx] = old_val + eps; loss_plus = loss_fn()
    param.flat[idx] = old_val - eps; loss_minus = loss_fn()
    param.flat[idx] = old_val
    return (loss_plus - loss_minus) / (2 * eps)
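Putting the pieces together, exercise 3 below can be sketched end to end. This is a self-contained version with tiny hypothetical sizes, its own forward pass, and an inlined BPTT (mirroring Steps 1–6) so the block runs on its own:

```python
import numpy as np

np.random.seed(0)
H, V = 4, 5  # tiny hypothetical sizes so the check is fast
Wxh = np.random.randn(H, V) * 0.1
Whh = np.random.randn(H, H) * 0.1
Why = np.random.randn(V, H) * 0.1
bh, by = np.zeros(H), np.zeros(V)
inputs, targets = [0, 2, 4], [2, 4, 1]

def forward(inputs, targets):
    """Forward pass; returns loss and the cache needed by BPTT."""
    xs, hs, ps = {}, {-1: np.zeros(H)}, {}
    loss = 0.0
    for t in range(len(inputs)):
        xs[t] = np.zeros(V); xs[t][inputs[t]] = 1.0
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)
        y = Why @ hs[t] + by
        e = np.exp(y - y.max())
        ps[t] = e / e.sum()
        loss += -np.log(ps[t][targets[t]])
    return loss, {'xs': xs, 'hs': hs, 'ps': ps}

def backward(targets, cache):
    """BPTT, exactly Steps 1-6 from the derivation."""
    xs, hs, ps = cache['xs'], cache['hs'], cache['ps']
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dh_next = np.zeros(H)
    for t in reversed(range(len(targets))):
        dy = ps[t].copy(); dy[targets[t]] -= 1  # Step 1
        dWhy += np.outer(dy, hs[t]); dby += dy  # Step 2
        dh = Why.T @ dy + dh_next               # Step 3
        dz = (1 - hs[t] ** 2) * dh              # Step 4
        dWxh += np.outer(dz, xs[t])             # Step 5
        dWhh += np.outer(dz, hs[t - 1]); dbh += dz
        dh_next = Whh.T @ dz                    # Step 6
    return dWxh, dWhh, dWhy, dbh, dby

loss, cache = forward(inputs, targets)
grads = backward(targets, cache)

# Check a few entries of every gradient against central differences
eps = 1e-5
for param, grad in zip([Wxh, Whh, Why, bh, by], grads):
    for idx in range(min(3, param.size)):
        old = param.flat[idx]
        param.flat[idx] = old + eps; loss_plus = forward(inputs, targets)[0]
        param.flat[idx] = old - eps; loss_minus = forward(inputs, targets)[0]
        param.flat[idx] = old
        numerical = (loss_plus - loss_minus) / (2 * eps)
        assert abs(grad.flat[idx] - numerical) < 1e-6
```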

Computational Complexity

For sequence length T, hidden size H, vocabulary size V:

| | Time | Memory |
|---|---|---|
| Forward pass | O(T \cdot H \cdot (H + V)) | O(T \cdot (H + V)) |
| Backward pass | O(T \cdot H \cdot (H + V)) | Same (reuses forward cache) |

The H \cdot H term comes from W_{hh} \cdot h_{t-1} and the H \cdot V terms from the input and output projections.

BPTT has the same time complexity as the forward pass. Memory scales linearly with sequence length because we store all hidden states for backprop.

Exercises

  1. Gradient accumulation: In a 3-step sequence with h_{-1} = 0, why does t=0 contribute nothing to \frac{\partial L}{\partial W_{hh}}?

    Answer: Because dW_{hh,0} = dz_0 \cdot h_{-1}^T = dz_0 \cdot \mathbf{0}^T = \mathbf{0}.

  2. Gradient depth: In a 5-step sequence, how many times does W_{hh} appear in the gradient path from L_4 to h_0?

    Answer: 4 times — one multiplication by W_{hh}^T per step backward. This is why gradients vanish/explode over long sequences.

  3. Implement and verify: Write BPTT and verify with numerical gradient checking (error < 1e-5 for all parameters).

Next Steps

You now understand how gradients flow backward through an RNN. But there’s a fundamental problem: for long sequences, the repeated multiplication by W_{hh}^T causes gradients to either vanish (most common) or explode. Next up: Vanishing Gradients & Gated RNNs →, where we’ll see why this happens and the architectures that fix it.