Forward Pass

In the introduction, we learned what an RNN does — it processes sequences while maintaining memory. Now let’s see how it actually computes predictions, step by step.

Step 1: Encoding the Input

Computers don’t understand words — they understand numbers. So we first convert each character or word into a vector.

One-Hot Encoding

The simplest approach: a vector of all 0s except for a single 1 at the position of our character.

If our vocabulary is ['a', 'b', 'c', 'd']:

  • ‘a’ → [1, 0, 0, 0]
  • ‘b’ → [0, 1, 0, 0]
  • ‘c’ → [0, 0, 1, 0]
  • ‘d’ → [0, 0, 0, 1]

x_t = [0, 0, \ldots, 1, \ldots, 0]^T

The vector has exactly one “hot” (1) element — hence the name one-hot encoding.
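In code this is just a couple of lines of NumPy. Here's a minimal sketch (the `one_hot` helper name is illustrative; the full forward pass later in this section inlines the same logic):

```python
import numpy as np

def one_hot(index, vocab_size):
    """Column vector of zeros with a single 1 at `index`."""
    v = np.zeros((vocab_size, 1))
    v[index] = 1.0
    return v

# Vocabulary ['a', 'b', 'c', 'd']: 'c' sits at index 2
print(one_hot(2, 4).ravel())   # [0. 0. 1. 0.]
```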

🤔 Quick Check
If our vocabulary has 1000 characters, how many elements are in each one-hot vector?

Step 2: Updating the Hidden State

This is the core of an RNN. The network combines two things:

  • What it just saw — the current input x_t
  • What it remembers — the previous hidden state h_{t-1}

h_t = \tanh\!\left(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h\right)

Let’s break this down with a concrete example. Imagine processing “to be” and we just reached the letter 'e':

| Term | What it computes | Intuition |
| --- | --- | --- |
| W_{xh} \cdot x_t | Input contribution | "What signal does 'e' carry?" |
| W_{hh} \cdot h_{t-1} | Memory contribution | "What did we learn from 'to b'?" |
| b_h | Bias | Learned offset, adds flexibility |
| \tanh(\cdot) | Squashing function | Keeps values in [-1, 1] so they don't explode |

Why tanh? Without it, each step multiplies values by weight matrices — they’d grow or shrink without bound. The \tanh function is a “pressure valve”: no matter what’s fed in, the output stays between −1 and 1.
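To see the update in action, here is a toy sketch with randomly initialised, untrained weights (the sizes are arbitrary; they just illustrate the shapes involved):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 4, 3

# Toy, untrained weights just to show the shapes involved
Wxh = rng.normal(0, 0.5, (hidden_size, vocab_size))   # input -> hidden
Whh = rng.normal(0, 0.5, (hidden_size, hidden_size))  # hidden -> hidden
bh = np.zeros((hidden_size, 1))

x_t = np.zeros((vocab_size, 1)); x_t[2] = 1.0   # one-hot for 'c'
h_prev = np.zeros((hidden_size, 1))             # empty memory at t = 0

h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)
print(h_t.ravel())   # every entry is squashed into (-1, 1)
```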

✍️ Fill in the Blanks
The hidden state is computed by combining the current input with the previous hidden state, then applying a non-linear activation.

Step 3: Computing Output Probabilities

Now we turn the hidden state into a prediction: “What’s the most likely next character?”

Raw Scores (Logits)

First, a linear transformation gives us one raw score per vocabulary item:

y_t = W_{hy} \cdot h_t + b_y

These “logits” can be any real number. Higher score = network thinks that character is more likely.

Softmax: From Scores to Probabilities

Logits don’t sum to 1, so they’re not proper probabilities yet. Softmax fixes that:

p_i = \frac{e^{y_i}}{\sum_j e^{y_j}}

What softmax does:

  1. Exponentiates each score (makes everything positive)
  2. Divides by the total (makes everything sum to 1)
  3. Higher score → higher probability, but the gap is amplified
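These three steps translate directly into a few lines of NumPy. A minimal sketch, run on the example logits from the Quick Check below:

```python
import numpy as np

def softmax(y):
    """Exponentiate each score, then normalise so the results sum to 1."""
    e = np.exp(y - np.max(y))   # max-subtraction for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p.round(3))   # ~[0.659 0.242 0.099]: the gap between scores is amplified
print(p.sum())      # 1.0
```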
🤔 Quick Check
If the logits before softmax are [2.0, 1.0, 0.1], which statement is true about the softmax output?

Step 4: Measuring the Error (Loss)

How do we know if our prediction was good? We compare it to what actually came next using cross-entropy loss:

L_t = -\log\!\left(p_t[\text{target}]\right)

Intuition:

  • Predicted the correct character with probability 0.9 → loss = -\log(0.9) \approx 0.11 ✅ low
  • Predicted it with probability 0.01 → loss = -\log(0.01) \approx 4.6 ❌ high

The total loss sums across all timesteps:

L = \sum_t L_t = -\sum_t \log\!\left(p_t[\text{target}_t]\right)
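The two intuition bullets above can be checked directly. A minimal sketch (the probability vectors are made-up examples):

```python
import numpy as np

def cross_entropy(p, target):
    """Negative log-probability the model assigned to the true next character."""
    return -np.log(p[target] + 1e-8)   # epsilon guards against log(0)

confident_right = np.array([0.9, 0.07, 0.03])
confident_wrong = np.array([0.01, 0.5, 0.49])

print(cross_entropy(confident_right, 0))   # ~0.105 : low loss
print(cross_entropy(confident_wrong, 0))   # ~4.6   : high loss
```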

See It in Action

Step through each word and watch how the hidden state (memory) updates and predictions shift:

[Interactive demo: RNN Step-by-Step — watch how memory builds as each word is read. After reading "I", the model's best guess for the next word is "my" (50%), followed by "I" (22%) and "love" (16%).]

Notice: After reading just "I", the memory is sparse. After "I love my", the memory encodes much richer context — and the predictions reflect that.

The Complete Forward Pass

Here’s the full algorithm in Python. Notice the numerically stable softmax (subtract max before exponentiation):

import numpy as np

# Assumes the parameters Wxh, Whh, Why, bh, by and vocab_size are
# defined at module level (initialised before training starts).
def forward_pass(inputs, targets, h_prev):
    """
    Process a sequence and compute loss.

    Args:
        inputs:  List of character indices [t0, t1, ...]
        targets: List of target indices, shifted by 1
        h_prev:  Initial hidden state (hidden_size x 1)

    Returns:
        loss:   Total cross-entropy loss
        cache:  Intermediate values needed for backprop
    """
    xs, hs, ys, ps = {}, {}, {}, {}
    hs[-1] = h_prev
    loss = 0

    for t in range(len(inputs)):
        # Step 1: One-hot encode input
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1

        # Step 2: Update hidden state
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t-1] + bh)

        # Step 3: Compute output logits
        ys[t] = Why @ hs[t] + by

        # Step 4: Numerically stable softmax
        shifted = ys[t] - np.max(ys[t])
        ps[t] = np.exp(shifted) / np.sum(np.exp(shifted))

        # Step 5: Accumulate cross-entropy loss
        loss += -np.log(ps[t][targets[t], 0] + 1e-8)

    return loss, (xs, hs, ys, ps)
🤔 Quick Check
In the forward pass, why do we store all the intermediate values (xs, hs, ys, ps)?

Explore the Computation Graph

Ready to go deeper? This interactive graph lets you inspect every single node — one-hot vectors, weight matrix multiplications, hidden states, logits, and softmax outputs — as the RNN processes a sentence character by character:

[Interactive demo: RNN Computation Graph — step the forward pass and inspect every node: x (input), h (hidden), y (logits), p (softmax).]

What to look for:

  • xs (blue): One-hot encoded inputs — sparse vectors with a single 1
  • hs (green): Hidden states — notice they change at every timestep
  • ys (orange): Output logits before softmax
  • ps (purple): Final probability distributions

Quick Summary

| Step | What happens | Math |
| --- | --- | --- |
| 1. Encode | Convert input to vector | x_t = \text{one\_hot}(\text{input}_t) |
| 2. Update | Combine input + memory | h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h) |
| 3. Output | Compute raw scores | y_t = W_{hy} h_t + b_y |
| 4. Softmax | Convert to probabilities | p_t = \text{softmax}(y_t) |
| 5. Loss | Measure prediction error | L_t = -\log(p_t[\text{target}]) |

Numerical Stability in Softmax

The raw softmax formula can silently overflow when logits are large (e.g., e^{1000} overflows to \infty in floating point). The fix is to subtract the max before exponentiation:

shifted = y - np.max(y)          # safe: largest value becomes 0
p = np.exp(shifted) / np.sum(np.exp(shifted))

Subtracting the max doesn’t change the final probabilities (the constant cancels in the division) but prevents floating-point overflow.

Why Cross-Entropy?

Cross-entropy has three properties that make it ideal for classification:

  • Simple gradient: \frac{\partial L}{\partial y_i} = p_i - \mathbf{1}_{i=\text{target}} — just subtract 1 from the target’s probability
  • Punishes confident wrong answers: Being 99% sure of the wrong answer is penalised far more than being 51% sure
  • Probabilistic grounding: Minimising cross-entropy is equivalent to maximising the likelihood of the training data
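The "simple gradient" claim is easy to verify numerically. The sketch below compares the analytic gradient p_i - \mathbf{1}_{i=\text{target}} against a finite-difference estimate of the same derivative:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

def loss(y, target):
    return -np.log(softmax(y)[target])

y = np.array([2.0, 1.0, 0.1])
target = 1

# Analytic gradient: subtract 1 from the target's probability
analytic = softmax(y).copy()
analytic[target] -= 1.0

# Numerical gradient via central differences
eps = 1e-5
numeric = np.zeros_like(y)
for i in range(len(y)):
    yp, ym = y.copy(), y.copy()
    yp[i] += eps
    ym[i] -= eps
    numeric[i] = (loss(yp, target) - loss(ym, target)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # tiny: the two gradients agree
```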

Truncated Backpropagation Through Time

For very long sequences (an entire book), backpropagating through every single timestep is impractical. Instead:

  1. Process a chunk (e.g., 100 characters)
  2. Compute gradients for that chunk only
  3. Move to the next chunk, carrying the hidden state forward

The hidden state preserves context across chunks, but gradients only flow within each chunk. This trades a small amount of gradient accuracy for a dramatic reduction in memory.
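The chunking loop above can be sketched as follows. Note that `train_step` here is a placeholder name standing in for the real forward pass plus backprop:

```python
import numpy as np

hidden_size, seq_length = 8, 100
data = list(range(1000))   # stand-in for a corpus of character indices

def train_step(inputs, targets, h_prev):
    # Placeholder: a real version would run the forward pass, backprop
    # within the chunk, update the weights, and return the final hidden state.
    return 0.0, h_prev

h = np.zeros((hidden_size, 1))   # carried across chunks, never reset
n_chunks = 0
for start in range(0, len(data) - seq_length, seq_length):
    inputs = data[start : start + seq_length]
    targets = data[start + 1 : start + seq_length + 1]   # shifted by one
    loss, h = train_step(inputs, targets, h)             # gradients stay in-chunk
    n_chunks += 1

print(n_chunks)   # 9 chunks of 100 characters each
```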

Temperature Sampling

When generating text (not training), we can control creativity with temperature \tau:

p_i = \frac{e^{y_i / \tau}}{\sum_j e^{y_j / \tau}}

  • \tau < 1: Sharpens the distribution — more confident, less varied output
  • \tau > 1: Flattens the distribution — more random, more creative output
  • \tau \to 0: Always picks the single highest-probability token (greedy decoding)
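A minimal sketch of temperature sampling (the `temperature_softmax` helper is an illustrative name, and the logits are made-up):

```python
import numpy as np

def temperature_softmax(logits, tau):
    """Softmax over logits / tau, with max-subtraction for stability."""
    z = logits / tau
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p_sharp = temperature_softmax(logits, 0.5)   # tau < 1: sharper
p_plain = temperature_softmax(logits, 1.0)
p_flat = temperature_softmax(logits, 2.0)    # tau > 1: flatter

print(p_sharp.round(3))   # ~[0.864 0.117 0.019]
print(p_flat.round(3))    # ~[0.502 0.304 0.194]

# To generate, sample the next index from the tempered distribution
rng = np.random.default_rng(0)
next_idx = rng.choice(len(logits), p=p_plain)
```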

Next Steps

You now know exactly how an RNN makes predictions. But how does it learn — how do the weights W_{xh}, W_{hh}, W_{hy} get tuned? Next up: Backward Pass & BPTT →, where gradients flow backwards through the sequence to update every parameter.