Forward Pass

In the introduction, we learned what an RNN does — it processes sequences while maintaining memory. Now let’s see how it actually computes predictions, step by step.

Step 1: Encoding the Input

Computers don’t understand words — they understand numbers. So we first convert each character or word into a vector.

One-Hot Encoding

The simplest approach: a vector of all 0s except for a single 1 at the position of our character.

If our vocabulary is ['a', 'b', 'c', 'd']:

  • ‘a’ → [1, 0, 0, 0]
  • ‘b’ → [0, 1, 0, 0]
  • ‘c’ → [0, 0, 1, 0]
  • ‘d’ → [0, 0, 0, 1]

x_t = [0, 0, \ldots, 1, \ldots, 0]^T

The vector has exactly one “hot” (1) element — hence the name one-hot encoding.
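In code this is just a couple of lines of NumPy. Here's a minimal sketch (the `one_hot` helper name is illustrative; the full forward pass later in this section inlines the same logic):

```python
import numpy as np

def one_hot(index, vocab_size):
    """Column vector of zeros with a single 1 at `index`."""
    v = np.zeros((vocab_size, 1))
    v[index] = 1.0
    return v

# Vocabulary ['a', 'b', 'c', 'd']: 'c' sits at index 2
print(one_hot(2, 4).ravel())   # [0. 0. 1. 0.]
```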

🤔 Quick Check
If our vocabulary has 1000 characters, how many elements are in each one-hot vector?

Step 2: Updating the Hidden State

This is the core of an RNN. The network combines two things:

  • What it just saw — the current input x_t
  • What it remembers — the previous hidden state h_{t-1}

h_t = \tanh\!\left(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h\right)

Let’s break this down with a concrete example. Imagine processing “to be” and we just reached the letter 'e':

| Term | What it computes | Intuition |
| --- | --- | --- |
| W_{xh} \cdot x_t | Input contribution | "What signal does 'e' carry?" |
| W_{hh} \cdot h_{t-1} | Memory contribution | "What did we learn from 'to b'?" |
| b_h | Bias | Learned offset, adds flexibility |
| \tanh(\cdot) | Squashing function | Keeps values in [-1, 1] so they don't explode |

Why tanh? Without it, each step multiplies values by weight matrices — they’d grow or shrink without bound. The \tanh function is a “pressure valve”: no matter what’s fed in, the output stays between −1 and 1.
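To see the update in action, here is a toy sketch with randomly initialised, untrained weights (the sizes are arbitrary; they just illustrate the shapes involved):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 4, 3

# Toy, untrained weights just to show the shapes involved
Wxh = rng.normal(0, 0.5, (hidden_size, vocab_size))   # input -> hidden
Whh = rng.normal(0, 0.5, (hidden_size, hidden_size))  # hidden -> hidden
bh = np.zeros((hidden_size, 1))

x_t = np.zeros((vocab_size, 1)); x_t[2] = 1.0   # one-hot for 'c'
h_prev = np.zeros((hidden_size, 1))             # empty memory at t = 0

h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)
print(h_t.ravel())   # every entry is squashed into (-1, 1)
```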

✍️ Fill in the Blanks
The hidden state is computed by combining the current input with the previous hidden state, then applying a non-linear activation.

Step 3: Computing Output Probabilities

Now we turn the hidden state into a prediction: “What’s the most likely next character?”

Raw Scores (Logits)

First, a linear transformation gives us one raw score per vocabulary item:

y_t = W_{hy} \cdot h_t + b_y

These “logits” can be any real number. Higher score = network thinks that character is more likely.

Softmax: From Scores to Probabilities

Logits don’t sum to 1, so they’re not proper probabilities yet. Softmax fixes that:

p_i = \frac{e^{y_i}}{\sum_j e^{y_j}}

What softmax does:

  1. Exponentiates each score (makes everything positive)
  2. Divides by the total (makes everything sum to 1)
  3. Higher score → higher probability, but the gap is amplified
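These three steps translate directly into a few lines of NumPy. A minimal sketch, run on the example logits from the Quick Check below:

```python
import numpy as np

def softmax(y):
    """Exponentiate each score, then normalise so the results sum to 1."""
    e = np.exp(y - np.max(y))   # max-subtraction for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p.round(3))   # ~[0.659 0.242 0.099]: the gap between scores is amplified
print(p.sum())      # 1.0
```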
🤔 Quick Check
If the logits before softmax are [2.0, 1.0, 0.1], which statement is true about the softmax output?

Step 4: Measuring the Error (Loss)

How do we know if our prediction was good? We compare it to what actually came next using cross-entropy loss:

L_t = -\log\!\left(p_t[\text{target}]\right)

Intuition:

  • Predicted the correct character with probability 0.9 → loss = -\log(0.9) \approx 0.11 ✅ low
  • Predicted it with probability 0.01 → loss = -\log(0.01) \approx 4.6 ❌ high

The total loss sums across all timesteps:

L = \sum_t L_t = -\sum_t \log\!\left(p_t[\text{target}_t]\right)
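The two intuition bullets above can be checked directly. A minimal sketch (the probability vectors are made-up examples):

```python
import numpy as np

def cross_entropy(p, target):
    """Negative log-probability the model assigned to the true next character."""
    return -np.log(p[target] + 1e-8)   # epsilon guards against log(0)

confident_right = np.array([0.9, 0.07, 0.03])
confident_wrong = np.array([0.01, 0.5, 0.49])

print(cross_entropy(confident_right, 0))   # ~0.105 : low loss
print(cross_entropy(confident_wrong, 0))   # ~4.6   : high loss
```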

See It in Action

Step through each word and watch how the hidden state (memory) updates and predictions shift:

[Interactive demo: RNN Step-by-Step — watch how memory builds as each word is read. After reading "I", the model's best guess for the next word is "my" (50%), followed by "I" (22%) and "love" (16%).]

Notice: After reading just "I", the memory is sparse. After "I love my", the memory encodes much richer context — and the predictions reflect that.

The Complete Forward Pass

Here’s the full algorithm in Python. Notice the numerically stable softmax (subtract max before exponentiation):

import numpy as np

# Assumes the parameters Wxh, Whh, Why, bh, by and vocab_size are
# defined at module level (initialised before training starts).
def forward_pass(inputs, targets, h_prev):
    """
    Process a sequence and compute loss.

    Args:
        inputs:  List of character indices [t0, t1, ...]
        targets: List of target indices, shifted by 1
        h_prev:  Initial hidden state (hidden_size x 1)

    Returns:
        loss:   Total cross-entropy loss
        cache:  Intermediate values needed for backprop
    """
    xs, hs, ys, ps = {}, {}, {}, {}
    hs[-1] = h_prev
    loss = 0

    for t in range(len(inputs)):
        # Step 1: One-hot encode input
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1

        # Step 2: Update hidden state
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t-1] + bh)

        # Step 3: Compute output logits
        ys[t] = Why @ hs[t] + by

        # Step 4: Numerically stable softmax
        shifted = ys[t] - np.max(ys[t])
        ps[t] = np.exp(shifted) / np.sum(np.exp(shifted))

        # Step 5: Accumulate cross-entropy loss
        loss += -np.log(ps[t][targets[t], 0] + 1e-8)

    return loss, (xs, hs, ys, ps)
🤔 Quick Check
In the forward pass, why do we store all the intermediate values (xs, hs, ys, ps)?

Explore the Computation Graph

Ready to go deeper? This interactive graph lets you inspect every single node — one-hot vectors, weight matrix multiplications, hidden states, logits, and softmax outputs — as the RNN processes a sentence character by character:

[Interactive demo: RNN Computation Graph — step the forward pass and inspect every node: x (input), h (hidden), y (logits), p (softmax).]

What to look for:

  • xs (blue): One-hot encoded inputs — sparse vectors with a single 1
  • hs (green): Hidden states — notice they change at every timestep
  • ys (orange): Output logits before softmax
  • ps (purple): Final probability distributions

Quick Summary

| Step | What happens | Math |
| --- | --- | --- |
| 1. Encode | Convert input to vector | x_t = \text{one\_hot}(\text{input}_t) |
| 2. Update | Combine input + memory | h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h) |
| 3. Output | Compute raw scores | y_t = W_{hy} h_t + b_y |
| 4. Softmax | Convert to probabilities | p_t = \text{softmax}(y_t) |
| 5. Loss | Measure prediction error | L_t = -\log(p_t[\text{target}]) |

Numerical Stability in Softmax

The raw softmax formula can silently overflow when logits are large (e.g., e^{1000} overflows to \infty in floating point). The fix is to subtract the max before exponentiation:

shifted = y - np.max(y)          # safe: largest value becomes 0
p = np.exp(shifted) / np.sum(np.exp(shifted))

Subtracting the max doesn’t change the final probabilities (the constant cancels in the division) but prevents floating-point overflow.

Why Cross-Entropy?

Cross-entropy has three properties that make it ideal for classification:

  • Simple gradient: \frac{\partial L}{\partial y_i} = p_i - \mathbf{1}_{i=\text{target}} — just subtract 1 from the target’s probability
  • Punishes confident wrong answers: Being 99% sure of the wrong answer is penalised far more than being 51% sure
  • Probabilistic grounding: Minimising cross-entropy is equivalent to maximising the likelihood of the training data
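The "simple gradient" claim is easy to verify numerically. The sketch below compares the analytic gradient p_i - \mathbf{1}_{i=\text{target}} against a finite-difference estimate of the same derivative:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

def loss(y, target):
    return -np.log(softmax(y)[target])

y = np.array([2.0, 1.0, 0.1])
target = 1

# Analytic gradient: subtract 1 from the target's probability
analytic = softmax(y).copy()
analytic[target] -= 1.0

# Numerical gradient via central differences
eps = 1e-5
numeric = np.zeros_like(y)
for i in range(len(y)):
    yp, ym = y.copy(), y.copy()
    yp[i] += eps
    ym[i] -= eps
    numeric[i] = (loss(yp, target) - loss(ym, target)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # tiny: the two gradients agree
```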

Truncated Backpropagation Through Time

For very long sequences (an entire book), backpropagating through every single timestep is impractical. Instead:

  1. Process a chunk (e.g., 100 characters)
  2. Compute gradients for that chunk only
  3. Move to the next chunk, carrying the hidden state forward

The hidden state preserves context across chunks, but gradients only flow within each chunk. This trades a small amount of gradient accuracy for a dramatic reduction in memory.
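The chunking loop above can be sketched as follows. Note that `train_step` here is a placeholder name standing in for the real forward pass plus backprop:

```python
import numpy as np

hidden_size, seq_length = 8, 100
data = list(range(1000))   # stand-in for a corpus of character indices

def train_step(inputs, targets, h_prev):
    # Placeholder: a real version would run the forward pass, backprop
    # within the chunk, update the weights, and return the final hidden state.
    return 0.0, h_prev

h = np.zeros((hidden_size, 1))   # carried across chunks, never reset
n_chunks = 0
for start in range(0, len(data) - seq_length, seq_length):
    inputs = data[start : start + seq_length]
    targets = data[start + 1 : start + seq_length + 1]   # shifted by one
    loss, h = train_step(inputs, targets, h)             # gradients stay in-chunk
    n_chunks += 1

print(n_chunks)   # 9 chunks of 100 characters each
```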

Temperature Sampling

When generating text (not training), we can control creativity with temperature \tau:

p_i = \frac{e^{y_i / \tau}}{\sum_j e^{y_j / \tau}}

  • \tau < 1: Sharpens the distribution — more confident, less varied output
  • \tau > 1: Flattens the distribution — more random, more creative output
  • \tau \to 0: Always picks the single highest-probability token (greedy decoding)
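A minimal sketch of temperature sampling (the `temperature_softmax` helper is an illustrative name, and the logits are made-up):

```python
import numpy as np

def temperature_softmax(logits, tau):
    """Softmax over logits / tau, with max-subtraction for stability."""
    z = logits / tau
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p_sharp = temperature_softmax(logits, 0.5)   # tau < 1: sharper
p_plain = temperature_softmax(logits, 1.0)
p_flat = temperature_softmax(logits, 2.0)    # tau > 1: flatter

print(p_sharp.round(3))   # ~[0.864 0.117 0.019]
print(p_flat.round(3))    # ~[0.502 0.304 0.194]

# To generate, sample the next index from the tempered distribution
rng = np.random.default_rng(0)
next_idx = rng.choice(len(logits), p=p_plain)
```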

Next Steps

You now know exactly how an RNN makes predictions. But how does it learn — how do the weights W_{xh}, W_{hh}, W_{hy} get tuned? Next up: Backward Pass & BPTT →, where gradients flow backwards through the sequence to update every parameter.