Forward Pass
In the introduction, we learned what an RNN does — it processes sequences while maintaining memory. Now let’s see how it actually computes predictions, step by step.
Step 1: Encoding the Input
Computers don’t understand words — they understand numbers. So we first convert each character or word into a vector.
One-Hot Encoding
The simplest approach: a vector of all 0s except for a single 1 at the position of our character.
If our vocabulary is ['a', 'b', 'c', 'd']:
- ‘a’ → `[1, 0, 0, 0]`
- ‘b’ → `[0, 1, 0, 0]`
- ‘c’ → `[0, 0, 1, 0]`
- ‘d’ → `[0, 0, 0, 1]`
The vector has exactly one “hot” (1) element — hence the name one-hot encoding.
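A minimal sketch of one-hot encoding in NumPy (the vocabulary and the `one_hot` helper name are illustrative, not part of the original code):

```python
import numpy as np

vocab = ['a', 'b', 'c', 'd']
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch, vocab_size):
    """Return a (vocab_size x 1) column vector with a single 1."""
    x = np.zeros((vocab_size, 1))
    x[char_to_idx[ch]] = 1
    return x

x = one_hot('c', len(vocab))  # the 1 lands at index 2
```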
Step 2: Updating the Hidden State
This is the core of an RNN. The network combines two things:
- What it just saw — the current input
- What it remembers — the previous hidden state
Let’s break this down with a concrete example. Imagine processing “to be” and we just reached the letter 'e':
Putting the pieces together, the new hidden state is:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

| Term | Math | Intuition |
|---|---|---|
| Input contribution | $W_{xh} x_t$ | "What signal does 'e' carry?" |
| Memory contribution | $W_{hh} h_{t-1}$ | "What did we learn from 'to b'?" |
| Bias | $b_h$ | Learned offset, adds flexibility |
| Squashing function | $\tanh$ | Keeps values in $[-1, 1]$ so they don't explode |
Why tanh? Without it, each step multiplies values by weight matrices — they’d grow or shrink without bound. tanh acts as a “pressure valve”: no matter what’s fed in, the output stays between −1 and 1.
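A quick sketch of the squashing effect (the input values here are arbitrary):

```python
import numpy as np

# Even huge pre-activation values stay bounded after tanh.
raw = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])
squashed = np.tanh(raw)
# Every output lies in [-1, 1], regardless of the input scale.
```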
Step 3: Computing Output Probabilities
Now we turn the hidden state into a prediction: “What’s the most likely next character?”
Raw Scores (Logits)
First, a linear transformation gives us one raw score per vocabulary item:

$$y_t = W_{hy} h_t + b_y$$

These “logits” can be any real number. Higher score = network thinks that character is more likely.
Softmax: From Scores to Probabilities
Logits don’t sum to 1, so they’re not proper probabilities yet. Softmax fixes that:

$$p_{t,i} = \frac{e^{y_{t,i}}}{\sum_j e^{y_{t,j}}}$$
What softmax does:
- Exponentiates each score (makes everything positive)
- Divides by the total (makes everything sum to 1)
- Higher score → higher probability, but the gap is amplified
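Those three steps fit in a few lines of NumPy (a sketch; the helper name is illustrative):

```python
import numpy as np

def softmax(y):
    """Exponentiate each score, then divide by the total."""
    e = np.exp(y - np.max(y))  # subtracting the max is a stability trick
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
# probs sums to 1, and the ordering of the logits is preserved.
```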
Step 4: Measuring the Error (Loss)
How do we know if our prediction was good? We compare it to what actually came next using cross-entropy loss:

$$L_t = -\log p_t[\text{target}_t]$$
Intuition:
- Predicted the correct character with probability 0.9 → loss = $-\log(0.9) \approx 0.11$ ✅ low
- Predicted it with probability 0.01 → loss = $-\log(0.01) \approx 4.6$ ❌ high
The total loss sums across all timesteps:

$$L = \sum_t L_t = -\sum_t \log p_t[\text{target}_t]$$
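As a quick sketch (the per-step probability vectors and targets here are made up for illustration):

```python
import numpy as np

# Hypothetical per-timestep probability vectors and target indices.
ps = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
targets = [0, 1]

# Sum the negative log-probability of the correct character at each step.
loss = sum(-np.log(p[t]) for p, t in zip(ps, targets))
```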
See It in Action
Step through each word and watch how the hidden state (memory) updates and predictions shift:
Notice: After reading just "I", the memory is sparse. After "I love my", the memory encodes much richer context — and the predictions reflect that.
The Complete Forward Pass
Here’s the full algorithm in Python. It assumes `import numpy as np` and that the weights `Wxh`, `Whh`, `Why`, the biases `bh`, `by`, and `vocab_size` are defined globally. Notice the numerically stable softmax (subtract the max before exponentiation):

```python
def forward_pass(inputs, targets, h_prev):
    """
    Process a sequence and compute loss.

    Args:
        inputs: List of character indices [t0, t1, ...]
        targets: List of target indices, shifted by 1
        h_prev: Initial hidden state (hidden_size x 1)

    Returns:
        loss: Total cross-entropy loss
        cache: Intermediate values needed for backprop
    """
    xs, hs, ys, ps = {}, {}, {}, {}
    hs[-1] = h_prev
    loss = 0
    for t in range(len(inputs)):
        # Step 1: One-hot encode input
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1

        # Step 2: Update hidden state
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)

        # Step 3: Compute output logits
        ys[t] = Why @ hs[t] + by

        # Step 4: Numerically stable softmax
        shifted = ys[t] - np.max(ys[t])
        ps[t] = np.exp(shifted) / np.sum(np.exp(shifted))

        # Step 5: Accumulate cross-entropy loss
        loss += -np.log(ps[t][targets[t], 0] + 1e-8)
    return loss, (xs, hs, ys, ps)
```
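A useful sanity check (a sketch with arbitrary sizes, not from the original): an untrained network with tiny random weights predicts roughly uniformly, so the per-step loss starts near $\log(\text{vocab\_size})$. Here is one forward step done by hand:

```python
import numpy as np

np.random.seed(0)
vocab_size, hidden_size = 4, 8  # illustrative sizes
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01
Whh = np.random.randn(hidden_size, hidden_size) * 0.01
Why = np.random.randn(vocab_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

# One forward step for input index 0, target index 1:
x = np.zeros((vocab_size, 1))
x[0] = 1
h = np.tanh(Wxh @ x + Whh @ np.zeros((hidden_size, 1)) + bh)
y = Why @ h + by
p = np.exp(y - np.max(y)) / np.sum(np.exp(y - np.max(y)))
step_loss = -np.log(p[1, 0])
# With tiny random weights, step_loss is close to log(4) ≈ 1.386.
```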
Explore the Computation Graph
Ready to go deeper? This interactive graph lets you inspect every single node — one-hot vectors, weight matrix multiplications, hidden states, logits, and softmax outputs — as the RNN processes a sentence character by character:
RNN Computation Graph
What to look for:
- xs (blue): One-hot encoded inputs — sparse vectors with a single 1
- hs (green): Hidden states — notice they change at every timestep
- ys (orange): Output logits before softmax
- ps (purple): Final probability distributions
Quick Summary
| Step | What happens | Math |
|---|---|---|
| 1. Encode | Convert input to vector | $x_t$ (one-hot) |
| 2. Update | Combine input + memory | $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ |
| 3. Output | Compute raw scores | $y_t = W_{hy} h_t + b_y$ |
| 4. Softmax | Convert to probabilities | $p_t = \text{softmax}(y_t)$ |
| 5. Loss | Measure prediction error | $L_t = -\log p_t[\text{target}_t]$ |
Numerical Stability in Softmax
The raw softmax formula can silently overflow when logits are large (e.g., $e^{1000}$ overflows to infinity in 64-bit floats). The fix is to subtract the max before exponentiation:
```python
shifted = y - np.max(y)  # safe: largest value becomes 0
p = np.exp(shifted) / np.sum(np.exp(shifted))
```

Subtracting the max doesn’t change the final probabilities (the constant cancels in the division) but prevents floating-point overflow.
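A quick demonstration of the difference (logit values chosen deliberately to force overflow):

```python
import numpy as np

y = np.array([1000.0, 0.0])

# Naive softmax: exp(1000) overflows to inf, and inf/inf gives nan.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(y) / np.sum(np.exp(y))

# Stable softmax: after subtracting the max, the largest exponent is exp(0) = 1.
shifted = y - np.max(y)
stable = np.exp(shifted) / np.sum(np.exp(shifted))
```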
Why Cross-Entropy?
Cross-entropy has three properties that make it ideal for classification:
- Simple gradient: $\frac{\partial L}{\partial y_i} = p_i - \mathbb{1}[i = \text{target}]$ — just subtract 1 from the target’s probability
- Punishes confident wrong answers: Being 99% sure of the wrong answer is penalised far more than being 51% sure
- Probabilistic grounding: Minimising cross-entropy is equivalent to maximising the likelihood of the training data
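The simple-gradient property is easy to verify numerically (a sketch; the logit values and finite-difference step are arbitrary):

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

y = np.array([2.0, 1.0, 0.1])
target = 0

# Analytic gradient of cross-entropy w.r.t. the logits: p - one_hot(target)
p = softmax(y)
analytic = p.copy()
analytic[target] -= 1

# Central finite-difference check on the first logit
def loss(y):
    return -np.log(softmax(y)[target])

eps = 1e-6
y_plus, y_minus = y.copy(), y.copy()
y_plus[0] += eps
y_minus[0] -= eps
numeric = (loss(y_plus) - loss(y_minus)) / (2 * eps)
```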
Truncated Backpropagation Through Time
For very long sequences (an entire book), backpropagating through every single timestep is impractical. Instead:
- Process a chunk (e.g., 100 characters)
- Compute gradients for that chunk only
- Move to the next chunk, carrying the hidden state forward
The hidden state preserves context across chunks, but gradients only flow within each chunk. This trades a small amount of gradient accuracy for a dramatic reduction in memory.
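The chunking scheme can be sketched as a training-loop skeleton (names like `seq_length` are illustrative; the forward/backward calls are stubbed out):

```python
import numpy as np

data = list(range(1000))        # stand-in for a long sequence of char indices
seq_length = 100                # chunk size, e.g. 100 characters
hidden_size = 8
h = np.zeros((hidden_size, 1))  # hidden state carried across chunks

chunks = 0
for start in range(0, len(data) - 1, seq_length):
    inputs = data[start:start + seq_length]
    targets = data[start + 1:start + seq_length + 1]
    # forward + backward would run here, reading and updating h;
    # gradients flow only within this chunk, but h carries context onward.
    chunks += 1
```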
Temperature Sampling
When generating text (not training), we can control creativity with a temperature $T$ that rescales the logits before softmax: $p_i \propto e^{y_i / T}$.
- $T < 1$: Sharpens the distribution — more confident, less varied output
- $T > 1$: Flattens the distribution — more random, more creative output
- $T \to 0$: Always picks the single highest-probability token (greedy decoding)
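A minimal sketch of temperature sampling (function name and logit values are illustrative):

```python
import numpy as np

def sample_with_temperature(logits, T, rng):
    """Scale logits by 1/T, apply softmax, then sample an index."""
    scaled = logits / T
    e = np.exp(scaled - np.max(scaled))
    p = e / e.sum()
    return rng.choice(len(p), p=p), p

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.1])
_, p_sharp = sample_with_temperature(logits, 0.5, rng)  # T < 1: sharper
_, p_flat = sample_with_temperature(logits, 2.0, rng)   # T > 1: flatter
```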
Next Steps
You now know exactly how an RNN makes predictions. But how does it learn — how do the weights $W_{xh}$, $W_{hh}$, $W_{hy}$ get tuned? Next up: Backward Pass & BPTT →, where gradients flow backwards through the sequence to update every parameter.