Seq2Seq Attention
Attention was born from a simple problem: how do you translate a long sentence without losing information? The answer changed everything about how neural networks process language.
The Encoder-Decoder Architecture
Many tasks involve transforming one sequence into another:
| Task | Input | Output |
|---|---|---|
| Translation | "Hello world" | "Bonjour monde" |
| Summarization | Long article | Short summary |
| Q&A | Question + context | Answer |
Vanilla Seq2Seq
```
Encoder:  x₁ → h₁ → h₂ → h₃ → [context vector c]
                                      ↓
Decoder:                        c → s₁ → s₂ → s₃
                                     ↓    ↓    ↓
                                     y₁   y₂   y₃
```
The encoder processes the entire input and compresses it into a single fixed-size vector. The decoder then generates the output using only this one vector.
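The compression step can be sketched with a toy NumPy RNN. This is an illustrative sketch only — the weight names `Wx`, `Wh` and the dimensions are made up, not from any particular framework:

```python
import numpy as np

def encode(inputs, Wx, Wh):
    """Toy RNN encoder: fold an entire sequence into one fixed-size vector."""
    h = np.zeros(Wh.shape[0])
    for x in inputs:                  # read tokens left to right
        h = np.tanh(Wx @ x + Wh @ h)  # the state is overwritten at every step
    return h                          # the single context vector c

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
Wx = rng.normal(size=(d_h, d_in)) * 0.5
Wh = rng.normal(size=(d_h, d_h)) * 0.5

c_short = encode(rng.normal(size=(3, d_in)), Wx, Wh)    # 3-token input
c_long = encode(rng.normal(size=(100, d_in)), Wx, Wh)   # 100-token input
print(c_short.shape, c_long.shape)  # both are the same fixed size
```

However long the input, the decoder sees only those `d_h` numbers — which is exactly the bottleneck discussed next.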
The Bottleneck Problem
All information about the entire input must fit in one vector (typically 512 floats). For a 100-word sentence, this means massive information loss — especially for details at the start of the input, which are processed first and most likely to be forgotten.
| Input length | Information per word in context vector |
|---|---|
| 10 words | ~51 floats worth |
| 50 words | ~10 floats worth |
| 100 words | ~5 floats worth |
This is like trying to summarize a novel in a single tweet — something has to give.
Attention: The Solution
Key Insight
Instead of one context vector for the entire output, compute a different context vector $c_t$ for each decoder step by looking back at all encoder states:

$$c_t = \sum_i \alpha_{t,i} h_i$$

where $\alpha_{t,i}$ is the attention weight: "how much should decoder step $t$ focus on encoder step $i$?"
```
Encoder states:     h₁    h₂    h₃    h₄
                    ↓     ↓     ↓     ↓
Attention weights:  0.1   0.6   0.2   0.1   (for decoder step t)
                    ↓     ↓     ↓     ↓
                    ──────────────────────
                         context c_t
```
For translating "world" → "monde", attention focuses on h₂ (where "world" was encoded), not on all encoder states equally.
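The weighted sum is only a few lines of NumPy. A minimal sketch using the weights from the diagram above — the encoder states here are made-up toy vectors chosen so the result is easy to check by hand:

```python
import numpy as np

# Toy encoder states h₁..h₄ as rows, and the weights from the diagram.
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
alpha = np.array([0.1, 0.6, 0.2, 0.1])  # attention weights for decoder step t

c_t = alpha @ H                         # weighted sum of encoder states
print(c_t)                              # → [0.2 0.7 0.2]
```

Because the weights sum to 1, the context is a convex combination of the encoder states — dominated here by h₂.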
*Interactive demo (not reproduced here): Seq2Seq cross-attention alignment for an English → French translation — step through the decoder to see how attention shifts across source words at each generation step.*

Attention Mechanisms
Bahdanau (Additive) Attention
The original attention mechanism (Bahdanau et al., 2014) uses a small neural network to compute compatibility:

$$e_{t,i} = v^\top \tanh(W_s s_{t-1} + W_h h_i)$$

| Term | What it does |
|---|---|
| $s_{t-1}$ | Current decoder state ("what am I looking for?") |
| $h_i$ | Each encoder state ("what information is here?") |
| $e_{t,i}$ | Compatibility score between decoder query and encoder key |
| $\alpha_{t,i}$ | Softmax-normalized weight (how much to attend) |
| $c_t$ | Weighted sum of encoder states (the context for this step) |
Luong (Multiplicative) Attention
A simpler variant (Luong et al., 2015) uses dot products instead of a neural network:
| Variant | Score function | Notes |
|---|---|---|
| Dot | $e_{t,i} = s_t^\top h_i$ | Simplest, requires same dimensions |
| General | $e_{t,i} = s_t^\top W h_i$ | Learnable $W$, allows different dimensions |
| Concat | $e_{t,i} = v^\top \tanh(W[s_t; h_i])$ | Same form as Bahdanau |
The dot product variant became the basis for self-attention in Transformers.
```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def bahdanau_attention(s_prev, encoder_outputs, Ws, Wh, v):
    """Bahdanau (additive) attention: e_i = v · tanh(Ws s + Wh h_i)."""
    src_len = encoder_outputs.shape[0]
    scores = np.zeros(src_len)
    for i in range(src_len):
        combined = np.tanh(Ws @ s_prev + Wh @ encoder_outputs[i])
        scores[i] = v @ combined
    weights = softmax(scores)            # normalize scores into a distribution
    context = weights @ encoder_outputs  # weighted sum of encoder states
    return context, weights

def luong_attention(s_curr, encoder_outputs, W=None):
    """Luong (multiplicative) attention: dot variant, or general if W is given."""
    if W is not None:
        scores = encoder_outputs @ W @ s_curr  # general: s · W · h_i
    else:
        scores = encoder_outputs @ s_curr      # dot: s · h_i
    weights = softmax(scores)
    context = weights @ encoder_outputs
    return context, weights
```

Attention Alignments
What Attention Learns
For translation “the cat sat” → “le chat assis”:
| | le | chat | assis |
|---|---|---|---|
| the | 0.8 | 0.1 | 0.1 |
| cat | 0.1 | 0.8 | 0.1 |
| sat | 0.1 | 0.1 | 0.8 |
Attention learns to align source and target words — without any explicit alignment supervision!
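One way to read hard alignments out of the soft weights is to pick, for each target word, the most-attended source word. A small sketch using the matrix above (the array layout — rows as source words, columns as target steps — is just this example's convention):

```python
import numpy as np

src = ["the", "cat", "sat"]
tgt = ["le", "chat", "assis"]

# Attention weights from the table: rows = source words, columns = target steps.
A = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

# Hard alignment: for each target word, take the most-attended source word.
for j, t in enumerate(tgt):
    i = int(np.argmax(A[:, j]))
    print(f"{t} ← {src[i]}")  # le ← the, chat ← cat, assis ← sat
```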
Non-Monotonic Alignment
For “I am hungry” → “J’ai faim” (French has different word order):
| | J'ai | faim |
|---|---|---|
| I | 0.7 | 0.2 |
| am | 0.2 | 0.1 |
| hungry | 0.1 | 0.7 |
Attention can reorder information — “hungry” maps to “faim” which comes in a different position. This is something the fixed context vector in vanilla seq2seq couldn’t do at all.
Key Equations
| Concept | Equation |
|---|---|
| Context vector | $c_t = \sum_i \alpha_{t,i} h_i$ |
| Attention weights | $\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}$ |
| Bahdanau score | $e_{t,i} = v^\top \tanh(W_s s_{t-1} + W_h h_i)$ |
| Luong dot score | $e_{t,i} = s_t^\top h_i$ |
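As a sanity check, the chain of key equations (dot scores → softmax weights → context) can be run end-to-end on toy random vectors; the names and dimensions here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 5))  # four encoder states h_i, each 5-dimensional
s = rng.normal(size=5)       # decoder state s_t

e = H @ s                    # Luong dot scores: e_i = s · h_i
alpha = np.exp(e - e.max())
alpha /= alpha.sum()         # softmax: weights are positive and sum to 1
c = alpha @ H                # context: c = Σ α_i h_i

print(alpha.round(3), c.shape)
```

Whatever the scores, the softmax guarantees a proper weighting, and the context keeps the encoder-state dimensionality.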
From Seq2Seq to Self-Attention
Seq2seq attention lets the decoder attend to the encoder. But what if we let every token attend to every other token in the same sequence? That’s self-attention — the core of the Transformer. Next: Self-Attention →