Seq2Seq Attention

Attention was born from a simple problem: how do you translate a long sentence without losing information? The answer changed everything about how neural networks process language.


The Encoder-Decoder Architecture

Many tasks involve transforming one sequence into another:

| Task | Input | Output |
|---|---|---|
| Translation | "Hello world" | "Bonjour monde" |
| Summarization | Long article | Short summary |
| Q&A | Question + context | Answer |

Vanilla Seq2Seq

Encoder:  x₁ → h₁ → h₂ → h₃ → [context vector c]

Decoder:                      c → s₁ → s₂ → s₃
                                  ↓    ↓    ↓
                                 y₁   y₂   y₃

The encoder processes the entire input and compresses it into a single fixed-size vector $c$. The decoder then generates the output using only this one vector.
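The compression step can be sketched in a few lines of NumPy. This is a toy RNN encoder with random weights, purely illustrative — the point is that only the final hidden state survives, no matter how long the input is:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # hidden size (toy; real models use e.g. 512)
W_x = rng.normal(size=(d, d)) * 0.1
W_h = rng.normal(size=(d, d)) * 0.1

def encode(inputs):
    """Run a toy RNN over the input; only the final state is kept."""
    h = np.zeros(d)
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)  # each step overwrites earlier information
    return h                            # the single fixed-size context vector c

tokens = rng.normal(size=(100, d))      # a 100-"word" input sequence
c = encode(tokens)
print(c.shape)                          # (8,) — the whole sentence in d floats
```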

The Bottleneck Problem

All information about the entire input must fit in one vector (typically 512 floats). For a 100-word sentence, this means massive information loss — especially for details at the start of the input, which are processed first and most likely to be forgotten.

| Input length | Information per word in context vector |
|---|---|
| 10 words | ~51 floats' worth |
| 50 words | ~10 floats' worth |
| 100 words | ~5 floats' worth |

This is like trying to summarize a novel in a single tweet — something has to give.
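The numbers in the table above are just the context size divided by the sentence length:

```python
context_size = 512  # floats in the context vector (typical early seq2seq size)
per_word = {n: round(context_size / n) for n in (10, 50, 100)}
print(per_word)  # {10: 51, 50: 10, 100: 5}
```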


Attention: The Solution

Key Insight

Instead of one context vector for the entire output, compute a different context for each decoder step by looking back at all encoder states:

$$c_t = \sum_{i=1}^{T_x} \alpha_{t,i} \cdot h_i$$

Where $\alpha_{t,i}$ is the attention weight: "how much should decoder step $t$ focus on encoder step $i$?"

Encoder states:     h₁    h₂    h₃    h₄
                     ↓     ↓     ↓     ↓
Attention weights:  0.1   0.6   0.2   0.1  (for decoder step t)
                     ↓     ↓     ↓     ↓
                    ────────────────────
                          context c_t

For translating "world" → "monde", attention focuses on $h_2$ (where "world" was encoded), not on all encoder states equally.
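The weighted sum in the diagram can be computed directly (toy vectors, using the weights shown above):

```python
import numpy as np

# Four encoder states h1..h4, each a 3-dim vector (toy values)
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
alpha = np.array([0.1, 0.6, 0.2, 0.1])  # attention weights for decoder step t

c_t = alpha @ H  # c_t = sum_i alpha_i * h_i
print(c_t)       # [0.2 0.7 0.3]
```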

🤔 Quick Check
What problem does attention solve in seq2seq models?

Interactive: Cross-Attention Alignment

Step through the decoder to see how attention shifts across source words at each generation step:

*(Interactive demo: an English → French example. At decoder step 1 of 7, generating "Le" places ~60% of the attention on "The", with the remainder spread over "cat", "sat", "on", "the", "mat".)*

💡 At each decoder step, the model **attends back** to the source sentence to find the most relevant words. Notice how "Le" aligns to "The" — the semantically corresponding word.

Attention Mechanisms

Bahdanau (Additive) Attention

The original attention mechanism (Bahdanau et al., 2014) uses a small neural network to compute compatibility:

$$e_{t,i} = v^T \tanh(W_s s_{t-1} + W_h h_i)$$

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}$$

$$c_t = \sum_i \alpha_{t,i} \cdot h_i$$

| Term | What it does |
|---|---|
| $s_{t-1}$ | Current decoder state ("what am I looking for?") |
| $h_i$ | Each encoder state ("what information is here?") |
| $e_{t,i}$ | Compatibility score between decoder query and encoder key |
| $\alpha_{t,i}$ | Softmax-normalized weight (how much to attend) |
| $c_t$ | Weighted sum of encoder states (the context for this step) |

Luong (Multiplicative) Attention

A simpler variant (Luong et al., 2015) uses dot products instead of a neural network:

| Variant | Score function | Notes |
|---|---|---|
| Dot | $e_{t,i} = s_t^T h_i$ | Simplest; requires same dimensions |
| General | $e_{t,i} = s_t^T W h_i$ | Learnable $W$ allows different dimensions |
| Concat | $e_{t,i} = v^T \tanh(W[s_t; h_i])$ | Same form as Bahdanau |

The dot product variant became the basis for self-attention in Transformers.

✍️ Fill in the Blanks
In attention, the decoder state acts as a ____, the encoder states act as keys, and the context vector is a weighted sum of the encoder states.
Both mechanisms in NumPy, with an explicit softmax helper:

import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def bahdanau_attention(s_prev, encoder_outputs, Ws, Wh, v):
    """Bahdanau (additive) attention.

    s_prev: previous decoder state, shape (d_dec,)
    encoder_outputs: encoder states, shape (src_len, d_enc)
    Ws, Wh, v: learned parameters of the scoring MLP
    """
    src_len = encoder_outputs.shape[0]
    scores = np.zeros(src_len)
    for i in range(src_len):
        # Additive score: e_{t,i} = v^T tanh(Ws s_{t-1} + Wh h_i)
        combined = np.tanh(Ws @ s_prev + Wh @ encoder_outputs[i])
        scores[i] = v @ combined
    weights = softmax(scores)            # alpha_{t,i}
    context = weights @ encoder_outputs  # c_t = sum_i alpha_{t,i} h_i
    return context, weights

def luong_attention(s_curr, encoder_outputs, W=None):
    """Luong (multiplicative) attention."""
    if W is not None:
        scores = encoder_outputs @ W @ s_curr  # General: s^T W h
    else:
        scores = encoder_outputs @ s_curr      # Dot: s^T h
    weights = softmax(scores)
    context = weights @ encoder_outputs
    return context, weights
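A quick smoke test of the dot-product variant, inlined here so the snippet is self-contained (toy shapes, random values — the useful checks are that the weights form a probability distribution and the context has the encoder's hidden size):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))  # 5 encoder states, hidden size 4
s = rng.normal(size=4)       # decoder state, same dimension (dot variant requires this)

scores = H @ s               # e_{t,i} = s^T h_i for every i at once
weights = softmax(scores)
context = weights @ H

assert np.isclose(weights.sum(), 1.0)  # attention weights sum to 1
print(context.shape)                   # (4,)
```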

Attention Alignments

What Attention Learns

For translation "the cat sat" → "le chat assis":

|  | le | chat | assis |
|---|---|---|---|
| the | 0.8 | 0.1 | 0.1 |
| cat | 0.1 | 0.8 | 0.1 |
| sat | 0.1 | 0.1 | 0.8 |

Attention learns to align source and target words — without any explicit alignment supervision!

Non-Monotonic Alignment

For "I am hungry" → "J'ai faim" (French has a different word order):

|  | J'ai | faim |
|---|---|---|
| I | 0.7 | 0.2 |
| am | 0.2 | 0.1 |
| hungry | 0.1 | 0.7 |

Attention can reorder information — “hungry” maps to “faim” which comes in a different position. This is something the fixed context vector in vanilla seq2seq couldn’t do at all.
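Reading off the hard alignment from a soft attention matrix is just an argmax over each target word's column, sketched here with the toy values from the table above:

```python
import numpy as np

src = ["I", "am", "hungry"]
tgt = ["J'ai", "faim"]
# attn[i, j] = weight that target word j places on source word i
attn = np.array([[0.7, 0.2],
                 [0.2, 0.1],
                 [0.1, 0.7]])

# Hard alignment: for each target word, pick the source word with max attention
pairs = [(t, src[int(np.argmax(attn[:, j]))]) for j, t in enumerate(tgt)]
print(pairs)  # [("J'ai", 'I'), ('faim', 'hungry')]
```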

🤔 Quick Check
In a French-to-English translation, the attention weights form a roughly diagonal pattern. What does this tell you?

Key Equations

| Concept | Equation |
|---|---|
| Context vector | $c_t = \sum_{i=1}^{T_x} \alpha_{t,i} \cdot h_i$ |
| Attention weights | $\alpha_{t,i} = \text{softmax}(e_{t,i})$ |
| Bahdanau score | $e_{t,i} = v^T \tanh(W_s s_{t-1} + W_h h_i)$ |
| Luong dot score | $e_{t,i} = s_t^T h_i$ |

From Seq2Seq to Self-Attention

Seq2seq attention lets the decoder attend to the encoder. But what if we let every token attend to every other token in the same sequence? That’s self-attention — the core of the Transformer. Next: Self-Attention →