Seq2Seq Attention

Attention was born from a simple problem: how do you translate a long sentence without losing information? The answer changed everything about how neural networks process language.


The Encoder-Decoder Architecture

Many tasks involve transforming one sequence into another:

| Task | Input | Output |
|---|---|---|
| Translation | "Hello world" | "Bonjour monde" |
| Summarization | Long article | Short summary |
| Q&A | Question + context | Answer |

Vanilla Seq2Seq

Encoder:  x₁ → h₁ → h₂ → h₃ → [context vector c]

Decoder:                      c → s₁ → s₂ → s₃
                                  ↓    ↓    ↓
                                 y₁   y₂   y₃

The encoder processes the entire input and compresses it into a single fixed-size vector $c$. The decoder then generates the output using only this one vector.
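The compression step can be sketched in a few lines of NumPy. This is a toy RNN encoder with random weights, purely illustrative — the point is that only the final hidden state survives, no matter how long the input is:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # hidden size (toy; real models use e.g. 512)
W_x = rng.normal(size=(d, d)) * 0.1
W_h = rng.normal(size=(d, d)) * 0.1

def encode(inputs):
    """Run a toy RNN over the input; only the final state is kept."""
    h = np.zeros(d)
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)  # each step overwrites earlier information
    return h                            # the single fixed-size context vector c

tokens = rng.normal(size=(100, d))      # a 100-"word" input sequence
c = encode(tokens)
print(c.shape)                          # (8,) — the whole sentence in d floats
```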

The Bottleneck Problem

All information about the entire input must fit in one vector (typically 512 floats). For a 100-word sentence, this means massive information loss — especially for details at the start of the input, which are processed first and most likely to be forgotten.

| Input length | Information per word in context vector |
|---|---|
| 10 words | ~51 floats' worth |
| 50 words | ~10 floats' worth |
| 100 words | ~5 floats' worth |

This is like trying to summarize a novel in a single tweet — something has to give.
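The numbers in the table above are just the context size divided by the sentence length:

```python
context_size = 512  # floats in the context vector (typical early seq2seq size)
per_word = {n: round(context_size / n) for n in (10, 50, 100)}
print(per_word)  # {10: 51, 50: 10, 100: 5}
```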


Attention: The Solution

Key Insight

Instead of one context vector for the entire output, compute a different context for each decoder step by looking back at all encoder states:

$$c_t = \sum_{i=1}^{T_x} \alpha_{t,i} \cdot h_i$$

Where $\alpha_{t,i}$ is the attention weight: "how much should decoder step $t$ focus on encoder step $i$?"

Encoder states:     h₁    h₂    h₃    h₄
                     ↓     ↓     ↓     ↓
Attention weights:  0.1   0.6   0.2   0.1  (for decoder step t)
                     ↓     ↓     ↓     ↓
                    ────────────────────
                          context c_t

For translating "world" → "monde", attention focuses on $h_2$ (where "world" was encoded), not on all encoder states equally.
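The weighted sum in the diagram can be computed directly (toy vectors, using the weights shown above):

```python
import numpy as np

# Four encoder states h1..h4, each a 3-dim vector (toy values)
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
alpha = np.array([0.1, 0.6, 0.2, 0.1])  # attention weights for decoder step t

c_t = alpha @ H  # c_t = sum_i alpha_i * h_i
print(c_t)       # [0.2 0.7 0.3]
```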

🤔 Quick Check
What problem does attention solve in seq2seq models?

Interactive: Cross-Attention Alignment

Step through the decoder to see how attention shifts across source words at each generation step:

*(Interactive demo: an English → French example. At decoder step 1 of 7, generating "Le" places ~60% of the attention on "The", with the remainder spread over "cat", "sat", "on", "the", "mat".)*

💡 At each decoder step, the model **attends back** to the source sentence to find the most relevant words. Notice how "Le" aligns to "The" — the semantically corresponding word.

Attention Mechanisms

Bahdanau (Additive) Attention

The original attention mechanism (Bahdanau et al., 2014) uses a small neural network to compute compatibility:

$$e_{t,i} = v^T \tanh(W_s s_{t-1} + W_h h_i)$$

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}$$

$$c_t = \sum_i \alpha_{t,i} \cdot h_i$$

| Term | What it does |
|---|---|
| $s_{t-1}$ | Current decoder state ("what am I looking for?") |
| $h_i$ | Each encoder state ("what information is here?") |
| $e_{t,i}$ | Compatibility score between decoder query and encoder key |
| $\alpha_{t,i}$ | Softmax-normalized weight (how much to attend) |
| $c_t$ | Weighted sum of encoder states (the context for this step) |

Luong (Multiplicative) Attention

A simpler variant (Luong et al., 2015) uses dot products instead of a neural network:

| Variant | Score function | Notes |
|---|---|---|
| Dot | $e_{t,i} = s_t^T h_i$ | Simplest; requires same dimensions |
| General | $e_{t,i} = s_t^T W h_i$ | Learnable $W$ allows different dimensions |
| Concat | $e_{t,i} = v^T \tanh(W[s_t; h_i])$ | Same form as Bahdanau |

The dot product variant became the basis for self-attention in Transformers.

✍️ Fill in the Blanks
In attention, the decoder state acts as a ____, the encoder states act as keys, and the context vector is a weighted sum of the encoder states.
Both mechanisms in NumPy, with an explicit softmax helper:

import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def bahdanau_attention(s_prev, encoder_outputs, Ws, Wh, v):
    """Bahdanau (additive) attention.

    s_prev: previous decoder state, shape (d_dec,)
    encoder_outputs: encoder states, shape (src_len, d_enc)
    Ws, Wh, v: learned parameters of the scoring MLP
    """
    src_len = encoder_outputs.shape[0]
    scores = np.zeros(src_len)
    for i in range(src_len):
        # Additive score: e_{t,i} = v^T tanh(Ws s_{t-1} + Wh h_i)
        combined = np.tanh(Ws @ s_prev + Wh @ encoder_outputs[i])
        scores[i] = v @ combined
    weights = softmax(scores)            # alpha_{t,i}
    context = weights @ encoder_outputs  # c_t = sum_i alpha_{t,i} h_i
    return context, weights

def luong_attention(s_curr, encoder_outputs, W=None):
    """Luong (multiplicative) attention."""
    if W is not None:
        scores = encoder_outputs @ W @ s_curr  # General: s^T W h
    else:
        scores = encoder_outputs @ s_curr      # Dot: s^T h
    weights = softmax(scores)
    context = weights @ encoder_outputs
    return context, weights
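A quick smoke test of the dot-product variant, inlined here so the snippet is self-contained (toy shapes, random values — the useful checks are that the weights form a probability distribution and the context has the encoder's hidden size):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))  # 5 encoder states, hidden size 4
s = rng.normal(size=4)       # decoder state, same dimension (dot variant requires this)

scores = H @ s               # e_{t,i} = s^T h_i for every i at once
weights = softmax(scores)
context = weights @ H

assert np.isclose(weights.sum(), 1.0)  # attention weights sum to 1
print(context.shape)                   # (4,)
```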

Attention Alignments

What Attention Learns

For translation "the cat sat" → "le chat assis":

|  | le | chat | assis |
|---|---|---|---|
| the | 0.8 | 0.1 | 0.1 |
| cat | 0.1 | 0.8 | 0.1 |
| sat | 0.1 | 0.1 | 0.8 |

Attention learns to align source and target words — without any explicit alignment supervision!

Non-Monotonic Alignment

For "I am hungry" → "J'ai faim" (French has a different word order):

|  | J'ai | faim |
|---|---|---|
| I | 0.7 | 0.2 |
| am | 0.2 | 0.1 |
| hungry | 0.1 | 0.7 |

Attention can reorder information — “hungry” maps to “faim” which comes in a different position. This is something the fixed context vector in vanilla seq2seq couldn’t do at all.
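Reading off the hard alignment from a soft attention matrix is just an argmax over each target word's column, sketched here with the toy values from the table above:

```python
import numpy as np

src = ["I", "am", "hungry"]
tgt = ["J'ai", "faim"]
# attn[i, j] = weight that target word j places on source word i
attn = np.array([[0.7, 0.2],
                 [0.2, 0.1],
                 [0.1, 0.7]])

# Hard alignment: for each target word, pick the source word with max attention
pairs = [(t, src[int(np.argmax(attn[:, j]))]) for j, t in enumerate(tgt)]
print(pairs)  # [("J'ai", 'I'), ('faim', 'hungry')]
```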

🤔 Quick Check
In a French-to-English translation, the attention weights form a roughly diagonal pattern. What does this tell you?

Key Equations

| Concept | Equation |
|---|---|
| Context vector | $c_t = \sum_{i=1}^{T_x} \alpha_{t,i} \cdot h_i$ |
| Attention weights | $\alpha_{t,i} = \text{softmax}(e_{t,i})$ |
| Bahdanau score | $e_{t,i} = v^T \tanh(W_s s_{t-1} + W_h h_i)$ |
| Luong dot score | $e_{t,i} = s_t^T h_i$ |

From Seq2Seq to Self-Attention

Seq2seq attention lets the decoder attend to the encoder. But what if we let every token attend to every other token in the same sequence? That’s self-attention — the core of the Transformer. Next: Self-Attention →