Self-Attention
The Problem: What Does “It” Mean?
Read this sentence:
“The cat sat on the mat because it was tired.”
What does “it” refer to? Obviously the cat — not the mat. You know this because you understand the meaning of “tired” and can connect it back to a living thing.
Now consider:
“The cat sat on the mat because it was dirty.”
Now “it” refers to the mat! Same sentence structure, different meaning based on context.
For a neural network to understand language, it needs a mechanism to figure out which other words are relevant to each word. That mechanism is self-attention.
The Key Insight
Self-attention lets every token in a sequence “look at” every other token and decide:
- “Who should I pay attention to?”
- “How much should I pay attention to them?”
- “What information should I take from them?”
The Q, K, V Framework
Self-attention uses three concepts borrowed from information retrieval:
The Library Analogy
Imagine you’re in a library:
- Query (Q): Your question — “I need information about sleeping animals”
- Keys (K): Each book’s title/tags — “Cat Behavior”, “Mat Materials”, “Animal Sleep”
- Values (V): Each book’s actual content — the detailed information inside
You match your query against all keys to find the most relevant books, then read the content (values) of the matching ones.
Self-attention works the same way:
- Each token generates a Query: “What am I looking for?”
- Each token generates a Key: “What information do I have?”
- Each token generates a Value: “What will I contribute if attended to?”
Why Three Separate Projections?
Why not just use the raw embeddings?
Without projections, the dot product of a vector with itself is always the largest — every token would attend mostly to itself. Separate Q, K, V projections let the model learn:
- What aspects to query for (Q)
- What aspects to advertise (K)
- What information to provide (V)
These can be completely different! A word might query for “what noun am I modifying?” (Q) while advertising “I’m an adjective” (K) and providing “here’s my semantic meaning” (V).
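As a concrete sketch of the three projections — sizes and weight matrices here are illustrative random stand-ins, not learned values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                    # illustrative sizes
X = rng.standard_normal((3, d_model))  # embeddings for 3 tokens

# Three separately learned matrices (random stand-ins here)
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

# Each token gets its own query, key, and value
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)       # (3, 4) each
```

Because W_Q and W_K are different matrices, a token's query can point in a completely different direction than its key — which is exactly what lets a word query for one thing while advertising another.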
Step-by-Step: The Math with Real Numbers
Let’s trace through self-attention with actual numbers, one computation step at a time.
The Formula
The complete self-attention operation in one line:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Let’s break this down:

| Step | Operation | What It Does |
|---|---|---|
| 1 | Q = XW_Q | Project each token to a query vector |
| 2 | K = XW_K | Project each token to a key vector |
| 3 | V = XW_V | Project each token to a value vector |
| 4 | QKᵀ | Compute similarity scores between all pairs |
| 5 | ÷ √d_k | Scale scores to prevent gradient vanishing |
| 6 | softmax | Convert scores to weights (sum to 1 per row) |
| 7 | · V | Weighted combination of value vectors |
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """
    Args:
        Q: Queries (seq_len, d_k)
        K: Keys (seq_len, d_k)
        V: Values (seq_len, d_v)
    Returns:
        output: (seq_len, d_v)
        weights: (seq_len, seq_len) attention matrix
    """
    d_k = Q.shape[-1]
    # Step 1: Compute dot-product scores
    scores = Q @ K.T                    # (seq_len, seq_len)
    # Step 2: Scale
    scores = scores / np.sqrt(d_k)
    # Step 3: Softmax each row
    weights = softmax(scores, axis=-1)  # each row sums to 1
    # Step 4: Weighted sum of values
    output = weights @ V                # (seq_len, d_v)
    return output, weights
```

Why √d_k Scaling?
This is one of the most common “why?” questions about attention. Here’s the intuition:
The Problem
When d_k is large, dot products have large variance:
If each qᵢ and kᵢ has mean 0 and variance 1, then q · k = Σᵢ qᵢkᵢ has variance d_k.
For d_k = 512: standard deviation √512 ≈ 22.6. Scores can easily be ±20 or more.
The Consequence
Large scores push softmax into saturation — outputs become nearly one-hot:
| | Without Scaling | With Scaling (÷ √d_k = 8, d_k = 64) |
|---|---|---|
| Raw scores | [30, 25, -10, 5] | [3.75, 3.125, -1.25, 0.625] |
| After softmax | [0.993, 0.007, 0.000, 0.000] | [0.63, 0.34, 0.00, 0.03] |
| Attention | Almost all on one token ❌ | Spread across tokens ✅ |
| Gradients | Near-zero (vanishing) ❌ | Healthy range ✅ |
Saturated softmax → vanishing gradients → network can’t learn. Scaling by √d_k keeps the variance of the scores at 1, and softmax stays in its healthy range.
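You can verify the saturation effect numerically. A quick sketch using the scores from the table above, with softmax defined inline:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

scores = np.array([30.0, 25.0, -10.0, 5.0])  # raw scores from the table
d_k = 64                                     # so sqrt(d_k) = 8

raw = softmax(scores)                 # nearly one-hot
scaled = softmax(scores / np.sqrt(d_k))  # spread across tokens
print(raw.round(3))
print(scaled.round(3))
```

The unscaled distribution puts over 99% of the weight on a single token; the scaled one keeps multiple tokens in play.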
Interactive: Explore Attention Patterns
Play with the attention visualizer below. Try different sentences and see how tokens attend to each other:
Things to try:
- Toggle Causal Mask to see how decoder-style attention works (each token can only see previous tokens)
- Increase temperature to see how it softens the attention distribution
- Click a token to see its complete attention pattern
- Try sentences with ambiguous words like “bank” to see context effects
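The temperature control corresponds to dividing the scores by a constant T before the softmax. A minimal sketch (the scores here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([3.0, 1.0, 0.0])

sharp = softmax(scores / 0.5)  # low temperature: sharper, more peaked
base  = softmax(scores / 1.0)  # standard softmax
soft  = softmax(scores / 2.0)  # high temperature: flatter distribution
print(sharp.round(3), base.round(3), soft.round(3))
```

Higher temperature flattens the distribution; lower temperature concentrates it — the same mechanism as the √d_k scaling, just exposed as a knob.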
Causal (Masked) Self-Attention
For language generation (GPT, LLaMA), we need a critical constraint: a token at position i should only attend to positions j ≤ i. It can’t see the future!
The Mask
We add −∞ to blocked positions before softmax, so they get zero attention weight:
| | "The" | "cat" | "sat" | "on" |
|---|---|---|---|---|
| "The" | ✅ 0 | ❌ −∞ | ❌ −∞ | ❌ −∞ |
| "cat" | ✅ 0 | ✅ 0 | ❌ −∞ | ❌ −∞ |
| "sat" | ✅ 0 | ✅ 0 | ✅ 0 | ❌ −∞ |
| "on" | ✅ 0 | ✅ 0 | ✅ 0 | ✅ 0 |
Each row shows what a token can see: ✅ = visible, ❌ = blocked (future).
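A quick sketch of the table above: adding −∞ above the diagonal and applying a row-wise softmax produces exactly this visibility pattern (uniform scores are used so the effect is easy to read):

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.ones((seq_len, seq_len))           # uniform scores for clarity
mask = np.triu(np.ones((seq_len, seq_len)), k=1)
scores = np.where(mask == 1, -np.inf, scores)  # block future positions

weights = softmax_rows(scores)  # exp(-inf) = 0, so blocked cells get weight 0
print(weights.round(2))
```

Row 0 attends only to itself; row 3 spreads its attention evenly over all four visible tokens.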
When to Mask
| Architecture | Masking | Why |
|---|---|---|
| BERT (encoder) | No mask | Understanding needs full context |
| GPT (decoder) | Causal mask | Can’t see future during generation |
| T5 encoder | No mask | Input is fully visible |
| T5 decoder | Causal mask | Output is generated left-to-right |
```python
import numpy as np

def causal_mask(seq_len):
    # Upper triangle = blocked (-infinity so softmax gives 0)
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(mask == 1, -np.inf, 0.0)
```

Self-Attention vs. Cross-Attention
| Aspect | Self-Attention | Cross-Attention |
|---|---|---|
| Q, K, V source | All from same sequence | Q from one, K/V from another |
| Used in | Every transformer block | Encoder-decoder models only |
| Purpose | Tokens communicate within a sequence | Decoder reads from encoder |
| Example | ”cat” attends to “sat” in same sentence | French “chat” attends to English “cat” |
Self-attention is the fundamental building block. Cross-attention is an extension used only in encoder-decoder architectures.
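A minimal cross-attention sketch, assuming illustrative shapes — 3 decoder tokens querying 5 encoder tokens, with random vectors standing in for real hidden states:

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_k = 4
rng = np.random.default_rng(0)
decoder_states = rng.standard_normal((3, d_k))  # e.g. 3 French tokens (source of Q)
encoder_states = rng.standard_normal((5, d_k))  # e.g. 5 English tokens (source of K, V)

# Q comes from the decoder; K and V come from the encoder
weights = softmax_rows(decoder_states @ encoder_states.T / np.sqrt(d_k))
output = weights @ encoder_states

print(weights.shape)  # (3, 5): each decoder token attends over encoder tokens
print(output.shape)   # (3, 4)
```

The attention matrix is rectangular — (decoder length × encoder length) — which is the structural signature of cross-attention.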
Variance of Dot Products
Let qᵢ and kᵢ be i.i.d. with E[qᵢ] = E[kᵢ] = 0 and Var(qᵢ) = Var(kᵢ) = 1.
The dot product q · k = Σᵢ₌₁^{d_k} qᵢkᵢ is a sum of d_k independent random variables.
For each term: E[qᵢkᵢ] = E[qᵢ]E[kᵢ] = 0 and Var(qᵢkᵢ) = E[qᵢ²]E[kᵢ²] = 1.
By independence: Var(q · k) = Σᵢ Var(qᵢkᵢ) = d_k.
Dividing by √d_k: Var(q · k / √d_k) = d_k / (√d_k)² = 1.
The scores now have unit variance regardless of d_k, keeping softmax in its sensitive range.
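The derivation checks out empirically. A sketch using sampled unit-variance vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    q = rng.standard_normal((20_000, d_k))  # 20k sampled query vectors
    k = rng.standard_normal((20_000, d_k))  # 20k sampled key vectors
    dots = (q * k).sum(axis=-1)             # 20k dot products
    # Variance grows like d_k; dividing by sqrt(d_k) restores ~1
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 2))
```

The unscaled variance tracks d_k almost exactly, while the scaled variance stays pinned near 1 for every dimension.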
Key Takeaways
- Self-attention lets every token gather information from every other token
- Q, K, V projections let the model learn different roles for finding and providing information
- √d_k scaling prevents softmax saturation and gradient vanishing
- Causal masking prevents tokens from seeing the future (essential for generation)
- Self-attention is O(n²) in sequence length — this is both its power and its limitation
Next: Multi-Head Attention
A single attention head gives one perspective. But language is multi-faceted — we need multiple perspectives simultaneously. That’s Multi-Head Attention →