Self-Attention

The Problem: What Does “It” Mean?

Read this sentence:

“The cat sat on the mat because it was tired.”

What does “it” refer to? Obviously the cat — not the mat. You know this because you understand the meaning of “tired” and can connect it back to a living thing.

Now consider:

“The cat sat on the mat because it was dirty.”

Now “it” refers to the mat! Same sentence structure, different meaning based on context.

For a neural network to understand language, it needs a mechanism to figure out which other words are relevant to each word. That mechanism is self-attention.

The Key Insight

Self-attention lets every token in a sequence “look at” every other token and decide:

  • “Who should I pay attention to?”
  • “How much should I pay attention to them?”
  • “What information should I take from them?”

The Q, K, V Framework

Self-attention uses three concepts borrowed from information retrieval:

The Library Analogy

Imagine you’re in a library:

  1. Query (Q): Your question — “I need information about sleeping animals”
  2. Keys (K): Each book’s title/tags — “Cat Behavior”, “Mat Materials”, “Animal Sleep”
  3. Values (V): Each book’s actual content — the detailed information inside

You match your query against all keys to find the most relevant books, then read the content (values) of the matching ones.

Self-attention works the same way:

  • Each token generates a Query: “What am I looking for?”
  • Each token generates a Key: “What information do I have?”
  • Each token generates a Value: “What will I contribute if attended to?”

Why Three Separate Projections?

Why not just use the raw embeddings?

Without projections, a token's dot product with itself tends to dominate its row of scores — every token would attend mostly to itself. Separate Q, K, V projections let the model learn:

  • What aspects to query for (Q)
  • What aspects to advertise (K)
  • What information to provide (V)

These can be completely different! A word might query for “what noun am I modifying?” (Q) while advertising “I’m an adjective” (K) and providing “here’s my semantic meaning” (V).
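The effect is easy to see numerically. Below is a minimal sketch in which random matrices stand in for the learned weights `W_Q`, `W_K`, `W_V` (all names and sizes are illustrative): each token gets its own query, key, and value, while the unprojected score matrix is dominated by each token's similarity to itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4

# Stand-ins for learned projection matrices (random here, for illustration)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

x = rng.normal(size=(3, d_model))     # three token embeddings

Q, K, V = x @ W_Q, x @ W_K, x @ W_V   # each token gets its own q, k, v

# Without projections, token i's self-score x_i . x_i equals ||x_i||^2,
# which tends to be the largest entry in its row of the score matrix.
raw_scores = x @ x.T
```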

🤔 Quick Check
Why does self-attention use separate Q, K, V projections instead of using the raw token embeddings directly?

Step-by-Step: The Math with Real Numbers

Let’s trace through self-attention with actual numbers:

Self-Attention Step by Step

Step 1/6: Input Embeddings (X)

Input X (3 × 4):

| Token | dim 0 | dim 1 | dim 2 | dim 3 |
|---|---|---|---|---|
| "I" | -1.00 | 0.74 | 0.37 | -0.98 |
| "love" | -0.34 | -0.63 | -0.87 | 0.06 |
| "cats" | 0.83 | 0.02 | 0.14 | 0.47 |

The Formula

The complete self-attention operation in one line:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Let’s break this down:

| Step | Operation | What It Does |
|---|---|---|
| 1 | $Q = XW^Q$ | Project each token to a query vector |
| 2 | $K = XW^K$ | Project each token to a key vector |
| 3 | $V = XW^V$ | Project each token to a value vector |
| 4 | $QK^T$ | Compute similarity scores between all pairs |
| 5 | $\div \sqrt{d_k}$ | Scale scores to prevent gradient vanishing |
| 6 | softmax | Convert scores to weights (sum to 1 per row) |
| 7 | $\times V$ | Weighted combination of value vectors |
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """
    Args:
        Q: Queries (seq_len, d_k)
        K: Keys    (seq_len, d_k)
        V: Values  (seq_len, d_v)
    Returns:
        output: (seq_len, d_v)
        weights: (seq_len, seq_len) attention matrix
    """
    d_k = Q.shape[-1]

    # Step 1: Compute dot product scores
    scores = Q @ K.T                    # (seq_len, seq_len)

    # Step 2: Scale
    scores = scores / np.sqrt(d_k)

    # Step 3: Softmax each row
    weights = softmax(scores, axis=-1)  # Each row sums to 1

    # Step 4: Weighted sum of values
    output = weights @ V                # (seq_len, d_v)

    return output, weights
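As a quick sanity check, the same computation can be run standalone (softmax inlined, random toy inputs) to confirm that every row of the attention matrix is a probability distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # toy queries, keys, values
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (3, 3) scaled similarity scores
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)      # softmax, row-wise
output = weights @ V                             # (3, 4) weighted values

print(np.allclose(weights.sum(axis=-1), 1.0))    # True: each row sums to 1
print(output.shape)                              # (3, 4)
```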

Why √d_k Scaling?

This is one of the most common “why?” questions about attention. Here’s the intuition:

The Problem

When $d_k$ is large, dot products have large variance:

$$Q \cdot K = \sum_{i=1}^{d_k} q_i k_i$$

If each $q_i, k_i \sim N(0, 1)$, then $Q \cdot K$ has variance $d_k$.

For $d_k = 64$: standard deviation $= \sqrt{64} = 8$. Scores can easily be ±20 or more.

The Consequence

Large scores push softmax into saturation — outputs become nearly one-hot:

| | Without Scaling ($d_k = 64$) | With Scaling ($\div \sqrt{64} = 8$) |
|---|---|---|
| Raw scores | [30, 25, -10, 5] | [3.75, 3.125, -1.25, 0.625] |
| After softmax | [0.993, 0.007, 0.000, 0.000] | [0.63, 0.34, 0.00, 0.03] |
| Attention | Almost all on one token ❌ | Spread across tokens ✅ |
| Gradients | Near-zero (vanishing) ❌ | Healthy range ✅ |

Saturated softmax → vanishing gradients → network can’t learn. Scaling by $\sqrt{d_k}$ keeps the variance at 1, and softmax stays in its healthy range.
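A tiny experiment makes the saturation visible (score values assumed for illustration, softmax inlined):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([30.0, 25.0, -10.0, 5.0])     # plausible raw scores at d_k = 64

print(softmax(scores).round(3))                 # nearly one-hot
print(softmax(scores / np.sqrt(64)).round(2))   # spread across several tokens
```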

🤔 Quick Check
What happens to the attention distribution if we DON'T scale by √d_k and d_k is large?

Interactive: Explore Attention Patterns

Play with the attention visualizer below. Try different sentences and see how tokens attend to each other:

Self-Attention Visualizer

[Interactive widget: click a token to see its attention distribution over all keys, computed as Attention(Q, K, V) = softmax(QKᵀ / √d_k)V]

Things to try:

  • Toggle Causal Mask to see how decoder-style attention works (each token can only see previous tokens)
  • Increase temperature to see how it softens the attention distribution
  • Click a token to see its complete attention pattern
  • Try sentences with ambiguous words like “bank” to see context effects

Causal (Masked) Self-Attention

For language generation (GPT, LLaMA), we need a critical constraint: a token at position $i$ should only attend to positions $0, 1, \ldots, i$. It can’t see the future!

The Mask

We add $-\infty$ to blocked positions before softmax, so they get zero attention weight:

| Q \ K | "The" | "cat" | "sat" | "on" |
|---|---|---|---|---|
| "The" | ✅ 0 | $-\infty$ | $-\infty$ | $-\infty$ |
| "cat" | ✅ 0 | ✅ 0 | $-\infty$ | $-\infty$ |
| "sat" | ✅ 0 | ✅ 0 | ✅ 0 | $-\infty$ |
| "on" | ✅ 0 | ✅ 0 | ✅ 0 | ✅ 0 |

Each row shows what a token can see: ✅ 0 = visible (the mask adds 0), $-\infty$ = blocked (future position).

When to Mask

| Architecture | Masking | Why |
|---|---|---|
| BERT (encoder) | No mask | Understanding needs full context |
| GPT (decoder) | Causal mask | Can’t see future during generation |
| T5 encoder | No mask | Input is fully visible |
| T5 decoder | Causal mask | Output is generated left-to-right |

✍️ Fill in the Blanks
In causal self-attention, position i can attend to positions 0 through ____, preventing the model from seeing ____ tokens.
import numpy as np

def causal_mask(seq_len):
    # Upper triangle = blocked (-infinity so softmax gives 0)
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(mask == 1, -np.inf, 0.0)
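A standalone check (restating `causal_mask` so the snippet runs on its own) confirms that masked positions receive exactly zero weight after softmax:

```python
import numpy as np

def causal_mask(seq_len):
    # Upper triangle = blocked (-infinity so softmax gives 0)
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(mask == 1, -np.inf, 0.0)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))                 # unmasked attention scores
masked = scores + causal_mask(4)                 # future positions -> -inf

e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)      # softmax, row-wise

print(np.triu(weights, k=1).max())               # 0.0: no attention to the future
```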

Self-Attention vs. Cross-Attention

| Aspect | Self-Attention | Cross-Attention |
|---|---|---|
| Q, K, V source | All from same sequence | Q from one, K/V from another |
| Used in | Every transformer block | Encoder-decoder models only |
| Purpose | Tokens communicate within a sequence | Decoder reads from encoder |
| Example | "cat" attends to "sat" in same sentence | French "chat" attends to English "cat" |

Self-attention is the fundamental building block. Cross-attention is an extension used only in encoder-decoder architectures.
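The only mechanical difference is where Q, K, and V come from. A minimal sketch (projections omitted and raw vectors used directly, which is a simplification; sequence lengths are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
enc = rng.normal(size=(5, d))    # encoder output: 5 source tokens (e.g. English)
dec = rng.normal(size=(3, d))    # decoder states: 3 target tokens (e.g. French)

# Cross-attention: queries from the decoder, keys/values from the encoder
Q, K, V = dec, enc, enc
scores = Q @ K.T / np.sqrt(d)    # (3, 5): each target token scores every source token
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
output = weights @ V             # (3, 4): each decoder token reads from the encoder

print(weights.shape, output.shape)
```

Note that the attention matrix is no longer square: its shape is (target length, source length).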


Variance of Dot Products

Let $q_i, k_i$ be i.i.d. with $E[q_i] = E[k_i] = 0$ and $\text{Var}(q_i) = \text{Var}(k_i) = 1$.

The dot product $s = \sum_{i=1}^{d_k} q_i k_i$ is a sum of $d_k$ independent random variables.

For each term: $E[q_i k_i] = E[q_i]E[k_i] = 0$ and $\text{Var}(q_i k_i) = E[q_i^2 k_i^2] - (E[q_i k_i])^2 = 1 \cdot 1 - 0 = 1$.

By independence: $\text{Var}(s) = d_k \cdot 1 = d_k$.

Dividing by $\sqrt{d_k}$: $\text{Var}(s / \sqrt{d_k}) = d_k / d_k = 1$.

The scores now have unit variance regardless of $d_k$, keeping softmax in its sensitive range.
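The derivation can be verified by simulation; the sketch below draws random unit-variance q and k vectors and estimates the two variances empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 64, 100_000

q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))
s = (q * k).sum(axis=1)          # n independent dot products

print(s.var())                   # ≈ 64 (= d_k)
print((s / np.sqrt(d_k)).var())  # ≈ 1 after scaling
```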


Key Takeaways

  1. Self-attention lets every token gather information from every other token
  2. Q, K, V projections let the model learn different roles for finding and providing information
  3. √d_k scaling prevents softmax saturation and gradient vanishing
  4. Causal masking prevents tokens from seeing the future (essential for generation)
  5. Self-attention is O(n²) in sequence length — this is both its power and its limitation

Next: Multi-Head Attention

A single attention head gives one perspective. But language is multi-faceted — we need multiple perspectives simultaneously. That’s Multi-Head Attention →