Self-Attention

The Problem: What Does “It” Mean?

Read this sentence:

“The cat sat on the mat because it was tired.”

What does “it” refer to? Obviously the cat — not the mat. You know this because you understand the meaning of “tired” and can connect it back to a living thing.

Now consider:

“The cat sat on the mat because it was dirty.”

Now “it” refers to the mat! Same sentence structure, different meaning based on context.

For a neural network to understand language, it needs a mechanism to figure out which other words are relevant to each word. That mechanism is self-attention.

The Key Insight

Self-attention lets every token in a sequence “look at” every other token and decide:

  • “Who should I pay attention to?”
  • “How much should I pay attention to them?”
  • “What information should I take from them?”

The Q, K, V Framework

Self-attention uses three concepts borrowed from information retrieval:

The Library Analogy

Imagine you’re in a library:

  1. Query (Q): Your question — “I need information about sleeping animals”
  2. Keys (K): Each book’s title/tags — “Cat Behavior”, “Mat Materials”, “Animal Sleep”
  3. Values (V): Each book’s actual content — the detailed information inside

You match your query against all keys to find the most relevant books, then read the content (values) of the matching ones.

Self-attention works the same way:

  • Each token generates a Query: “What am I looking for?”
  • Each token generates a Key: “What information do I have?”
  • Each token generates a Value: “What will I contribute if attended to?”

Why Three Separate Projections?

Why not just use the raw embeddings?

Without projections, a token's dot product with itself tends to dominate its row of scores — every token would attend mostly to itself. Separate Q, K, V projections let the model learn:

  • What aspects to query for (Q)
  • What aspects to advertise (K)
  • What information to provide (V)

These can be completely different! A word might query for “what noun am I modifying?” (Q) while advertising “I’m an adjective” (K) and providing “here’s my semantic meaning” (V).
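The effect is easy to see numerically. Below is a minimal sketch in which random matrices stand in for the learned weights `W_Q`, `W_K`, `W_V` (all names and sizes are illustrative): each token gets its own query, key, and value, while the unprojected score matrix is dominated by each token's similarity to itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4

# Stand-ins for learned projection matrices (random here, for illustration)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

x = rng.normal(size=(3, d_model))     # three token embeddings

Q, K, V = x @ W_Q, x @ W_K, x @ W_V   # each token gets its own q, k, v

# Without projections, token i's self-score x_i . x_i equals ||x_i||^2,
# which tends to be the largest entry in its row of the score matrix.
raw_scores = x @ x.T
```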

🤔 Quick Check
Why does self-attention use separate Q, K, V projections instead of using the raw token embeddings directly?

Step-by-Step: The Math with Real Numbers

Let’s trace through self-attention with actual numbers:

Self-Attention Step by Step

Step 1/6: Input Embeddings (X)

Input X (3 × 4):

| Token | dim 0 | dim 1 | dim 2 | dim 3 |
|---|---|---|---|---|
| "I" | -1.00 | 0.74 | 0.37 | -0.98 |
| "love" | -0.34 | -0.63 | -0.87 | 0.06 |
| "cats" | 0.83 | 0.02 | 0.14 | 0.47 |

The Formula

The complete self-attention operation in one line:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Let’s break this down:

| Step | Operation | What It Does |
|---|---|---|
| 1 | $Q = XW^Q$ | Project each token to a query vector |
| 2 | $K = XW^K$ | Project each token to a key vector |
| 3 | $V = XW^V$ | Project each token to a value vector |
| 4 | $QK^T$ | Compute similarity scores between all pairs |
| 5 | $\div \sqrt{d_k}$ | Scale scores to prevent gradient vanishing |
| 6 | softmax | Convert scores to weights (sum to 1 per row) |
| 7 | $\times V$ | Weighted combination of value vectors |
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """
    Args:
        Q: Queries (seq_len, d_k)
        K: Keys    (seq_len, d_k)
        V: Values  (seq_len, d_v)
    Returns:
        output: (seq_len, d_v)
        weights: (seq_len, seq_len) attention matrix
    """
    d_k = Q.shape[-1]

    # Step 1: Compute dot product scores
    scores = Q @ K.T                    # (seq_len, seq_len)

    # Step 2: Scale
    scores = scores / np.sqrt(d_k)

    # Step 3: Softmax each row
    weights = softmax(scores, axis=-1)  # Each row sums to 1

    # Step 4: Weighted sum of values
    output = weights @ V                # (seq_len, d_v)

    return output, weights
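As a quick sanity check, the same computation can be run standalone (softmax inlined, random toy inputs) to confirm that every row of the attention matrix is a probability distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # toy queries, keys, values
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (3, 3) scaled similarity scores
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)      # softmax, row-wise
output = weights @ V                             # (3, 4) weighted values

print(np.allclose(weights.sum(axis=-1), 1.0))    # True: each row sums to 1
print(output.shape)                              # (3, 4)
```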

Why √d_k Scaling?

This is one of the most common “why?” questions about attention. Here’s the intuition:

The Problem

When $d_k$ is large, dot products have large variance:

$$Q \cdot K = \sum_{i=1}^{d_k} q_i k_i$$

If each $q_i, k_i \sim N(0, 1)$, then $Q \cdot K$ has variance $d_k$.

For $d_k = 64$: standard deviation $= \sqrt{64} = 8$. Scores can easily be ±20 or more.

The Consequence

Large scores push softmax into saturation — outputs become nearly one-hot:

| | Without Scaling ($d_k = 64$) | With Scaling ($\div \sqrt{64} = 8$) |
|---|---|---|
| Raw scores | [30, 25, -10, 5] | [3.75, 3.125, -1.25, 0.625] |
| After softmax | [0.993, 0.007, 0.000, 0.000] | [0.63, 0.34, 0.00, 0.03] |
| Attention | Almost all on one token ❌ | Spread across tokens ✅ |
| Gradients | Near-zero (vanishing) ❌ | Healthy range ✅ |

Saturated softmax → vanishing gradients → network can’t learn. Scaling by $\sqrt{d_k}$ keeps the variance at 1, and softmax stays in its healthy range.
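A tiny experiment makes the saturation visible (score values assumed for illustration, softmax inlined):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([30.0, 25.0, -10.0, 5.0])     # plausible raw scores at d_k = 64

print(softmax(scores).round(3))                 # nearly one-hot
print(softmax(scores / np.sqrt(64)).round(2))   # spread across several tokens
```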

🤔 Quick Check
What happens to the attention distribution if we DON'T scale by √d_k and d_k is large?

Interactive: Explore Attention Patterns

Play with the attention visualizer below. Try different sentences and see how tokens attend to each other:

Self-Attention Visualizer

[Interactive widget: click a token to see its attention distribution over all keys, computed as Attention(Q, K, V) = softmax(QKᵀ / √d_k)V]

Things to try:

  • Toggle Causal Mask to see how decoder-style attention works (each token can only see previous tokens)
  • Increase temperature to see how it softens the attention distribution
  • Click a token to see its complete attention pattern
  • Try sentences with ambiguous words like “bank” to see context effects

Causal (Masked) Self-Attention

For language generation (GPT, LLaMA), we need a critical constraint: a token at position $i$ should only attend to positions $0, 1, \ldots, i$. It can’t see the future!

The Mask

We add $-\infty$ to blocked positions before softmax, so they get zero attention weight:

| Q \ K | "The" | "cat" | "sat" | "on" |
|---|---|---|---|---|
| "The" | ✅ 0 | $-\infty$ | $-\infty$ | $-\infty$ |
| "cat" | ✅ 0 | ✅ 0 | $-\infty$ | $-\infty$ |
| "sat" | ✅ 0 | ✅ 0 | ✅ 0 | $-\infty$ |
| "on" | ✅ 0 | ✅ 0 | ✅ 0 | ✅ 0 |

Each row shows what a token can see: ✅ 0 = visible (the mask adds 0), $-\infty$ = blocked (future position).

When to Mask

| Architecture | Masking | Why |
|---|---|---|
| BERT (encoder) | No mask | Understanding needs full context |
| GPT (decoder) | Causal mask | Can’t see future during generation |
| T5 encoder | No mask | Input is fully visible |
| T5 decoder | Causal mask | Output is generated left-to-right |

✍️ Fill in the Blanks
In causal self-attention, position i can attend to positions 0 through ____, preventing the model from seeing ____ tokens.
import numpy as np

def causal_mask(seq_len):
    # Upper triangle = blocked (-infinity so softmax gives 0)
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(mask == 1, -np.inf, 0.0)
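A standalone check (restating `causal_mask` so the snippet runs on its own) confirms that masked positions receive exactly zero weight after softmax:

```python
import numpy as np

def causal_mask(seq_len):
    # Upper triangle = blocked (-infinity so softmax gives 0)
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(mask == 1, -np.inf, 0.0)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))                 # unmasked attention scores
masked = scores + causal_mask(4)                 # future positions -> -inf

e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)      # softmax, row-wise

print(np.triu(weights, k=1).max())               # 0.0: no attention to the future
```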

Self-Attention vs. Cross-Attention

| Aspect | Self-Attention | Cross-Attention |
|---|---|---|
| Q, K, V source | All from same sequence | Q from one, K/V from another |
| Used in | Every transformer block | Encoder-decoder models only |
| Purpose | Tokens communicate within a sequence | Decoder reads from encoder |
| Example | "cat" attends to "sat" in same sentence | French "chat" attends to English "cat" |

Self-attention is the fundamental building block. Cross-attention is an extension used only in encoder-decoder architectures.
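The only mechanical difference is where Q, K, and V come from. A minimal sketch (projections omitted and raw vectors used directly, which is a simplification; sequence lengths are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
enc = rng.normal(size=(5, d))    # encoder output: 5 source tokens (e.g. English)
dec = rng.normal(size=(3, d))    # decoder states: 3 target tokens (e.g. French)

# Cross-attention: queries from the decoder, keys/values from the encoder
Q, K, V = dec, enc, enc
scores = Q @ K.T / np.sqrt(d)    # (3, 5): each target token scores every source token
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
output = weights @ V             # (3, 4): each decoder token reads from the encoder

print(weights.shape, output.shape)
```

Note that the attention matrix is no longer square: its shape is (target length, source length).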


Variance of Dot Products

Let $q_i, k_i$ be i.i.d. with $E[q_i] = E[k_i] = 0$ and $\text{Var}(q_i) = \text{Var}(k_i) = 1$.

The dot product $s = \sum_{i=1}^{d_k} q_i k_i$ is a sum of $d_k$ independent random variables.

For each term: $E[q_i k_i] = E[q_i]E[k_i] = 0$ and $\text{Var}(q_i k_i) = E[q_i^2 k_i^2] - (E[q_i k_i])^2 = 1 \cdot 1 - 0 = 1$.

By independence: $\text{Var}(s) = d_k \cdot 1 = d_k$.

Dividing by $\sqrt{d_k}$: $\text{Var}(s / \sqrt{d_k}) = d_k / d_k = 1$.

The scores now have unit variance regardless of $d_k$, keeping softmax in its sensitive range.
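The derivation can be verified by simulation; the sketch below draws random unit-variance q and k vectors and estimates the two variances empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 64, 100_000

q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))
s = (q * k).sum(axis=1)          # n independent dot products

print(s.var())                   # ≈ 64 (= d_k)
print((s / np.sqrt(d_k)).var())  # ≈ 1 after scaling
```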


Key Takeaways

  1. Self-attention lets every token gather information from every other token
  2. Q, K, V projections let the model learn different roles for finding and providing information
  3. √d_k scaling prevents softmax saturation and gradient vanishing
  4. Causal masking prevents tokens from seeing the future (essential for generation)
  5. Self-attention is O(n²) in sequence length — this is both its power and its limitation

Next: Multi-Head Attention

A single attention head gives one perspective. But language is multi-faceted — we need multiple perspectives simultaneously. That’s Multi-Head Attention →