Positional Encoding

The Problem: Attention Doesn’t Know Order

Here’s a fundamental issue with self-attention: it’s permutation-invariant.

Consider these two very different sentences:

“Dog bites man” → Routine news
“Man bites dog” → Breaking news!

If you compute self-attention on both, the attention weights between “dog”, “bites”, and “man” are identical — because attention only cares about what the tokens are, not where they are.

| Input | Attention between “dog” ↔ “bites” | Meaning |
|---|---|---|
| “Dog bites man” | Same weights | Routine |
| “Man bites dog” | Same weights | Breaking news! |

Self-attention sees the same set of tokens regardless of order — clearly wrong. Word order matters enormously. We need to inject position information.
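To make the permutation-invariance concrete, here is a toy NumPy sketch (random 4-dimensional embeddings and identity Q/K projections, both simplifying assumptions for illustration): swapping two tokens just permutes the rows and columns of the attention-weight matrix, without changing any of the values.

```python
import numpy as np

def attention_weights(X):
    """Row-wise softmax of X X^T / sqrt(d): attention with identity Q/K projections."""
    scores = X @ X.T / np.sqrt(X.shape[1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in ["dog", "bites", "man"]}

A1 = attention_weights(np.stack([emb["dog"], emb["bites"], emb["man"]]))  # "dog bites man"
A2 = attention_weights(np.stack([emb["man"], emb["bites"], emb["dog"]]))  # "man bites dog"

# Swapping "dog" and "man" only permutes rows/columns of the weight matrix:
P = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]])
print(np.allclose(A2, P @ A1 @ P.T))  # True: same weights, just reshuffled
```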

Why RNNs Don’t Have This Problem

RNNs process tokens sequentially — position is implicit in the computation order. Token 3 is processed after token 2, and the hidden state carries temporal information.

Transformers process all positions in parallel — fast, but position-blind without help.


The Solution: Add Position Vectors

The idea is simple: create a positional encoding vector for each position, and add it to the token embedding:

$x_{\text{input}} = \text{token\_embedding} + \text{positional\_encoding}$

Now “dog” at position 0 has a different representation than “dog” at position 2, even though they’re the same word.
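A tiny sketch of this effect (the 4-dimensional embedding for “dog” is made up; the encoding follows the sinusoidal formula introduced below):

```python
import numpy as np

def pe(pos, d=4):
    """Sinusoidal encoding for a single position (toy d_model = 4)."""
    angles = pos / (10000 ** (np.arange(0, d, 2) / d))
    out = np.zeros(d)
    out[0::2] = np.sin(angles)
    out[1::2] = np.cos(angles)
    return out

dog = np.array([0.5, -1.0, 0.3, 0.8])  # made-up embedding for "dog"

x0 = dog + pe(0)  # "dog" at position 0
x2 = dog + pe(2)  # "dog" at position 2
print(np.allclose(x0, x2))  # False: same word, two different representations
```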


Sinusoidal Positional Encoding

The original Transformer paper (2017) used a clever encoding based on sine and cosine waves at different frequencies:

$$PE_{(\text{pos},\,2i)} = \sin\!\left(\frac{\text{pos}}{10000^{2i/d}}\right), \qquad PE_{(\text{pos},\,2i+1)} = \cos\!\left(\frac{\text{pos}}{10000^{2i/d}}\right)$$

The Clock Analogy

Think of positional encoding like the hands on a clock:

| Clock Hand | Frequency | PE Dimensions | What It Captures |
|---|---|---|---|
| Second hand | High (spins fast) | Low dimensions (0, 1, 2…) | Fine-grained position |
| Minute hand | Medium | Middle dimensions | Medium-range structure |
| Hour hand | Low (changes slowly) | High dimensions | Coarse position |

Together, all the hands uniquely identify any moment in time. Similarly, the combination of all sine/cosine waves at different frequencies uniquely identifies each position.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    PE = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]

    # Different frequency for each dimension pair
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))

    PE[:, 0::2] = np.sin(position * div_term)  # Even dims: sin
    PE[:, 1::2] = np.cos(position * div_term)  # Odd dims: cos

    return PE
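A quick usage check (the function is repeated here so the snippet runs standalone): every value stays in [−1, 1], and each position gets its own row of the matrix.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    PE = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    PE[:, 0::2] = np.sin(position * div_term)
    PE[:, 1::2] = np.cos(position * div_term)
    return PE

PE = sinusoidal_positional_encoding(50, 16)
print(PE.shape)  # (50, 16)
# Sines and cosines keep every entry bounded:
print(np.abs(PE).max() <= 1.0)  # True
```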

Visualize the Encoding

Explore the sinusoidal positional encoding patterns. Switch between heatmap view (see all dimensions at once) and wave view (see how individual dimensions vary with position):

Positional Encoding Visualizer

[Interactive widget: heatmap and wave views of PE(pos, i) for positions 0–29 and dimensions 0–31, colored on a −1 to +1 scale.]

What to notice:

  • Low dimensions (left columns) change rapidly — high-frequency “second hand”
  • High dimensions (right columns) change slowly — low-frequency “hour hand”
  • Each position has a unique combination of values across all dimensions

Why Do Nearby Positions Look Similar?

The key property of sinusoidal encoding: similarity between positions depends on their distance, not their absolute values.

Hover over the matrix below to see how cosine similarity between position vectors decreases smoothly with distance:

PE Cosine Similarity Matrix

[Interactive widget: cosine similarity between PE vectors for positions 0–23, colored from −1 (opposite) to +1 (identical); hover cells to see similarity values.]
Key observation: The diagonal stripe pattern shows that similarity depends mainly on relative distance between positions, not absolute position. Nearby positions are similar; distant ones are less so.

Why This Is Useful

This relative-distance property means:

  • The model can learn that “the word 2 positions to my left is usually a verb”
  • This pattern works the same at position 5 as at position 500
  • The model doesn’t have to relearn positional patterns for every absolute position
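This property can be checked numerically. Since $PE(m) \cdot PE(n) = \sum_i \cos((m-n)\,\omega_i)$ and every position vector has the same norm, cosine similarity depends only on the offset, not on where you measure it. A short sketch:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    PE = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]
    div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    PE[:, 0::2] = np.sin(pos * div)
    PE[:, 1::2] = np.cos(pos * div)
    return PE

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

PE = sinusoidal_pe(600, 64)

# Offset 2 gives the same similarity near position 5 as near position 500,
# because PE(m) . PE(n) = sum_i cos((m - n) * w_i) and all rows share one norm.
print(np.isclose(cos_sim(PE[5], PE[7]), cos_sim(PE[500], PE[502])))  # True
```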
🤔 Quick Check
Can sinusoidal positional encoding handle sequences longer than those seen during training?

Comparison: Different Position Encodings

| Method | Used By | Key Idea | Extrapolates? |
|---|---|---|---|
| Sinusoidal | Original Transformer | Fixed sin/cos functions | Yes (in theory) |
| Learned | BERT, GPT-2 | Trainable position embeddings | No (max_len is fixed) |
| RoPE | LLaMA, Mistral | Rotate Q and K by position | Yes (better than sinusoidal) |
| ALiBi | BLOOM | Subtract linear bias from attention scores | Yes (strong extrapolation) |
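To make the ALiBi row concrete, here is a minimal sketch of the bias it subtracts from attention scores. The per-head slopes follow the paper's geometric 2^(−8h/H) pattern; using symmetric distance here is a simplification (ALiBi as published applies the penalty causally):

```python
import numpy as np

def alibi_bias(seq_len, n_heads):
    """Per-head bias that grows linearly with token distance (simplified, symmetric)."""
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    dist = np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :])
    return -slopes[:, None, None] * dist  # shape: (n_heads, seq_len, seq_len)

bias = alibi_bias(5, 2)
print(bias.shape)     # (2, 5, 5)
print(bias[0, 2, 0])  # -0.125: head slope 0.0625 times distance 2
```

Because the bias is a fixed function of distance, it needs no parameters and extends to any sequence length.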

Learned Positional Embeddings

BERT and GPT-2 simply learn a $(\text{max\_len}, d_{\text{model}})$ parameter matrix — each position gets a trainable vector.

Pros: Slightly better performance (model optimizes the encoding)
Cons: Can’t handle sequences longer than max_len

class LearnedPE:
    def __init__(self, max_len, d_model):
        # Random init; the optimizer updates these vectors during training
        self.PE = np.random.randn(max_len, d_model) * 0.01

    def forward(self, seq_len):
        return self.PE[:seq_len]

RoPE (Rotary Position Embedding)

The modern favorite. Instead of adding PE to the embedding, RoPE rotates the Q and K vectors based on their position:

$q'_m = q_m \cdot e^{im\theta}, \qquad k'_n = k_n \cdot e^{in\theta}$

The dot product then naturally encodes relative position:

$q'_m \cdot k'_n = q_m \cdot k_n \cdot e^{i(m-n)\theta}$

The phase difference $(m-n)$ captures how far apart the tokens are. This is elegant because relative position information emerges automatically in the attention scores.
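The same idea can be written in real arithmetic: rotate each (even, odd) pair of dimensions by pos · θᵢ, with the standard frequencies θᵢ = 10000^(−2i/d) from the RoFormer paper. A minimal sketch, verifying that the dot product depends only on the offset m − n:

```python
import numpy as np

def rope(x, pos):
    """Rotate each (even, odd) pair of dims by pos * theta_i, theta_i = 10000^(-2i/d)."""
    d = x.shape[-1]
    theta = 10000.0 ** (-np.arange(d // 2) * 2.0 / d)
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(1)
q, k = rng.normal(size=8), rng.normal(size=8)

# The attention score depends only on the relative offset m - n:
s1 = rope(q, 3) @ rope(k, 1)   # m=3,  n=1  -> offset 2
s2 = rope(q, 10) @ rope(k, 8)  # m=10, n=8  -> offset 2
print(np.isclose(s1, s2))  # True
```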

Used by: LLaMA, LLaMA 2, Mistral, Gemma, and most modern LLMs.

Advantages Over Sinusoidal and Learned PE

  1. Relative position in attention: RoPE encodes relative position directly in the Q·K dot product, not as an additive offset. This is more natural.

  2. Better extrapolation: With techniques like NTK-aware scaling or YaRN, RoPE can handle sequences 4-8× longer than training.

  3. No extra parameters: Like sinusoidal, RoPE uses fixed mathematical functions (no learned parameters for position).

  4. Compatible with linear attention: The rotation can be efficiently applied.

The Evolution

  • 2017: Sinusoidal (original paper)
  • 2018: Learned (BERT, GPT-2)
  • 2021: RoPE (RoFormer paper)
  • 2022: ALiBi (BLOOM)
  • 2023+: RoPE dominates in open-source LLMs

Key Equations

Sinusoidal

$$PE_{(\text{pos},\,2i)} = \sin\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(\text{pos},\,2i+1)} = \cos\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)$$

Input to Transformer

$x = \text{Embedding}(\text{token}) + \text{PE}(\text{position})$


Summary

  1. Self-attention is position-blind — we must add position information explicitly
  2. Sinusoidal encoding uses different-frequency waves to create unique position signatures
  3. Nearby positions are similar — the encoding captures relative distance
  4. Modern models use RoPE — it encodes relative position directly in attention scores

Next: Layer Normalization & Residual Connections

The final building blocks for stable training: Layer Norm & Residuals →