Positional Encoding
The Problem: Attention Doesn’t Know Order
Here’s a fundamental issue with self-attention: it’s permutation-invariant.
Consider these two very different sentences:
“Dog bites man” → Routine news
“Man bites dog” → Breaking news!
If you compute self-attention on both, the attention weights between “dog”, “bites”, and “man” are identical — because attention only cares about what the tokens are, not where they are.
| Input | Attention between “dog” ↔ “bites” | Meaning |
|---|---|---|
| “Dog bites man” | Same weights | Routine |
| “Man bites dog” | Same weights | Breaking news! |
Self-attention sees the same set of tokens regardless of order — clearly wrong. Word order matters enormously. We need to inject position information.
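This permutation property is easy to verify numerically. The sketch below uses a toy single-head attention with identity Q/K/V projections (a simplifying assumption for brevity; real models use learned projection matrices). Reordering the input rows merely reorders the output rows:

```python
import numpy as np

def toy_self_attention(X):
    # Toy single-head self-attention with identity Q/K/V projections
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))   # stand-ins for "dog", "bites", "man"
perm = [2, 1, 0]                  # reversed: "man", "bites", "dog"

out = toy_self_attention(X)
out_perm = toy_self_attention(X[perm])

# Reordering the input just reorders the output rows; no row changes value
print(np.allclose(out_perm, out[perm]))  # True
```

Attention is permutation-equivariant: shuffle the inputs and the outputs shuffle identically, so nothing in the result distinguishes “dog bites man” from “man bites dog”.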
Why RNNs Don’t Have This Problem
RNNs process tokens sequentially — position is implicit in the computation order. Token 3 is processed after token 2, and the hidden state carries temporal information.
Transformers process all positions in parallel — fast, but position-blind without help.
The Solution: Add Position Vectors
The idea is simple: create a positional encoding vector for each position and add it to the token embedding:

x_i = embedding(token_i) + PE(i)
Now “dog” at position 0 has a different representation than “dog” at position 2, even though they’re the same word.
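A toy sketch of this addition (random matrices stand in for trained embeddings; the `vocab`, `E`, and `PE` values are illustrative, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"dog": 0, "bites": 1, "man": 2}
d_model = 8
E = rng.standard_normal((len(vocab), d_model))   # token embedding table
PE = rng.standard_normal((10, d_model))          # one vector per position

def embed(tokens):
    ids = [vocab[t] for t in tokens]
    return E[ids] + PE[: len(ids)]               # token vector + position vector

a = embed(["dog", "bites", "man"])
b = embed(["man", "bites", "dog"])

# Same word, different position -> different representation
print(np.allclose(a[0], b[2]))  # False: "dog" at position 0 vs position 2
print(np.allclose(a[1], b[1]))  # True: "bites" sits at position 1 in both
```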
Sinusoidal Positional Encoding
The original Transformer paper (2017) used a clever encoding based on sine and cosine waves at different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
The Clock Analogy
Think of positional encoding like the hands on a clock:
| Clock Hand | Frequency | PE Dimensions | What It Captures |
|---|---|---|---|
| Second hand | High (spins fast) | Low dimensions (0, 1, 2…) | Fine-grained position |
| Minute hand | Medium | Middle dimensions | Medium-range structure |
| Hour hand | Low (changes slowly) | High dimensions | Coarse position |
Together, all the hands uniquely identify any moment in time. Similarly, the combination of all sine/cosine waves at different frequencies uniquely identifies each position.
```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    PE = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]
    # Different frequency for each dimension pair
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    PE[:, 0::2] = np.sin(position * div_term)  # Even dims: sin
    PE[:, 1::2] = np.cos(position * div_term)  # Odd dims: cos
    return PE
```

Visualize the Encoding
Explore the sinusoidal positional encoding patterns. Switch between heatmap view (see all dimensions at once) and wave view (see how individual dimensions vary with position):
[Interactive: Positional Encoding Visualizer]
What to notice:
- Low dimensions (left columns) change rapidly — high-frequency “second hand”
- High dimensions (right columns) change slowly — low-frequency “hour hand”
- Each position has a unique combination of values across all dimensions
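These observations can be checked numerically. The snippet below rebuilds the encoding (inlined so it runs standalone) and verifies that the slowest dimension barely drifts while every position still gets a distinct vector:

```python
import numpy as np

max_len, d_model = 128, 64
pos = np.arange(max_len)[:, None]
div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
PE = np.zeros((max_len, d_model))
PE[:, 0::2] = np.sin(pos * div)
PE[:, 1::2] = np.cos(pos * div)

# The last sin dimension is the "hour hand": over 128 positions its angle
# only reaches ~128 * 10000^(-62/64) radians, so it barely moves
print(np.ptp(PE[:, -2]) < 0.05)  # True

# Yet the full vector is unique for every position
print(len({PE[i].tobytes() for i in range(max_len)}) == max_len)  # True
```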
Why Do Nearby Positions Look Similar?
The key property of sinusoidal encoding: similarity between positions depends on their distance, not their absolute values.
Hover over the matrix below to see how cosine similarity between position vectors decreases smoothly with distance:
[Interactive: PE Cosine Similarity Matrix]
Why This Is Useful
This relative-distance property means:
- The model can learn that “the word 2 positions to my left is usually a verb”
- This pattern works the same at position 5 as at position 500
- The model doesn’t have to relearn positional patterns for every absolute position
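Both properties are easy to confirm: every sinusoidal position vector has the same norm, and the dot product between two of them is a sum of cos((p − q)·ω_i) terms, so cosine similarity depends only on the offset. A quick numerical check (reusing the encoding function defined above, inlined here so the snippet runs standalone):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]
    div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    PE = np.zeros((max_len, d_model))
    PE[:, 0::2] = np.sin(pos * div)
    PE[:, 1::2] = np.cos(pos * div)
    return PE

PE = sinusoidal_pe(1024, 64)

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity falls off with distance...
print(cos_sim(PE[0], PE[1]) > cos_sim(PE[0], PE[5]))   # True

# ...and depends only on the offset, not the absolute position
print(np.isclose(cos_sim(PE[5], PE[7]), cos_sim(PE[500], PE[502])))  # True
```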
Comparison: Different Position Encodings
| Method | Used By | Key Idea | Extrapolates? |
|---|---|---|---|
| Sinusoidal | Original Transformer | Fixed sin/cos functions | Yes (in theory) |
| Learned | BERT, GPT-2 | Trainable position embeddings | No (max_len is fixed) |
| RoPE | LLaMA, Mistral | Rotate Q and K by position | Yes (better than sinusoidal) |
| ALiBi | BLOOM | Subtract linear bias from attention scores | Yes (strong extrapolation) |
Learned Positional Embeddings
BERT and GPT-2 simply learn a parameter matrix — each position gets a trainable vector.
Pros: Slightly better performance (model optimizes the encoding)
Cons: Can’t handle sequences longer than max_len
```python
import numpy as np

class LearnedPE:
    def __init__(self, max_len, d_model):
        # Initialized randomly, then trained along with the rest of the model
        self.PE = np.random.randn(max_len, d_model) * 0.01
    def forward(self, seq_len):
        return self.PE[:seq_len]
```

RoPE (Rotary Position Embedding)
The modern favorite. Instead of adding PE to the embedding, RoPE rotates the Q and K vectors based on their position: each pair of dimensions is rotated by an angle proportional to the position,

q'_m = R(m·θ) q_m,   k'_n = R(n·θ) k_n

The dot product then naturally encodes relative position:

q'_m · k'_n = q_m · R((n - m)·θ) k_n

The phase difference captures how far apart the tokens are. This is elegant because relative position information emerges automatically in the attention scores, as a function of the offset n - m alone.
Used by: LLaMA, LLaMA 2, Mistral, Gemma, and most modern LLMs.
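A minimal NumPy sketch of the idea, using the pairwise 2-D rotations described in the RoFormer paper (the `rope` helper and its dimensions are illustrative; production implementations fuse this into the attention kernel):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate each (even, odd) dimension pair of x by positions * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)     # per-pair frequencies
    ang = positions[:, None] * theta[None, :]     # (seq, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin            # standard 2-D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((1, 8))

# The attention score depends only on the offset, not absolute positions
s1 = float(rope(q, np.array([3])) @ rope(k, np.array([7])).T)      # offset 4
s2 = float(rope(q, np.array([103])) @ rope(k, np.array([107])).T)  # offset 4
print(np.isclose(s1, s2))  # True
```

Shifting both tokens by 100 positions leaves the score unchanged, which is exactly the relative-position property the equations above describe.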
Advantages Over Sinusoidal and Learned PE
- Relative position in attention: RoPE encodes relative position directly in the Q·K dot product, not as an additive offset. This is more natural.
- Better extrapolation: With techniques like NTK-aware scaling or YaRN, RoPE can handle sequences 4-8× longer than training.
- No extra parameters: Like sinusoidal, RoPE uses fixed mathematical functions (no learned parameters for position).
- Compatible with linear attention: The rotation is an elementwise operation on Q and K, so it can be applied efficiently and composes with efficient attention variants.
The Evolution
- 2017: Sinusoidal (original paper)
- 2018: Learned (BERT, GPT-2)
- 2021: RoPE (RoFormer paper)
- 2022: ALiBi used at scale in BLOOM (the ALiBi paper itself appeared in 2021)
- 2023+: RoPE dominates in open-source LLMs
Key Equations
Sinusoidal

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Input to Transformer

x_i = embedding(token_i) + PE(i)
Summary
- Self-attention is position-blind — we must add position information explicitly
- Sinusoidal encoding uses different-frequency waves to create unique position signatures
- Nearby positions are similar — the encoding captures relative distance
- Modern models use RoPE — it encodes relative position directly in attention scores
Next: Layer Normalization & Residual Connections
The final building blocks for stable training: Layer Norm & Residuals →