Positional Encoding
The Problem: Attention Doesn’t Know Order
Here’s a fundamental issue with self-attention: it’s permutation-invariant.
Consider these two very different sentences:
“Dog bites man” → Routine news
“Man bites dog” → Breaking news!
If you compute self-attention on both, the attention weights between “dog”, “bites”, and “man” are identical — because attention only cares about what the tokens are, not where they are.
| Input | Attention between “dog” ↔ “bites” | Meaning |
|---|---|---|
| “Dog bites man” | Same weights | Routine |
| “Man bites dog” | Same weights | Breaking news! |
Self-attention sees the same set of tokens regardless of order — clearly wrong. Word order matters enormously. We need to inject position information.
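This permutation property is easy to verify numerically. The sketch below uses a toy single-head attention with identity Q/K/V projections (a simplifying assumption for brevity; real models use learned projection matrices). Reordering the input rows merely reorders the output rows:

```python
import numpy as np

def toy_self_attention(X):
    # Toy single-head self-attention with identity Q/K/V projections
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))   # stand-ins for "dog", "bites", "man"
perm = [2, 1, 0]                  # reversed: "man", "bites", "dog"

out = toy_self_attention(X)
out_perm = toy_self_attention(X[perm])

# Reordering the input just reorders the output rows; no row changes value
print(np.allclose(out_perm, out[perm]))  # True
```

Attention is permutation-equivariant: shuffle the inputs and the outputs shuffle identically, so nothing in the result distinguishes “dog bites man” from “man bites dog”.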
Why RNNs Don’t Have This Problem
RNNs process tokens sequentially — position is implicit in the computation order. Token 3 is processed after token 2, and the hidden state carries temporal information.
Transformers process all positions in parallel — fast, but position-blind without help.
The Solution: Add Position Vectors
The idea is simple: create a positional encoding vector for each position and add it to the token embedding:

x_i = embedding(token_i) + PE(i)
Now “dog” at position 0 has a different representation than “dog” at position 2, even though they’re the same word.
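A toy sketch of this addition (random matrices stand in for trained embeddings; the `vocab`, `E`, and `PE` values are illustrative, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"dog": 0, "bites": 1, "man": 2}
d_model = 8
E = rng.standard_normal((len(vocab), d_model))   # token embedding table
PE = rng.standard_normal((10, d_model))          # one vector per position

def embed(tokens):
    ids = [vocab[t] for t in tokens]
    return E[ids] + PE[: len(ids)]               # token vector + position vector

a = embed(["dog", "bites", "man"])
b = embed(["man", "bites", "dog"])

# Same word, different position -> different representation
print(np.allclose(a[0], b[2]))  # False: "dog" at position 0 vs position 2
print(np.allclose(a[1], b[1]))  # True: "bites" sits at position 1 in both
```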
Sinusoidal Positional Encoding
The original Transformer paper (2017) used a clever encoding based on sine and cosine waves at different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
The Clock Analogy
Think of positional encoding like the hands on a clock:
| Clock Hand | Frequency | PE Dimensions | What It Captures |
|---|---|---|---|
| Second hand | High (spins fast) | Low dimensions (0, 1, 2…) | Fine-grained position |
| Minute hand | Medium | Middle dimensions | Medium-range structure |
| Hour hand | Low (changes slowly) | High dimensions | Coarse position |
Together, all the hands uniquely identify any moment in time. Similarly, the combination of all sine/cosine waves at different frequencies uniquely identifies each position.
```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    PE = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]
    # Different frequency for each dimension pair
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    PE[:, 0::2] = np.sin(position * div_term)  # Even dims: sin
    PE[:, 1::2] = np.cos(position * div_term)  # Odd dims: cos
    return PE
```

Visualize the Encoding
Explore the sinusoidal positional encoding patterns. Switch between heatmap view (see all dimensions at once) and wave view (see how individual dimensions vary with position):
[Interactive: Positional Encoding Visualizer]
What to notice:
- Low dimensions (left columns) change rapidly — high-frequency “second hand”
- High dimensions (right columns) change slowly — low-frequency “hour hand”
- Each position has a unique combination of values across all dimensions
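These observations can be checked numerically. The snippet below rebuilds the encoding (inlined so it runs standalone) and verifies that the slowest dimension barely drifts while every position still gets a distinct vector:

```python
import numpy as np

max_len, d_model = 128, 64
pos = np.arange(max_len)[:, None]
div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
PE = np.zeros((max_len, d_model))
PE[:, 0::2] = np.sin(pos * div)
PE[:, 1::2] = np.cos(pos * div)

# The last sin dimension is the "hour hand": over 128 positions its angle
# only reaches ~128 * 10000^(-62/64) radians, so it barely moves
print(np.ptp(PE[:, -2]) < 0.05)  # True

# Yet the full vector is unique for every position
print(len({PE[i].tobytes() for i in range(max_len)}) == max_len)  # True
```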
Why Do Nearby Positions Look Similar?
The key property of sinusoidal encoding: similarity between positions depends on their distance, not their absolute values.
Hover over the matrix below to see how cosine similarity between position vectors decreases smoothly with distance:
[Interactive: PE Cosine Similarity Matrix]
Why This Is Useful
This relative-distance property means:
- The model can learn that “the word 2 positions to my left is usually a verb”
- This pattern works the same at position 5 as at position 500
- The model doesn’t have to relearn positional patterns for every absolute position
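Both properties are easy to confirm: every sinusoidal position vector has the same norm, and the dot product between two of them is a sum of cos((p − q)·ω_i) terms, so cosine similarity depends only on the offset. A quick numerical check (reusing the encoding function defined above, inlined here so the snippet runs standalone):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]
    div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    PE = np.zeros((max_len, d_model))
    PE[:, 0::2] = np.sin(pos * div)
    PE[:, 1::2] = np.cos(pos * div)
    return PE

PE = sinusoidal_pe(1024, 64)

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity falls off with distance...
print(cos_sim(PE[0], PE[1]) > cos_sim(PE[0], PE[5]))   # True

# ...and depends only on the offset, not the absolute position
print(np.isclose(cos_sim(PE[5], PE[7]), cos_sim(PE[500], PE[502])))  # True
```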
Comparison: Different Position Encodings
| Method | Used By | Key Idea | Extrapolates? |
|---|---|---|---|
| Sinusoidal | Original Transformer | Fixed sin/cos functions | Yes (in theory) |
| Learned | BERT, GPT-2 | Trainable position embeddings | No (max_len is fixed) |
| RoPE | LLaMA, Mistral | Rotate Q and K by position | Yes (better than sinusoidal) |
| ALiBi | BLOOM | Subtract linear bias from attention scores | Yes (strong extrapolation) |
Learned Positional Embeddings
BERT and GPT-2 simply learn a parameter matrix — each position gets a trainable vector.
Pros: Slightly better performance (model optimizes the encoding)
Cons: Can’t handle sequences longer than max_len
```python
import numpy as np

class LearnedPE:
    def __init__(self, max_len, d_model):
        # Initialized randomly, then trained along with the rest of the model
        self.PE = np.random.randn(max_len, d_model) * 0.01
    def forward(self, seq_len):
        return self.PE[:seq_len]
```

RoPE (Rotary Position Embedding)
The modern favorite. Instead of adding PE to the embedding, RoPE rotates the Q and K vectors based on their position: each pair of dimensions is rotated by an angle proportional to the position,

q'_m = R(m·θ) q_m,   k'_n = R(n·θ) k_n

The dot product then naturally encodes relative position:

q'_m · k'_n = q_m · R((n - m)·θ) k_n

The phase difference captures how far apart the tokens are. This is elegant because relative position information emerges automatically in the attention scores, as a function of the offset n - m alone.
Used by: LLaMA, LLaMA 2, Mistral, Gemma, and most modern LLMs.
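A minimal NumPy sketch of the idea, using the pairwise 2-D rotations described in the RoFormer paper (the `rope` helper and its dimensions are illustrative; production implementations fuse this into the attention kernel):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate each (even, odd) dimension pair of x by positions * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)     # per-pair frequencies
    ang = positions[:, None] * theta[None, :]     # (seq, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin            # standard 2-D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((1, 8))

# The attention score depends only on the offset, not absolute positions
s1 = float(rope(q, np.array([3])) @ rope(k, np.array([7])).T)      # offset 4
s2 = float(rope(q, np.array([103])) @ rope(k, np.array([107])).T)  # offset 4
print(np.isclose(s1, s2))  # True
```

Shifting both tokens by 100 positions leaves the score unchanged, which is exactly the relative-position property the equations above describe.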
Advantages Over Sinusoidal and Learned PE
- Relative position in attention: RoPE encodes relative position directly in the Q·K dot product, not as an additive offset. This is more natural.
- Better extrapolation: With techniques like NTK-aware scaling or YaRN, RoPE can handle sequences 4-8× longer than training.
- No extra parameters: Like sinusoidal, RoPE uses fixed mathematical functions (no learned parameters for position).
- Compatible with linear attention: The rotation is an elementwise operation on Q and K, so it can be applied efficiently and composes with efficient attention variants.
The Evolution
- 2017: Sinusoidal (original paper)
- 2018: Learned (BERT, GPT-2)
- 2021: RoPE (RoFormer paper)
- 2022: ALiBi used at scale in BLOOM (the ALiBi paper itself appeared in 2021)
- 2023+: RoPE dominates in open-source LLMs
Key Equations
Sinusoidal

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Input to Transformer

x_i = embedding(token_i) + PE(i)
Summary
- Self-attention is position-blind — we must add position information explicitly
- Sinusoidal encoding uses different-frequency waves to create unique position signatures
- Nearby positions are similar — the encoding captures relative distance
- Modern models use RoPE — it encodes relative position directly in attention scores
Next: Layer Normalization & Residual Connections
The final building blocks for stable training: Layer Norm & Residuals →