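The original Transformer used additive sinusoidal positional encodings. Modern models (Llama, Mistral, Qwen) use RoPE (rotary position embedding), a multiplicative scheme that rotates pairs of query and key dimensions, which is equivalent to multiplying complex numbers by a position-dependent phase. RoPE generalizes better to sequences longer than those seen in training and is the foundation of most long-context techniques.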
How it works
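For each pair of dimensions in Q and K, RoPE rotates the pair by an angle proportional to the token's position, using a different rotation frequency for each pair. Because both vectors are rotated, the positional part of the dot product Q·K depends only on the offset between the two tokens, so it encodes RELATIVE position automatically. Nothing is added to the embeddings; the position information is baked into the attention computation itself.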
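The relative-position property can be checked numerically. The sketch below is a dependency-free illustration, not a production implementation: the helper names (`rope_angles`, `rope_rotate`, `dot`), the base of 10000, and the choice to pair adjacent dimensions are assumptions made for readability.

```python
import math

def rope_angles(pos, dim, base=10000.0):
    """Rotation angle for each of the dim // 2 dimension pairs at `pos`."""
    return [pos * base ** (-2.0 * i / dim) for i in range(dim // 2)]

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by its angle at `pos`."""
    out = []
    for i, angle in enumerate(rope_angles(pos, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

# Same offset (3 positions apart) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_near - score_far) < 1e-9)  # prints True
```

Shifting both tokens by 100 positions leaves the score unchanged, because only the offset between them enters the dot product.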
Why it scales
Sinusoidal PE is added to the token embedding, so the model learns to read absolute position values, and positions beyond the training length produce inputs it has never seen. RoPE's rotation generalizes better: attention scores depend on relative offsets, so a token at position 8000 in a model trained to 4000 still has a sensible (if not perfect) representation.
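One common way to reason about how far this carries is to ask which dimension pairs complete a full turn within the training length. Fast-rotating pairs have already shown the model every angle, so a position past the training range looks familiar to them; slow-rotating pairs reach angles that were never seen, which is one reason the representation is not perfect. The sketch below counts the fast pairs. The function name, `dim=128`, and the base of 10000 are illustrative assumptions.

```python
import math

def pairs_with_full_cycle(dim, trained_len, base=10000.0):
    """Count dimension pairs whose rotation completes at least one
    full turn (2*pi) within `trained_len` positions."""
    count = 0
    for i in range(dim // 2):
        freq = base ** (-2.0 * i / dim)    # radians per position
        wavelength = 2.0 * math.pi / freq  # positions per full turn
        if wavelength <= trained_len:
            count += 1
    return count

dim, trained_len = 128, 4000
seen = pairs_with_full_cycle(dim, trained_len)
print(f"{seen} of {dim // 2} pairs cycle fully within {trained_len} positions")
```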
Length extrapolation tricks
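Even RoPE degrades when pushed well past the training context (roughly 2x, as a rule of thumb). Three common fixes exist. Position interpolation linearly scales position indices down so that the longer sequence maps into the trained range. NTK scaling raises the rotation base, which slows the low-frequency dimensions while leaving the high-frequency ones nearly unchanged. YaRN applies a smoother scaling whose strength varies per dimension. These let you fine-tune a 4K-context model to 32K+ with minimal training.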
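The first two techniques can be sketched as small changes to how rotation angles are computed. This is a minimal illustration under stated assumptions: the function names are hypothetical, the base of 10000 and `dim=128` are example values, and the NTK variant uses the commonly cited rule `new_base = base * scale ** (dim / (dim - 2))`. YaRN is not shown.

```python
import math

def inv_freqs(dim, base=10000.0):
    """Per-pair rotation frequencies base^(-2i/dim) for i in [0, dim/2)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def angles_position_interpolation(pos, dim, scale, base=10000.0):
    """Position interpolation: shrink the position index by `scale`."""
    return [(pos / scale) * f for f in inv_freqs(dim, base)]

def angles_ntk_scaled(pos, dim, scale, base=10000.0):
    """NTK-style scaling: keep the position, enlarge the base instead."""
    new_base = base * scale ** (dim / (dim - 2))
    return [pos * f for f in inv_freqs(dim, new_base)]

dim, trained_len, target_len = 128, 4096, 32768
scale = target_len / trained_len  # 8.0

# Angles the model saw at the last trained position.
plain = [trained_len * f for f in inv_freqs(dim)]
# Angles produced at the last target position under each scheme.
pi_angles = angles_position_interpolation(target_len, dim, scale)
ntk_angles = angles_ntk_scaled(target_len, dim, scale)

# Interpolation maps every pair back into the trained range.
print(all(math.isclose(a, b) for a, b in zip(pi_angles, plain)))
# NTK scaling does so only for the slowest pair; the fastest is untouched.
print(math.isclose(ntk_angles[-1], plain[-1]), ntk_angles[0] == target_len)
```

The design difference is visible in the last two lines: interpolation compresses every dimension equally, while the base change leaves the fastest pair at its original resolution and stretches only the slow ones.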
Implementation
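The `cos` and `sin` arguments taken by `apply_rope` are per-position lookup tables. A dependency-free sketch of how they could be precomputed follows; `build_rope_tables` is a hypothetical helper, the base of 10000 is an assumed default, and real code would typically build these as tensors rather than nested lists.

```python
import math

def build_rope_tables(seq_len, dim, base=10000.0):
    """Return (cos, sin) as [seq_len][dim] nested lists.

    Each frequency appears twice, once per half of the last dimension,
    so column j and column j + dim // 2 share an angle. That matches a
    rotate_half that splits the vector into two halves.
    """
    half = dim // 2
    freqs = [base ** (-2.0 * i / dim) for i in range(half)]
    cos, sin = [], []
    for pos in range(seq_len):
        angles = [pos * f for f in freqs] * 2  # duplicate for both halves
        cos.append([math.cos(a) for a in angles])
        sin.append([math.sin(a) for a in angles])
    return cos, sin

cos, sin = build_rope_tables(seq_len=4, dim=8)
print(len(cos), len(cos[0]))  # 4 8
print(cos[0])                 # all 1.0: position 0 is not rotated
```

Wrapped in `torch.tensor(...)`, tables of shape [seq, dim] broadcast against q and k of shape [batch, heads, seq, dim].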
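    import torch

    def apply_rope(q, k, cos, sin):
        # q, k shape: [batch, heads, seq, dim]
        # cos, sin must broadcast against q and k, e.g. [seq, dim]
        q_rot = (q * cos) + (rotate_half(q) * sin)
        k_rot = (k * cos) + (rotate_half(k) * sin)
        return q_rot, k_rot

    def rotate_half(x):
        # Split the last dim into halves (x1, x2) and return (-x2, x1)
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([-x2, x1], dim=-1)

Why it won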
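RoPE won on a combination of properties: relative position handling, better length extrapolation, a simple implementation, no learnable parameters, and compatibility with FlashAttention, since the rotation is applied to Q and K before the attention kernel runs. That combination has made it the default choice for 2026-era models.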