The original Transformer used additive sinusoidal positional encoding. Modern models (Llama, Mistral, Qwen) use RoPE (rotary position embedding), a multiplicative rotation that can be written as multiplication by a complex phase. RoPE generalizes better to sequences longer than those seen in training, and it is the foundation of most long-context techniques.

How it works

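For each pair of dimensions in Q and K, RoPE rotates the pair by an angle proportional to the token's position, with a different rotation frequency for each pair. Rotating Q by the angle for position m and K by the angle for position n leaves a dot product that depends only on the offset m - n, so Q·K encodes relative position automatically. Nothing is added to the embedding; the position information is baked into the attention computation itself.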
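
The relative-position property can be checked numerically. The following is a minimal sketch in plain Python, not a production implementation: the names `rope_rotate` and `dot`, the base of 10000, and the sample vectors are assumptions made for illustration. It pairs adjacent dimensions and treats each pair as a complex number, which is one of two common layouts.

```python
import cmath
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each adjacent (even, odd) pair of vec by the angle pos * theta_i."""
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)          # assumed per-pair frequency schedule
        z = complex(vec[2 * i], vec[2 * i + 1])   # view the pair as a complex number
        z *= cmath.exp(1j * pos * theta)          # multiply by a unit phase = rotate
        out.extend([z.real, z.imag])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Illustrative query and key vectors (arbitrary values).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same offset of 3 between query and key, at two different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(math.isclose(score_a, score_b, abs_tol=1e-9))  # prints True
```

Shifting both positions by the same amount leaves the score unchanged, while changing the offset between them changes it.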

Why it scales

Sinusoidal PE is added to the token embedding, so positions beyond the training length produce input vectors the model has never seen. RoPE's rotation generalizes more gracefully: because attention scores depend on relative offsets, a token at position 8000 in a model trained to 4000 still has a sensible (if not perfect) representation.

Length extrapolation tricks

Even RoPE struggles beyond roughly 2x the training context. Three common fixes exist. Position interpolation linearly scales position indices down so the longer sequence fits into the trained range. NTK scaling instead raises the rotation base, which stretches the low-frequency dimensions while leaving the high-frequency ones nearly unchanged. YaRN applies smoother scaling with per-dimension behavior. These let you fine-tune a 4K-context model to 32K+ with minimal training.
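
Two of these tricks reduce to a line of arithmetic each. The sketch below assumes a head dimension of 128, a base of 10000, and the commonly used NTK-aware formula base * scale^(dim / (dim - 2)); the function names are hypothetical and not taken from any library. YaRN is left out because its per-dimension blending needs more machinery than fits in a short example.

```python
def inv_freqs(dim, base=10000.0):
    """Per-pair rotation frequencies, from highest (index 0) to lowest."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def interpolated_position(pos, train_len, target_len):
    """Position interpolation: squeeze [0, target_len) into [0, train_len)."""
    return pos * train_len / target_len

def ntk_scaled_base(dim, scale, base=10000.0):
    """NTK-aware scaling: enlarge the base so the lowest frequency slows by
    `scale` while the highest frequency is left untouched."""
    return base * scale ** (dim / (dim - 2))

dim, train_len, target_len = 128, 4096, 32768
scale = target_len / train_len  # 8.0

# The last position of the long sequence maps back inside the trained range.
print(interpolated_position(32767, train_len, target_len))  # 4095.875

orig = inv_freqs(dim)
ntk = inv_freqs(dim, ntk_scaled_base(dim, scale))
print(orig[0], ntk[0])      # highest frequency is 1.0 in both cases
print(orig[-1] / ntk[-1])   # lowest frequency is slowed by about `scale`
```

Position interpolation changes the positions and keeps the frequencies; NTK-aware scaling keeps the positions and changes the frequencies, mostly at the low end.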

Implementation

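import torch

def rotate_half(x):
    # Split the last dimension into two halves and map (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(q, k, cos, sin):
    # q, k shape: [batch, heads, seq, dim], with dim even
    # cos, sin must broadcast against q and k, e.g. shape [seq, dim]
    q_rot = (q * cos) + (rotate_half(q) * sin)
    k_rot = (k * cos) + (rotate_half(k) * sin)
    return q_rot, k_rot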
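
The snippet above takes `cos` and `sin` as inputs without showing where they come from. One way to build them is sketched here in plain Python so that it runs without dependencies. The name `build_rope_tables` and the base of 10000 are assumptions. The angles are duplicated across the two halves of the last dimension because `rotate_half` pairs element i with element i + dim/2, so both members of a pair need the same angle.

```python
import math

def build_rope_tables(seq_len, dim, base=10000.0):
    """Return (cos, sin) as nested lists of shape [seq_len, dim]."""
    half = dim // 2
    inv_freq = [base ** (-2.0 * i / dim) for i in range(half)]
    cos, sin = [], []
    for pos in range(seq_len):
        angles = [pos * f for f in inv_freq]
        angles = angles + angles  # same angle for element i and element i + half
        cos.append([math.cos(a) for a in angles])
        sin.append([math.sin(a) for a in angles])
    return cos, sin

cos, sin = build_rope_tables(seq_len=4, dim=8)
print(len(cos), len(cos[0]))  # 4 8
print(cos[0], sin[0])         # position 0: all ones and all zeros, i.e. no rotation
```

Wrapping the results with `torch.tensor(cos)` and `torch.tensor(sin)` gives tensors of shape [seq, dim], which broadcast against q and k of shape [batch, heads, seq, dim].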

Why it won

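RoPE combines relative position handling, reasonable extrapolation, a simple implementation, and no learnable parameters. It also plays well with FlashAttention, because the rotation is applied to Q and K before the attention kernel runs. That combination is hard to beat for 2026-era models.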

In short: RoPE rotates Q/K by position, which enables long-context extrapolation. NTK scaling and YaRN extend it further with minimal fine-tuning.