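Sinusoidal position embeddings in the original Transformer add position information to the token embeddings before attention. RoPE (rotary position embedding) instead rotates the query and key vectors inside each attention layer, so position enters the attention score multiplicatively. The result is built-in relative-position behavior, easier context-length extension, and no learned position parameters. Most modern open LLMs (for example Llama, Mistral, and Qwen) use RoPE.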
The basic idea
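For each query and key vector, rotate each pair of dimensions by an angle proportional to the token's position. Each pair uses a different frequency: high-frequency pairs rotate quickly and resolve nearby positions, while low-frequency pairs rotate slowly and distinguish distant ones. Because the query and the key are both rotated, their dot product depends only on the offset between the two positions, so it encodes relative position naturally.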
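The rotation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: the function names (`rope_angles`, `apply_rope`, `score`), the interleaved pairing of dimensions, and the base of 10000 are assumptions chosen for the example.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    # One frequency per pair of dimensions: theta_i = base^(-2i/dim).
    # (base=10000 is an assumed, commonly used default.)
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)  # shape (seq, dim // 2)

def apply_rope(x, positions, base=10000.0):
    # x has shape (seq, dim) with even dim; pairs are (x[2i], x[2i+1]).
    angles = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    # Standard 2D rotation applied to every pair.
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((1, 8))

def score(m, n):
    # Attention logit between a query at position m and a key at position n.
    q_rot = apply_rope(q, np.array([m]))
    k_rot = apply_rope(k, np.array([n]))
    return (q_rot @ k_rot.T)[0, 0]

# Same offset (2), different absolute positions: the two scores should match.
print(score(3, 1), score(10, 8))
```

The two printed scores agree up to floating-point error, because only the offset between the positions survives in the dot product.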
Why it's better than learned absolute
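Relative position is what matters most for attention. RoPE encodes it by construction, whereas a learned absolute embedding has to learn the same offset pattern separately at every absolute position, and it has no embedding at all for positions beyond the training length. With RoPE every position has a well-defined rotation, so extending past the training length typically works much better, particularly with the scaling tricks described below.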
Long-context tricks
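Position interpolation, dynamic NTK scaling, and YaRN all rescale the RoPE rotation angles so that positions beyond the trained range map onto angles the model has already seen. Position interpolation compresses the position indices, NTK-style scaling raises the base frequency, and YaRN scales different frequency bands by different amounts. This is why an 8K-trained model can be extended to 32K with minor tuning.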
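As a rough sketch of the two simplest variants, the code below computes rotation angles under position interpolation and under NTK-aware base scaling. The function names are hypothetical, `scale` stands for the ratio of target to trained context length (4 for 8K to 32K), and the base adjustment `base * scale ** (dim / (dim - 2))` is the commonly cited NTK-aware formula, used here as an assumption rather than as any specific model's recipe.

```python
import numpy as np

def inv_freq(dim, base=10000.0):
    # Per-pair frequencies: base^(-2i/dim) for i = 0 .. dim/2 - 1.
    return base ** (-np.arange(0, dim, 2) / dim)

def plain_angles(positions, dim, base=10000.0):
    # Unscaled RoPE angles, for comparison.
    return np.outer(positions, inv_freq(dim, base))

def angles_position_interpolation(positions, dim, scale, base=10000.0):
    # Position interpolation: compress positions by `scale` so that the
    # extended range maps back into the range seen during training.
    return np.outer(np.asarray(positions) / scale, inv_freq(dim, base))

def angles_ntk(positions, dim, scale, base=10000.0):
    # NTK-aware scaling (assumed formula): raise the base so the slowest
    # frequency is stretched by `scale` while the fastest is left unchanged.
    new_base = base * scale ** (dim / (dim - 2))
    return np.outer(positions, inv_freq(dim, new_base))

dim, scale = 64, 4.0                  # e.g. 8K trained, 32K target
positions = np.arange(0, 32768, 4096)

pi = angles_position_interpolation(positions, dim, scale)
ntk = angles_ntk(positions, dim, scale)
plain = plain_angles(positions, dim)

# Interpolation shrinks every frequency band equally; NTK scaling keeps the
# fastest band intact and shrinks only the slower ones.
print(pi[-1, 0] / plain[-1, 0], ntk[-1, 0] / plain[-1, 0])
print(pi[-1, -1] / plain[-1, -1], ntk[-1, -1] / plain[-1, -1])
```

Dynamic NTK scaling follows the same idea but recomputes `scale` from the current sequence length at inference time, and YaRN additionally treats frequency bands differently; neither is shown here.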