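A model trained on a 4K context can be extended to 32K, 128K, or even 1M tokens with the right tricks. All of them rescale RoPE's rotation frequencies, whether by dividing positions, raising the frequency base, or setting per-dimension factors. The math is short; the empirical recipes are what changed the long-context game.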

The problem

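RoPE frequencies θ_i = 10000^(-2i/d) are fixed, so the model only learns the rotation angles pos·θ_i that occur within its training length. High-frequency dimensions wrap around many times inside that window and cover every angle. Low-frequency dimensions never complete a full rotation during training, so positions beyond the window push them to angles the model has never seen. Performance drops sharply outside the trained range.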
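To make the out-of-range problem concrete, the sketch below computes each dimension pair's wavelength (positions per full rotation) and counts how many pairs never finish a rotation inside the training window. The head dimension of 128 and the 4096-token training length are illustrative assumptions, not values taken from any particular model.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    """Per-pair RoPE frequencies theta_i = base^(-2i/d) for i = 0..d/2-1."""
    i = np.arange(dim // 2)
    return base ** (-2.0 * i / dim)

def rope_wavelengths(dim, base=10000.0):
    """Number of positions each dimension pair needs for one full rotation."""
    return 2.0 * np.pi / rope_frequencies(dim, base)

# Assumed values, for illustration only.
head_dim, train_len = 128, 4096
wavelengths = rope_wavelengths(head_dim)
unseen = int(np.sum(wavelengths > train_len))

print(f"shortest wavelength: {wavelengths[0]:.2f} positions")
print(f"longest wavelength:  {wavelengths[-1]:.0f} positions")
print(f"{unseen} of {head_dim // 2} pairs never complete a rotation in {train_len} tokens")
```

The pairs counted in the last line are the ones that reach unfamiliar angles once positions exceed the training length.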

Linear interpolation (Position Interpolation)

# Original: θ_i applied to pos directly
# PI: scale pos by ratio so max_pos maps to trained_max
#
# For 4K→16K extension (4x):
# effective_pos = real_pos / 4

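Squashes positions into the trained range. Simple, but every dimension is compressed by the same factor, so at large ratios neighboring tokens become hard to tell apart in the high-frequency dimensions and quality drops. Used in early Llama extensions.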
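A minimal sketch of the position scaling described above, using NumPy. The function name `rope_angles` and its `pi_scale` parameter are hypothetical names chosen for this example. It only computes the rotation angles, not the full rotary embedding.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, pi_scale=1.0):
    """Rotation angles with shape (len(positions), dim // 2).

    pi_scale is the extension ratio new_max / trained_max; 1.0 is plain RoPE.
    """
    theta = base ** (-2.0 * np.arange(dim // 2) / dim)
    # Position Interpolation: divide the real position by the extension ratio.
    effective_pos = np.asarray(positions, dtype=np.float64) / pi_scale
    return np.outer(effective_pos, theta)

# 4K -> 16K extension (4x): position 16380 gets the angles of trained position 4095.
extended = rope_angles([16380], dim=64, pi_scale=4.0)
trained = rope_angles([4095], dim=64)
print(np.allclose(extended, trained))
```

Because the division applies to every dimension pair alike, the fastest-rotating pairs are compressed just as much as the slowest ones.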

NTK-aware scaling

# Don't scale all dimensions equally
# Leave the high-frequency dims (almost) untouched; interpolate the low-frequency ones
# Mathematically: increase the base from 10000 to a higher value
#
# base' = 10000 * α^(d/(d-2))    where α = new_max / orig_max

Preserves the high-frequency (local, fine-grained) information and stretches the low frequencies that carry long-range position. Better quality than pure interpolation. Used in many open extensions.
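The sketch below evaluates the base' formula above and compares the resulting frequencies with the originals at both ends of the spectrum. The head dimension of 128 and α = 4 are assumed values for illustration.

```python
import numpy as np

def ntk_scaled_base(dim, alpha, base=10000.0):
    """base' = base * alpha^(d / (d - 2)), with alpha = new_max / orig_max."""
    return base * alpha ** (dim / (dim - 2))

def rope_frequencies(dim, base=10000.0):
    """Per-pair RoPE frequencies theta_i = base^(-2i/d)."""
    i = np.arange(dim // 2)
    return base ** (-2.0 * i / dim)

dim, alpha = 128, 4.0  # assumed head dimension and a 4x extension
original = rope_frequencies(dim)
scaled = rope_frequencies(dim, ntk_scaled_base(dim, alpha))
ratio = scaled / original

print(ratio[0])   # highest-frequency pair
print(ratio[-1])  # lowest-frequency pair
```

With this formula the first ratio is 1 (the highest-frequency pair is unchanged) and the last is 1/α (the lowest-frequency pair is interpolated by the full extension ratio), with a smooth transition in between.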

YaRN

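Combines a per-dimension refinement of NTK-aware scaling (often called NTK-by-parts) with a temperature-style adjustment on the attention logits. Among open recipes it is widely regarded as the strongest for 4×-32× extensions. Used in Yarn-Mistral and Yarn-Llama. A few hundred fine-tuning steps are enough to adapt the model.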
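A rough sketch of the two ingredients. This is not a reference implementation. The ramp thresholds (`lo=1`, `hi=32`) and the `0.1 * ln(s) + 1` attention factor are assumptions based on values commonly cited from the YaRN paper for Llama-family models; the function names are hypothetical.

```python
import math
import numpy as np

def yarn_frequencies(dim, scale, train_len, base=10000.0, lo=1.0, hi=32.0):
    """Blend interpolated and original frequencies per dimension pair.

    lo/hi are assumed ramp thresholds on the number of full rotations
    a pair completes within train_len.
    """
    theta = base ** (-2.0 * np.arange(dim // 2) / dim)
    rotations = train_len * theta / (2.0 * np.pi)
    # gamma = 0 -> fully interpolated (theta / scale); gamma = 1 -> untouched.
    gamma = np.clip((rotations - lo) / (hi - lo), 0.0, 1.0)
    return (1.0 - gamma) * theta / scale + gamma * theta

def yarn_attention_factor(scale):
    """Multiplier for queries and keys (assumed form: 0.1 * ln(scale) + 1)."""
    return 0.1 * math.log(scale) + 1.0

# Assumed setup: head dimension 128, 4K training length, 16x extension.
theta = yarn_frequencies(dim=128, scale=16.0, train_len=4096)
print(theta[0], theta[-1], yarn_attention_factor(16.0))
```

Pairs that rotate many times within the training window keep their original frequency, pairs that never complete a rotation are fully interpolated, and the ramp blends the ones in between.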

LongRoPE (Phi-3)

Microsoft's approach, used in Phi-3: optimize per-dimension scaling factors via search. Phi-3 uses it to reach 128K context. The scaling factors are searched offline, then frozen in the model config. It targets the best quality at extreme lengths.
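The search itself is beyond a short example, but applying frozen factors is simple. The sketch below assumes one factor per dimension pair that divides that pair's frequency. The function name and the factor values are hypothetical; real factors come from the offline search and ship with the model config.

```python
import numpy as np

def rescaled_frequencies(dim, factors, base=10000.0):
    """Divide each pair's RoPE frequency by its own pre-searched factor."""
    factors = np.asarray(factors, dtype=np.float64)
    if factors.shape != (dim // 2,):
        raise ValueError("need exactly one factor per dimension pair")
    theta = base ** (-2.0 * np.arange(dim // 2) / dim)
    return theta / factors

# Hypothetical factors: near 1 for high-frequency pairs, larger for low ones.
dim = 8
factors = [1.0, 1.2, 4.0, 32.0]
freqs = rescaled_frequencies(dim, factors)
print(freqs)
```

Setting every factor to the same extension ratio reduces this to Position Interpolation, which is one way to see per-dimension factors as a generalization of the simpler methods.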

Extend RoPE context by rescaling its frequencies. YaRN is the open default. LongRoPE (Phi-3) is best for extreme lengths.