A model trained on a 4K context can be extended to 32K, 128K, or even 1M tokens with the right tricks. All of them rescale RoPE's positions or per-dimension frequencies, most often by changing the frequency base. The math is short; the empirical recipes are what changed the long-context game.
The problem
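RoPE rotates the i-th dimension pair at frequency θ_i = 10000^(-2i/d), and the model only ever sees the rotation angles that occur within its training context length. At positions beyond training, the angles fall outside that range. The low-frequency dimensions are hit hardest, because they never complete a full rotation during training; the high-frequency ones have already wrapped many times. Model performance drops sharply outside the trained range.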
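To make the problem concrete, here is a minimal sketch that computes the frequencies and lists the dimension pairs whose wavelength is longer than the training context. The head dimension of 128 and the 4K training length are assumptions chosen for illustration, not values from any specific model.

```python
import math

def rope_frequencies(d, base=10000.0):
    """Per-pair rotation frequencies theta_i = base^(-2i/d), i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

d = 128             # head dimension (assumed for illustration)
trained_max = 4096  # training context length (assumed for illustration)

thetas = rope_frequencies(d)

# A pair with frequency theta has wavelength 2*pi/theta tokens.
# Pairs whose wavelength exceeds the training length never complete
# a full rotation during training.
unseen = [i for i, t in enumerate(thetas) if 2 * math.pi / t > trained_max]

print(f"{len(unseen)} of {len(thetas)} pairs never complete a rotation in training")
```

The pairs in `unseen` are the ones whose angles at long positions are new to the model.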
Linear interpolation (Position Interpolation)
# Original: θ_i applied to pos directly
# PI: scale pos by ratio so max_pos maps to trained_max
#
# For 4K→16K extension (4x):
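# effective_pos = real_pos / 4
This squashes positions into the trained range. It is simple, but quality drops at large extension ratios, because neighboring tokens end up closer together in angle than anything seen in training. Used in early Llama extensions.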
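A minimal sketch of the position scaling, assuming a head dimension of 128 and the 4K→16K example. The function names are hypothetical helpers, not any library's API.

```python
def rope_angles(pos, d, base=10000.0):
    """Rotation angle for each dimension pair at a given position."""
    return [pos * base ** (-2.0 * i / d) for i in range(d // 2)]

def pi_angles(pos, d, scale, base=10000.0):
    """Position Interpolation: divide the position by the extension ratio."""
    return rope_angles(pos / scale, d, base)

trained_max, new_max = 4096, 16384
scale = new_max / trained_max  # 4.0 for the 4K -> 16K case

# Position 16000 in the extended model receives the angles
# that position 4000 had in the original model.
extended = pi_angles(16000, 128, scale)
original = rope_angles(4000, 128)
```

Every dimension is compressed by the same factor, which is why fine-grained local distinctions suffer as the ratio grows.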
NTK-aware scaling
# Don't scale all dimensions equally
# Leave the high-frequency dims nearly untouched; stretch the low-frequency ones
# Mathematically: increase the base from 10000 to a higher value
#
# base' = 10000 * α^(d/(d-2)) where α = new_max / orig_max
Preserves the high-frequency (local, fine-grained) information and interpolates the low frequencies: the slowest dimension is scaled by exactly 1/α, the fastest is unchanged. Often better quality than pure interpolation. Used in many open extensions.
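A minimal sketch of the base change, assuming a head dimension of 128 and α = 4. It compares each frequency before and after the change, so you can check which end of the spectrum moves.

```python
def ntk_base(base, alpha, d):
    """NTK-aware base: base' = base * alpha^(d / (d - 2))."""
    return base * alpha ** (d / (d - 2))

def frequencies(d, base):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

d = 128      # head dimension (assumed for illustration)
alpha = 4.0  # new_max / orig_max

orig = frequencies(d, 10000.0)
scaled = frequencies(d, ntk_base(10000.0, alpha, d))

# Ratio new/old per pair. Algebraically this is alpha^(-2i/(d-2)):
# 1 for the fastest pair (i = 0) and 1/alpha for the slowest (i = d/2 - 1).
ratios = [s / o for s, o in zip(scaled, orig)]
print(ratios[0], ratios[-1])
```

The exponent d/(d-2) is what makes the slowest pair land on exactly 1/α, matching what Position Interpolation would do to that one dimension.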
YaRN
Combines a per-dimension refinement of NTK-aware scaling (often called NTK-by-parts) with a temperature-style adjustment on the attention logits. Reported as one of the strongest open recipes for 4×-32× extensions. Used in Yarn-Mistral and Yarn-Llama. It needs only a few hundred fine-tuning steps to adapt the model.
LongRoPE (Phi-3)
Microsoft's approach, used in Phi-3: instead of a closed-form rule, find per-dimension scaling factors via search. Phi-3 uses it for its 128K-context variants. The scaling factors are searched offline, then frozen in the model config. Reported to hold quality well at extreme lengths.
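A minimal sketch of how frozen per-dimension factors would be applied at inference time. The search that produces them is not shown, and the factor values below are made up for illustration; they are not taken from any released Phi-3 config.

```python
def rescaled_frequencies(d, factors, base=10000.0):
    """Divide each pair's frequency by its own (searched, then frozen) factor."""
    if len(factors) != d // 2:
        raise ValueError("need one factor per dimension pair")
    return [base ** (-2.0 * i / d) / f for i, f in enumerate(factors)]

d = 8  # tiny head dimension so the list is readable (assumed)

# Hypothetical factors: close to 1 for fast pairs, larger for slow pairs.
factors = [1.0, 1.5, 4.0, 8.0]

freqs = rescaled_frequencies(d, factors)
```

With all factors equal to the extension ratio this reduces to Position Interpolation; the point of the search is to do better than any single uniform value.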