Long-context LLMs in 2026 handle 128K tokens (Llama 3.1), 1M tokens (Gemini), or more. Naive attention costs O(N²) in the sequence length N, which does not scale to these lengths. Several tricks make it work: sliding window, sparse attention, ring attention, plus RoPE extensions.
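To make the quadratic cost concrete, here is a rough back-of-the-envelope sketch. It assumes the full N x N score matrix is materialized for a single head at 2 bytes per score; the function name and the byte width are illustrative assumptions, not figures from any particular model.

```python
def score_matrix_bytes(n_tokens, bytes_per_score=2):
    """Bytes needed to materialize one head's full n x n attention score matrix."""
    return n_tokens * n_tokens * bytes_per_score


if __name__ == "__main__":
    for n in (4_096, 131_072, 1_048_576):
        gib = score_matrix_bytes(n) / 2**30
        print(f"{n:>9} tokens: {gib:,.2f} GiB per head per layer")
```

Under these assumptions, 128K tokens come to 32 GiB per head per layer and 1M tokens to 2 TiB. Fused kernels can avoid storing the matrix, but the number of score computations still grows as N².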
Sliding window attention
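# Each token attends only to the last W positions (e.g., W=4096)
# Memory: O(N·W) instead of O(N²)
# Compute: O(N·W)
Mistral 7B uses a sliding window. Quality drops on dependencies that span more than one window, but the approach makes long sequences affordable. It is usually combined with a cache-eviction policy, such as a rolling KV cache that drops entries older than W positions.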
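The idea can be sketched in a few lines of plain Python. This is a toy illustration, not Mistral's implementation: `sliding_window_mask`, `attended_pairs`, and `RollingKVCache` are hypothetical names, and the cache stores opaque placeholders rather than real tensors.

```python
from collections import deque


def sliding_window_mask(n, window):
    """n x n boolean mask: mask[i][j] is True when query i may attend to key j.

    Causal (j <= i) and limited to the last `window` positions (i - j < window).
    """
    return [[(j <= i) and (i - j < window) for j in range(n)] for i in range(n)]


def attended_pairs(n, window):
    """Number of (query, key) pairs the mask keeps; grows as O(n * window)."""
    return sum(sum(row) for row in sliding_window_mask(n, window))


class RollingKVCache:
    """Keeps only the most recent `window` (key, value) entries.

    Older entries are evicted automatically, so memory stays bounded
    no matter how long generation runs.
    """

    def __init__(self, window):
        self.entries = deque(maxlen=window)

    def append(self, key, value):
        self.entries.append((key, value))

    def __len__(self):
        return len(self.entries)


if __name__ == "__main__":
    print(attended_pairs(8, 3), "pairs with W=3 vs", attended_pairs(8, 8), "for full causal")
```

For 8 tokens and W=3 the mask keeps 21 query-key pairs, against 36 for full causal attention; the gap widens linearly versus quadratically as the sequence grows.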
Sparse attention (Longformer, BigBird)
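Each token attends to a fixed subset of positions: a sliding window, a few 'global' tokens, and (in BigBird) a random sample. Cost is linear in N. These patterns were used mainly in long-document models such as document QA systems, and are less common in modern decoder-only LLMs.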
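A minimal sketch of how such a pattern might be built, in the spirit of BigBird's window + global + random layout. The function name, parameters, and the choice to make the first few positions global are assumptions for illustration; the published models use block-wise layouts and their own hyperparameters.

```python
import random


def sparse_pattern(n, window, n_global, n_random, seed=0):
    """Return, for each query position, the sorted list of key positions it attends to.

    - The first `n_global` positions are global: they attend to every position,
      and every position attends to them.
    - Every other position attends to a symmetric local window of about
      `window` neighbours plus `n_random` randomly chosen extra positions.
    """
    rng = random.Random(seed)
    half = window // 2
    pattern = []
    for i in range(n):
        if i < n_global:
            keys = set(range(n))
        else:
            keys = set(range(max(0, i - half), min(n, i + half + 1)))
            keys.update(range(n_global))
            candidates = [j for j in range(n) if j not in keys]
            keys.update(rng.sample(candidates, min(n_random, len(candidates))))
        pattern.append(sorted(keys))
    return pattern


if __name__ == "__main__":
    pattern = sparse_pattern(n=64, window=6, n_global=2, n_random=3)
    kept = sum(len(keys) for keys in pattern)
    print(f"{kept} of {64 * 64} query-key pairs kept")
```

Because each non-global row has a bounded number of keys, the total work grows linearly with the sequence length rather than quadratically.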
Ring attention
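For distributed training: split the sequence across P devices arranged in a ring. Each device keeps its own query block and passes its K and V blocks to its neighbor, so after P steps every device has computed attention for its block against the whole sequence. This enables training on million-token sequences across a GPU cluster. Approaches like this are a plausible route to Gemini-class context lengths, though Google has not published how those models were trained.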
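The ring schedule can be simulated on a single machine to check that it reproduces ordinary attention. The sketch below is a simplification under stated assumptions: it is non-causal, unscaled, single-head, written in plain Python lists, and the "devices" are just loop iterations. The function names are hypothetical. The key ingredient is the running (online) softmax, which lets each device fold in one K/V block at a time.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, k, v):
    """Reference: softmax(q . k) weighted sum of v over the whole sequence."""
    out = []
    for qi in q:
        scores = [dot(qi, kj) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / z
                    for d in range(len(v[0]))])
    return out


def ring_attention(q, k, v, n_devices):
    """Simulate ring attention; assumes len(q) is divisible by n_devices."""
    n, dim = len(q), len(v[0])
    block = n // n_devices
    blocks = [range(p * block, (p + 1) * block) for p in range(n_devices)]
    out = [None] * n
    for p in range(n_devices):              # one iteration per simulated device
        # Running softmax state for each query held by device p.
        m = {i: -math.inf for i in blocks[p]}
        z = {i: 0.0 for i in blocks[p]}
        acc = {i: [0.0] * dim for i in blocks[p]}
        for step in range(n_devices):       # one ring hop per step
            src = (p - step) % n_devices    # K/V block currently held by device p
            for i in blocks[p]:
                for j in blocks[src]:
                    s = dot(q[i], k[j])
                    new_m = max(m[i], s)
                    scale = math.exp(m[i] - new_m)  # rescale earlier partial sums
                    w = math.exp(s - new_m)
                    z[i] = z[i] * scale + w
                    acc[i] = [a * scale + w * x for a, x in zip(acc[i], v[j])]
                    m[i] = new_m
        for i in blocks[p]:
            out[i] = [a / z[i] for a in acc[i]]
    return out
```

Each simulated device only ever touches one K/V block at a time, which is why per-device memory depends on the block size rather than on the full sequence length.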
Context distillation
Train a model to internalize a long context as a short list of summary tokens, so that inference runs on the summaries instead of the full text. Inference gets cheaper, but fidelity is lost: details missing from the summaries cannot be recovered. Used in some agentic workflows.
RAG as a long-context alternative
Don't put 1M tokens in context; retrieve the most relevant ~4K instead. This is cheaper and often gives better answers. Modern agents do this even when long context is technically available. RAG quality varies with the retriever and the chunking, so ablate carefully.
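A toy sketch of retrieval under a token budget, to make the trade-off concrete. Everything here is an illustrative assumption: real systems typically rank with learned embeddings rather than bag-of-words cosine similarity, and count tokens with the model's tokenizer rather than by whitespace. The names `bow`, `cosine`, and `retrieve` are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(c * c for c in a.values()))
           * math.sqrt(sum(c * c for c in b.values())))
    return num / den if den else 0.0


def retrieve(query, chunks, token_budget):
    """Pick the highest-scoring chunks that fit within `token_budget`."""
    q = bow(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)
    picked, used = [], 0
    for chunk in ranked:
        cost = len(chunk.split())  # crude proxy for model tokens
        if used + cost > token_budget:
            continue               # skip chunks that would overflow the budget
        picked.append(chunk)
        used += cost
    return picked


if __name__ == "__main__":
    chunks = [
        "ring attention passes key and value blocks between devices",
        "the cafeteria menu changes every tuesday",
        "sliding window attention limits each token to recent positions",
    ]
    print(retrieve("how does ring attention move key value blocks", chunks, 10))
```

When ablating, the retriever, the chunk size, and the budget are the natural knobs to vary one at a time.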