Long Context Strategies — Belgavi.AI Lab

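Long-context LLMs in 2026 handle context windows of 128K tokens (Llama 3.1), 1M tokens (Gemini), or more. Naive self-attention costs O(N²) time and memory in the sequence length N, which does not scale to these lengths. Several techniques make long context practical: sliding window attention, sparse attention, ring attention, and RoPE extensions.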
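To make the quadratic cost concrete, here is a rough back-of-the-envelope sketch. It assumes the N×N score matrix for a single head is fully materialized in 16-bit precision; the function name is illustrative, and fused kernels that never store the whole matrix avoid the memory cost, though not the compute.

```python
def full_attention_score_bytes(seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold one head's N x N attention score matrix.

    Assumption: the matrix is stored in full, at bytes_per_value per entry
    (2 bytes corresponds to fp16 or bf16).
    """
    return seq_len * seq_len * bytes_per_value


if __name__ == "__main__":
    for label, n in [("4K", 4 * 1024), ("128K", 128 * 1024), ("1M", 1024 * 1024)]:
        gib = full_attention_score_bytes(n) / 2**30
        print(f"N = {label:>4}: {gib:10.3f} GiB per head")
```

Under these assumptions a single head needs 32 GiB at N = 128K and 2 TiB at N = 1M, which is why the techniques below restrict or redistribute the computation.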


Sliding window attention

# Each token attends only to last W positions (e.g., W=4096)
# Memory: O(N·W) instead of O(N²)
# Compute: O(N·W)

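Mistral 7B uses sliding window attention. Quality drops on dependencies that span more than one window, although stacked layers let information propagate further than W tokens. In exchange, much longer sequences become affordable. It is usually combined with a cache-eviction policy, such as a rolling buffer that keeps only the most recent W keys and values.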
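The pattern can be sketched in NumPy as follows. This is a minimal illustration, not how any particular model implements it: the mask is applied to a dense score matrix (so this version is still O(N²) and only shows which positions are visible), and `RollingKVCache` is a hypothetical helper showing one possible eviction policy, where position t overwrites slot t mod W.

```python
import numpy as np


def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: query i may see keys j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)


def masked_attention(q, k, v, mask):
    """Softmax attention where masked-out positions get zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


class RollingKVCache:
    """Hypothetical fixed-size cache: keeps only the last `window` keys/values."""

    def __init__(self, window: int, dim: int):
        self.window = window
        self.k = np.zeros((window, dim))
        self.v = np.zeros((window, dim))
        self.count = 0

    def append(self, k_t, v_t):
        slot = self.count % self.window  # overwrite the oldest entry
        self.k[slot] = k_t
        self.v[slot] = v_t
        self.count += 1

    def contents(self):
        n = min(self.count, self.window)
        return self.k[:n], self.v[:n]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, w, d = 12, 4, 8
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

    dense = masked_attention(q, k, v, sliding_window_mask(n, w))

    # Decoding with the rolling cache gives the same result for the last token.
    cache = RollingKVCache(w, d)
    for t in range(n):
        cache.append(k[t], v[t])
    ck, cv = cache.contents()
    last = masked_attention(q[-1:], ck, cv, np.ones((1, len(ck)), dtype=bool))
    print(np.allclose(dense[-1], last[0]))
```

The cache stores entries out of chronological order once it wraps, which is harmless here because softmax attention does not depend on the order of the key/value pairs; any positional information is assumed to be already encoded in the keys.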

Sparse attention (Longformer, BigBird)

Each token attends to a fixed subset of positions: a sliding window, a few 'global' tokens, and (in BigBird) a random sample. The cost is linear in N. These patterns were used mainly in document QA models. They are less common in modern decoder-only LLMs.
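A BigBird-style pattern can be sketched as a boolean mask. This is only an illustration under simplifying assumptions: the published models work on blocks of tokens rather than single positions, and the function name, parameters, and seed are made up for this example.

```python
import numpy as np


def bigbird_style_mask(seq_len, window, num_global, num_random, seed=0):
    """Boolean mask combining local, global, and random attention.

    mask[i, j] is True when query i may attend to key j.
    Assumption: token-level (not block-level) sparsity, bidirectional window.
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(seq_len)

    # Local band: each token sees neighbours within window // 2 on each side.
    mask = np.abs(idx[:, None] - idx[None, :]) <= window // 2

    # Global tokens: the first num_global positions see, and are seen by, all.
    mask[:, :num_global] = True
    mask[:num_global, :] = True

    # Random links: each query additionally sees num_random random keys.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True
    return mask


if __name__ == "__main__":
    mask = bigbird_style_mask(seq_len=64, window=4, num_global=2, num_random=3)
    per_query = mask[2:].sum(axis=1)  # skip the global rows
    print("max keys per non-global query:", per_query.max())
    print("fraction of full attention used:", mask.mean())
```

The number of keys each non-global query sees is bounded by a constant (window + global + random), independent of the sequence length, which is where the linear cost comes from.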


Ring attention

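For distributed training: split the sequence across P devices arranged in a ring. Each device keeps its own block of queries and passes its K, V blocks to its neighbor, computing attention against each K, V block as it arrives. After P steps every query block has seen every key, and no device ever holds the full sequence. This enables training on million-token sequences across a GPU cluster. It is often cited as the kind of technique behind Gemini-class context lengths, though the training details of those models are not public.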
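The idea can be simulated on a single machine. The sketch below assumes non-causal attention and replaces real inter-device communication with list indexing. Each simulated device merges partial results with a running max and running normalizer (an online softmax), so the final output matches ordinary full attention. All names are illustrative.

```python
import numpy as np


def full_attention(q, k, v):
    """Reference: ordinary softmax attention over the whole sequence."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    s = s - s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    return (w / w.sum(axis=-1, keepdims=True)) @ v


def ring_attention_sim(q, k, v, num_devices):
    """Simulate ring attention; assumes seq_len >= num_devices."""
    d = q.shape[-1]
    q_blocks = np.array_split(q, num_devices)
    kv_blocks = list(zip(np.array_split(k, num_devices),
                         np.array_split(v, num_devices)))

    # Per-device running statistics for the online softmax.
    run_max = [np.full((qb.shape[0], 1), -np.inf) for qb in q_blocks]
    run_sum = [np.zeros((qb.shape[0], 1)) for qb in q_blocks]
    acc = [np.zeros((qb.shape[0], v.shape[-1])) for qb in q_blocks]

    for step in range(num_devices):
        for dev in range(num_devices):
            # At this step, device `dev` holds the K, V block that started
            # on device (dev - step) and has been passed along the ring.
            k_blk, v_blk = kv_blocks[(dev - step) % num_devices]
            s = q_blocks[dev] @ k_blk.T / np.sqrt(d)

            new_max = np.maximum(run_max[dev], s.max(axis=-1, keepdims=True))
            rescale = np.exp(run_max[dev] - new_max)  # shrink old partials
            p = np.exp(s - new_max)

            run_sum[dev] = run_sum[dev] * rescale + p.sum(axis=-1, keepdims=True)
            acc[dev] = acc[dev] * rescale + p @ v_blk
            run_max[dev] = new_max

    return np.concatenate([a / z for a, z in zip(acc, run_sum)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
    print(np.allclose(ring_attention_sim(q, k, v, num_devices=4),
                      full_attention(q, k, v)))
```

In a real system each device would overlap the send/receive of the next K, V block with the computation on the current one; that overlap is what hides the communication cost and is not modeled here.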

Context distillation

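Train a model to internalize a long context as a small set of summary tokens, so that at inference time the 'long context' becomes a short list of summaries. Inference is cheaper, but fidelity is lost: details that did not make it into the summaries cannot be recovered. Used in some agentic workflows.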

RAG as a long-context alternative

Don't put 1M tokens in context; retrieve the relevant 4K instead. This is cheaper and often better. Modern agents use retrieval even when long context is technically available. RAG quality varies with the chunking, the retriever, and the ranking, so ablate carefully.
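The retrieval step can be sketched with nothing but the standard library. This toy version scores chunks by cosine similarity over raw word counts, which is an assumption made to keep the example self-contained; a real pipeline would more likely use learned embeddings or BM25. All function names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q_vec = Counter(tokenize(query))
    scored = [(cosine(q_vec, Counter(tokenize(c))), i)
              for i, c in enumerate(chunks)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, stable ties
    return [chunks[i] for _, i in scored[:top_k]]


if __name__ == "__main__":
    chunks = [
        "The ring buffer evicts the oldest key and value.",
        "Retrieval selects a few relevant passages.",
        "Sliding window attention limits each token to recent positions.",
    ]
    for chunk in retrieve("sliding window attention", chunks, top_k=1):
        print(chunk)
```

Only the retrieved chunks are then placed in the prompt, so the model's context stays at a few thousand tokens regardless of how large the underlying corpus is.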

In summary: sliding window attention, sparse attention, and ring attention, together with RoPE extensions, are what make long context practical. RAG is often a simpler alternative for very long inputs.