Long-context models (1M+ tokens) didn't eliminate context management; they raised its leverage. Stuffing 500K tokens into a prompt just because you can is wasteful, and it tends to produce worse results than disciplined context curation.

Token budget is real even at 1M

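Cost scales linearly with input tokens. Latency grows with input length too, because the whole prompt has to be processed before the first output token appears. Quality degrades through the lost-in-the-middle pattern, where material buried mid-context gets less attention than material near the start or end, even on long-context models. Plan a budget per use case (4K, 32K, 128K) and pick conservatively.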
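
A minimal sketch of enforcing such a budget, under stated assumptions: the `estimate_tokens` heuristic (roughly 4 characters per token) is a stand-in for a real tokenizer, and the `fit_to_budget` name and the priority scheme are hypothetical, chosen only for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate, assuming ~4 characters per token.

    This is only a heuristic; use the model's real tokenizer for accurate counts.
    """
    return max(1, len(text) // 4)


def fit_to_budget(sections: list[tuple[int, str]], budget_tokens: int) -> list[str]:
    """Keep the highest-priority sections that fit inside the budget.

    Each section is (priority, text); a lower number means higher priority.
    Kept sections are returned in their original order.
    """
    # Visit sections from most to least important (stable for equal priorities).
    order = sorted(range(len(sections)), key=lambda i: sections[i][0])
    kept: set[int] = set()
    remaining = budget_tokens
    for i in order:
        cost = estimate_tokens(sections[i][1])
        if cost <= remaining:
            kept.add(i)
            remaining -= cost
    return [sections[i][1] for i in range(len(sections)) if i in kept]
```

Calling `fit_to_budget(sections, 4_000)` for a chat use case and `fit_to_budget(sections, 32_000)` for a heavier one makes the budget an explicit parameter instead of an accident of whatever happened to be appended to the prompt.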

Hierarchical summarization

Keep recent turns verbatim, summarize older turns, and compress the oldest turns into a single paragraph. Keep critical facts in a separate 'memory' section so they survive the summarization. Lossy, but practical.
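
The three tiers plus the memory section could be assembled roughly as follows. This is a sketch, not a reference implementation: `build_context`, its parameters, and the section labels are hypothetical, and `truncate_summary` merely truncates text as a placeholder for a real summarizer (typically a model call).

```python
from typing import Callable


def truncate_summary(text: str, max_chars: int) -> str:
    """Placeholder summarizer: truncates. A real system would call a model here."""
    if len(text) <= max_chars:
        return text
    return text[: max_chars - 3].rstrip() + "..."


def build_context(
    turns: list[str],
    memory: list[str],
    keep_verbatim: int = 4,
    keep_summarized: int = 8,
    summarize: Callable[[str, int], str] = truncate_summary,
) -> str:
    """Assemble memory, a digest of the oldest turns, summaries, and recent turns."""
    n = len(turns)
    recent_start = max(0, n - keep_verbatim)
    middle_start = max(0, recent_start - keep_summarized)

    oldest = turns[:middle_start]              # collapsed into one paragraph
    middle = turns[middle_start:recent_start]  # summarized one by one
    recent = turns[recent_start:]              # kept verbatim

    parts = []
    if memory:
        parts.append("Memory:\n" + "\n".join(f"- {fact}" for fact in memory))
    if oldest:
        parts.append("Earlier conversation (digest): " + summarize(" ".join(oldest), 400))
    if middle:
        parts.append(
            "Older turns (summarized):\n" + "\n".join(summarize(t, 120) for t in middle)
        )
    if recent:
        parts.append("Recent turns:\n" + "\n".join(recent))
    return "\n\n".join(parts)
```

The memory list is passed in separately and never summarized, so a critical fact stays intact no matter how far back in the conversation it was first stated.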

Retrieval over stuffing

Index the source docs and retrieve the relevant chunks per turn. This beats 'put everything in context' on both cost and quality for most workloads over 32K tokens.
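
A toy sketch of the index-then-retrieve loop, using only the standard library. The term-overlap score is a deliberately simple stand-in for a real ranking method such as BM25 or embedding similarity, and the function names and the `max_words` chunk size are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def chunk(text: str, max_words: int = 50) -> list[str]:
    """Split a document into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i : i + max_words]) for i in range(0, len(words), max_words)]


def score(query_terms: Counter, chunk_text: str) -> int:
    """Count how many query terms (with multiplicity) also occur in the chunk."""
    chunk_terms = Counter(tokenize(chunk_text))
    return sum(min(count, chunk_terms[term]) for term, count in query_terms.items())


def retrieve(query: str, docs: list[str], k: int = 3, max_words: int = 50) -> list[str]:
    """Return up to k chunks with the highest overlap, dropping zero-score chunks."""
    query_terms = Counter(tokenize(query))
    chunks = [c for doc in docs for c in chunk(doc, max_words)]
    ranked = sorted(chunks, key=lambda c: score(query_terms, c), reverse=True)
    return [c for c in ranked[:k] if score(query_terms, c) > 0]
```

Only the chunks returned by `retrieve` go into the prompt for a given turn, so prompt size depends on `k` and the chunk size, not on the size of the corpus.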

Even with a 1M-token context window, set a budget. Use hierarchical summarization for chat. Prefer retrieval over stuffing for docs.