Long-context models (1M+ tokens) didn't eliminate context management — they raised its leverage. Stuffing 500K tokens because you can is wasteful and produces worse results than disciplined context curation.
Advertisement
Token budget is real even at 1M
Cost scales linearly with input. Latency scales sub-linearly. Quality degrades on lost-in-the-middle pattern even on long-context models. Plan budgets per use case: 4K, 32K, 128K — pick conservative.
Hierarchical summarization
Recent turns verbatim, older turns summarized, even older summarized into a paragraph. Keep critical facts in a separate 'memory' section. Lossy but practical.
Advertisement
Retrieval over stuffing
Index source docs, retrieve relevant chunks per turn. Beats 'put everything in context' on both cost and quality for most workloads >32K tokens.
Even on 1M context, budget. Hierarchical summary for chat. Retrieval beats stuffing for docs.