▶ Interactive Lab

KV Cache Memory Growth

Watch KV cache memory grow with context length.

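KV cache bytes = 2 * L * kv_heads * d_head * ctx * batch * bytes_per_val, where the factor 2 covers keys and values, L is the number of layers, kv_heads is the number of key/value heads, d_head is the per-head dimension, ctx is the context length in tokens, batch is the number of concurrent sequences, and bytes_per_val is the storage size of one element (2 for FP16, 1 for FP8, 0.5 for INT4).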
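As a rough sketch, the formula above translates directly into a small Python function. The function name and the example configuration (32 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not values taken from the lab.

```python
def kv_cache_bytes(n_layers, kv_heads, d_head, ctx, batch, bytes_per_val):
    """Return the KV cache size in bytes for the given shape and precision."""
    # Factor 2: each layer stores one tensor of keys and one of values.
    return 2 * n_layers * kv_heads * d_head * ctx * batch * bytes_per_val


# Hypothetical configuration: 32 layers, 8 KV heads, head dim 128,
# 128K-token context, batch size 1, FP16 (2 bytes per value).
size = kv_cache_bytes(32, 8, 128, 128 * 1024, 1, 2)
print(f"{size / 2**30:.1f} GiB")  # 16.0 GiB
```

Every factor enters as a plain product, so doubling any one of them (context, batch, or bytes per value) doubles the total.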

What you're seeing

At long context lengths the KV cache often takes more memory than the model weights. Quantizing it to FP8 or INT4 cuts that footprint for serving.

★ KEY TAKEAWAY
KV cache memory scales linearly with context length. At long context it often exceeds the model weights and becomes the largest memory cost in serving.
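To make the comparison with model weights concrete, the sketch below estimates the number of cached tokens at which the KV cache matches the weight memory. The model shape (8 billion parameters, 32 layers, 8 KV heads, head dimension 128, FP16 everywhere) is a hypothetical example chosen for illustration, not a measurement of any particular system.

```python
def kv_bytes_per_token(n_layers, kv_heads, d_head, bytes_per_val):
    """KV cache bytes added by one token of one sequence."""
    # Factor 2: keys and values.
    return 2 * n_layers * kv_heads * d_head * bytes_per_val


# Hypothetical model: 8e9 parameters stored in FP16 (2 bytes each).
N_PARAMS = 8_000_000_000
weight_bytes = N_PARAMS * 2

# Hypothetical attention shape: 32 layers, 8 KV heads, head dim 128, FP16 cache.
per_token = kv_bytes_per_token(32, 8, 128, 2)

# Total cached tokens (context length x batch size) at which the
# KV cache is as large as the weights.
crossover_tokens = weight_bytes // per_token

print(f"KV cache per token: {per_token / 1024:.0f} KiB")  # 128 KiB
print(f"Crossover at about {crossover_tokens:,} tokens")  # 122,070 tokens
```

Under these assumptions a single 128K-token sequence already holds more cache than weights, and a batch of 16 sequences reaches the same point at roughly 7.6K tokens each.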
▶ WHAT TO TRY
  • Drag Context from 1K to 128K and watch memory grow in proportion: 128× more context means 128× more cache.
  • Switch Precision to FP8 or INT4 for an immediate 2× or 4× memory reduction relative to a 16-bit cache.
  • This is why production engines (vLLM, TensorRT-LLM) support FP8 KV cache, typically as an opt-in setting rather than the default.
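The experiments above can be approximated offline with a short sweep over context length and precision. This is a minimal sketch: the default model shape is the same hypothetical one (32 layers, 8 KV heads, head dimension 128, batch size 1), and the byte counts ignore any extra storage a real quantization scheme may need for scales or zero points.

```python
# Bytes per stored value for each precision (quantization metadata ignored).
BYTES_PER_VAL = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def kv_cache_gib(ctx, bytes_per_val, n_layers=32, kv_heads=8, d_head=128, batch=1):
    """KV cache size in GiB; the default shape is a hypothetical example."""
    total = 2 * n_layers * kv_heads * d_head * ctx * batch * bytes_per_val
    return total / 2**30


rows = []
for ctx_k in (1, 8, 32, 128):
    sizes = [kv_cache_gib(ctx_k * 1024, b) for b in BYTES_PER_VAL.values()]
    rows.append((ctx_k, sizes))

# Print a small table: one row per context length, one column per precision.
print("ctx     " + "".join(f"{name:>10}" for name in BYTES_PER_VAL))
for ctx_k, sizes in rows:
    print(f"{ctx_k:>4}K   " + "".join(f"{s:>10.3f}" for s in sizes))
```

Each row shows the 2× and 4× reductions from FP8 and INT4, and each step down the table scales every column by the same factor as the context length.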