▶ Interactive Lab

KV Cache Memory Calculator

See how KV cache grows with context length, batch size, and precision.

KV cache (bytes) = 2 × layers × kv_heads × head_dim × context × batch × bytes_per_value
(the factor of 2 counts one key tensor and one value tensor per layer)

What you're seeing

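The KV cache stores the keys and values of past tokens so attention can reuse them instead of recomputing them at every decoding step. Its size scales linearly with context length and batch size. In long-context serving, especially at larger batch sizes, the KV cache often takes more memory than the model weights.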
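As a rough sketch, the formula can be written as a small Python function. The name kv_cache_bytes is hypothetical, and the example shape (32 layers, 32 K/V heads, head_dim 128, FP16) is an assumed Llama-2-7B-style configuration chosen for illustration, not a value read from the lab.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, batch, bytes_per_value):
    """KV cache size in bytes; the leading 2 counts keys and values."""
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_value


# Assumed shape for illustration: 32 layers, 32 K/V heads, head_dim 128, FP16 (2 bytes).
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                      context=4096, batch=1, bytes_per_value=2)
print(f"{size / 2**30:.1f} GiB")  # 2.0 GiB

# Linear scaling: doubling the context (or the batch) doubles the cache.
doubled = kv_cache_bytes(32, 32, 128, context=8192, batch=1, bytes_per_value=2)
print(doubled // size)  # 2
```

Every term enters as a plain multiplication, which is why context and batch each scale the total linearly.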

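GQA (used in Llama 2 70B and the Llama 3 models): fewer K/V heads than Q heads, so the cache shrinks by the ratio of Q heads to K/V heads. MQA: a single K/V head shared by all Q heads, giving the smallest cache. Quantizing the KV cache to FP8 or INT4 is often one of the biggest production wins for long-context throughput, because it halves or quarters the cache relative to FP16.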
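To make the head-count effect concrete, here is a minimal sketch comparing the three attention layouts. The shape (32 layers, 32 Q heads, head_dim 128, 8K context, FP16) and the choice of 8 K/V heads for GQA are assumptions for illustration; kv_cache_bytes is a hypothetical helper implementing the formula from this page.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, batch, bytes_per_value):
    """KV cache size in bytes; the leading 2 counts keys and values."""
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_value


# Only the number of K/V heads changes between layouts; Q heads stay at 32 (assumed).
kv_heads_by_layout = {"MHA": 32, "GQA": 8, "MQA": 1}

sizes = {
    layout: kv_cache_bytes(layers=32, kv_heads=kv, head_dim=128,
                           context=8192, batch=1, bytes_per_value=2)
    for layout, kv in kv_heads_by_layout.items()
}

for layout, size in sizes.items():
    print(f"{layout}: {size / 2**20:.0f} MiB")
# MHA: 4096 MiB
# GQA: 1024 MiB
# MQA: 128 MiB
```

Under these assumptions GQA with 8 K/V heads cuts the cache 4× relative to MHA, and MQA cuts it 32×, matching the ratio of Q heads to K/V heads.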

★ KEY TAKEAWAY
KV cache scales linearly with context and batch. At long context it often exceeds the model weights.
▶ WHAT TO TRY
  • Switch between model sizes.
  • Drag context to 128K and watch the cache grow.
  • Switch precision to FP8 or INT4 for a 2× or 4× reduction relative to FP16.
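The precision experiment can be sketched in a few lines of Python. The shape below (32 layers, 8 K/V heads, head_dim 128, 128K context, batch 1) is an assumed Llama-3-8B-style configuration, and kv_cache_gib is a hypothetical helper. The sketch counts raw value storage only; real quantization schemes also store scales or zero-points, which adds some overhead.

```python
BITS_PER_VALUE = {"FP16": 16, "FP8": 8, "INT4": 4}


def kv_cache_gib(bits_per_value, layers=32, kv_heads=8, head_dim=128,
                 context=128 * 1024, batch=1):
    """KV cache size in GiB for a given storage width (quantization metadata ignored)."""
    num_values = 2 * layers * kv_heads * head_dim * context * batch  # keys + values
    return num_values * bits_per_value / 8 / 2**30


for name, bits in BITS_PER_VALUE.items():
    print(f"{name}: {kv_cache_gib(bits):.1f} GiB")
# FP16: 16.0 GiB
# FP8: 8.0 GiB
# INT4: 4.0 GiB
```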