KV Cache Memory Calculator


Inputs: Model, Context (32768), Batch, Precision

KV cache = 2 × layers × kv_heads × head_dim × context × batch × bytes_per_value
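The formula above can be sketched as a small Python function. The function name `kv_cache_bytes` and the example shape (32 layers, 8 K/V heads, head dimension 128, roughly a Llama-3-8B-like configuration) are assumptions chosen for illustration, not values taken from this calculator.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, batch=1, bytes_per_value=2.0):
    """Bytes needed for the KV cache, following the formula above."""
    # The leading 2 counts one tensor for keys and one for values.
    return int(2 * layers * kv_heads * head_dim * context * batch * bytes_per_value)


# Assumed example shape: 32 layers, 8 K/V heads, head_dim 128, FP16 (2 bytes).
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                      context=32768, batch=1, bytes_per_value=2)
print(f"{size / 2**30:.1f} GiB")  # 4.0 GiB
```

Every factor is a plain multiplier, so doubling the context or the batch doubles the result.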

What you're seeing

The KV cache stores the keys and values of past tokens so attention does not recompute them at every decoding step. Its size scales linearly with context length and batch size. In long-context or large-batch serving, the KV cache can exceed the size of the model weights.
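To make the comparison with model weights concrete, here is a rough break-even estimate. It assumes a hypothetical 8-billion-parameter model stored in FP16 with 32 layers, 8 K/V heads, and head dimension 128; the helper name `kv_bytes_per_token` is also made up for this sketch.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes added by one cached token in one sequence."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical model: 8e9 parameters at 2 bytes each (FP16 weights).
weight_bytes = 8e9 * 2

per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)

# Total cached tokens (context × batch) at which the cache matches the weights.
break_even_tokens = weight_bytes / per_token

print(f"{per_token / 1024:.0f} KiB per token")                        # 128 KiB per token
print(f"break-even at about {break_even_tokens:,.0f} cached tokens")  # about 122,070
```

Under these assumptions a single 128K-token sequence, or four concurrent 32K-token sequences, already holds more bytes in the cache than in the weights.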

GQA (Llama 2 70B, Llama 3): fewer K/V heads than Q heads, so a smaller cache. MQA: a single K/V head, so the smallest cache. Quantizing the KV cache to FP8 or INT4 is often one of the biggest production wins for long-context throughput.
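The effect of the K/V head count can be sketched by holding everything else fixed. The shape below (32 query heads, 32 layers, head dimension 128, 32768-token context, FP16) is an assumed example, and `kv_cache_gib` is a hypothetical helper, not part of any library.

```python
def kv_cache_gib(kv_heads, layers=32, head_dim=128, context=32768,
                 batch=1, bytes_per_value=2):
    """KV cache size in GiB for a given number of K/V heads."""
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_value / 2**30


# Assumed model with 32 query heads; only the K/V head count differs.
variants = {
    "MHA": 32,  # one K/V head per query head
    "GQA": 8,   # groups of 4 query heads share one K/V head
    "MQA": 1,   # all query heads share a single K/V head
}

sizes = {name: kv_cache_gib(heads) for name, heads in variants.items()}
for name, gib in sizes.items():
    print(f"{name}: {gib:.1f} GiB")
```

For this shape the sizes come out to 16, 4, and 0.5 GiB: the cache shrinks in direct proportion to the number of K/V heads.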

★ KEY TAKEAWAY

KV cache scales linearly with context and batch, and at long context it often exceeds the model weights.

▶ WHAT TO TRY

Switch between model sizes.
Drag context to 128K and watch the cache grow to 4× its size at 32K.
Switch precision to FP8 or INT4 for a 2× or 4× reduction relative to 16-bit.
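The same experiments can be approximated offline with a short sweep over context length and precision. The model shape (32 layers, 8 K/V heads, head dimension 128) and the names `BYTES_PER_VALUE` and `kv_cache_gib` are assumptions for this sketch. It also ignores the scale and zero-point metadata that real quantized KV formats usually store, so actual savings may be somewhat smaller.

```python
# Storage cost per cached value for each precision (INT4 packs two values per byte).
BYTES_PER_VALUE = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def kv_cache_gib(context, precision, layers=32, kv_heads=8, head_dim=128, batch=1):
    """KV cache size in GiB for an assumed model shape."""
    total = 2 * layers * kv_heads * head_dim * context * batch * BYTES_PER_VALUE[precision]
    return total / 2**30


table = {
    (context, precision): kv_cache_gib(context, precision)
    for context in (32768, 131072)
    for precision in BYTES_PER_VALUE
}

for (context, precision), gib in table.items():
    print(f"context={context:>6}  {precision:<4}  {gib:5.1f} GiB")
```

For this shape, FP16 at 128K tokens needs 16 GiB per sequence, while INT4 brings the same context back down to 4 GiB, the FP16 cost of a 32K context.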