Advertisement
KV cache = 2 × layers × kv_heads × head_dim × context × batch × bytes_per_value
What you're seeing
KV cache stores past keys and values for attention reuse. Size scales linearly with context length and batch size. For long-context serving, KV cache often exceeds the model weights.
GQA (Llama 2/3): fewer K/V heads than Q heads — smaller cache. MQA: one K/V head — smallest. Quantizing KV to FP8 or INT4 is the biggest production win for long-context throughput.
★ KEY TAKEAWAY
KV cache scales linearly with context and batch. Often exceeds model weights at long context.
▶ WHAT TO TRY
- Switch between model sizes.
- Drag context to 128K — see the cache grow huge.
- Switch precision to FP8 or INT4 for 2-4× reduction.