At long context, the KV cache can rival or exceed the model weights in memory, so quantizing it is one of the highest-leverage inference optimizations. The trade-off is quality against memory and throughput. Modern engines support FP8 and INT4 caches.
Why it's huge
# Llama 3 70B at seq=32K, batch=1, FP16:
# Per layer: 2 (K+V) * 8 (KV heads) * 128 (d_head) * 32K * 2 bytes = 128 MB
# 80 layers: 10 GB just for KV cache
# At batch=8: 80 GB
Already at moderate batch and context, the KV cache rivals the model weights (about 140 GB in FP16 for a 70B model), and at larger batch it exceeds them. Quantizing it relaxes the memory constraint for serving.
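To sanity-check this kind of sizing, the per-layer formula can be wrapped in a small helper. This is an illustrative sketch: the function name and the `bytes_per_elem` parameter are assumptions made for this example, and the INT4 row ignores the extra storage needed for scales.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2.0):
    """Bytes needed to cache K and V for every layer, KV head, and token."""
    per_token_per_layer = 2 * n_kv_heads * d_head * bytes_per_elem  # factor 2 = K and V
    return int(per_token_per_layer * seq_len * n_layers * batch)


# Llama 3 70B shape: 80 layers, 8 KV heads, d_head=128, 32K tokens.
for batch in (1, 8):
    for name, width in (("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)):
        size = kv_cache_bytes(80, 8, 128, 32 * 1024, batch=batch, bytes_per_elem=width)
        print(f"batch={batch} {name}: {size / GIB:.1f} GiB")
```

Plugging in a different model shape or context length shows how quickly the cache grows, since it scales linearly in both sequence length and batch size.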
FP8 cache (vLLM, TensorRT-LLM)
# Store K and V as FP8 (E4M3 or E5M2)
# 2x memory reduction vs FP16
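# Quality drop: typically <0.5% on benchmarks
The standard production choice for KV cache quantization in 2026. FP8 is natively supported on H100 and newer GPUs; CPU support (for example via AMX) depends on the processor generation, so check your target. In vLLM it is a one-flag change: --kv-cache-dtype fp8.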
INT4 cache (llama.cpp)
# Store K and V as INT4 with per-block scaling
# 4x memory reduction vs FP16 (slightly less once per-block scales are counted)
# Quality drop: 1-3% on benchmarks, sometimes more
More aggressive than FP8. Used in llama.cpp for memory-constrained inference, including CPU-only setups. Quality varies by task, and long-context coherence sometimes degrades, so verify on your own workload.
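The idea of 4-bit storage with per-block scaling can be sketched in a few lines. This is a simplified illustration, not llama.cpp's actual format: the block size of 32, the symmetric rounding scheme, and the function names are assumptions chosen for the example, and the codes are kept as plain Python integers rather than packed two per byte.

```python
import random

BLOCK = 32  # values sharing one scale (assumed block size for this sketch)


def quantize_int4(values, block=BLOCK):
    """Symmetric 4-bit quantization: one scale per block, integer codes in [-8, 7]."""
    blocks = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / 7.0 if amax > 0 else 1.0  # map the largest magnitude to code 7
        codes = [max(-8, min(7, round(v / scale))) for v in chunk]
        blocks.append((scale, codes))
    return blocks


def dequantize_int4(blocks):
    """Reconstruct approximate values from (scale, codes) pairs."""
    return [scale * code for scale, codes in blocks for code in codes]


random.seed(0)
keys = [random.gauss(0.0, 1.0) for _ in range(4 * BLOCK)]  # stand-in for cached K values
blocks = quantize_int4(keys)
restored = dequantize_int4(blocks)

max_err = max(abs(a - b) for a, b in zip(keys, restored))
# Storage cost if each block kept a 16-bit scale alongside its 4-bit codes.
bits_per_value = 4 + 16 / BLOCK
print(f"max abs error: {max_err:.4f}")
print(f"effective bits per value: {bits_per_value} ({16 / bits_per_value:.2f}x smaller than FP16)")
```

The round-trip error in each block is bounded by half that block's scale, which is why a block containing one large value loses precision for all of its neighbors.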
Per-channel vs per-tensor
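Per-tensor scaling uses one scale for the whole cache: fast, but a few large values stretch the range and cost precision everywhere else. Per-channel scaling (one scale per head_dim channel) is more accurate, at the cost of slightly more storage for the scales. INT4 caches generally rely on this finer granularity (per-channel or per-block) to keep quality acceptable.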
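A toy comparison makes the accuracy gap concrete. The sketch below assumes a tiny cache slice in which one channel carries much larger values than the others; the data, the 4-bit symmetric scheme, and the helper names are all hypothetical and chosen only to illustrate the effect.

```python
def scale_for(values, qmax=7):
    """Scale that maps the largest magnitude in `values` to the top integer code."""
    amax = max(abs(v) for v in values)
    return amax / qmax if amax > 0 else 1.0


def quant_error(values, scale, qmax=7):
    """Mean absolute round-trip error for symmetric quantization with a given scale."""
    total = 0.0
    for v in values:
        code = max(-qmax - 1, min(qmax, round(v / scale)))
        total += abs(v - code * scale)
    return total / len(values)


# Rows are tokens, columns are channels (head_dim = 4 here).
# The last channel holds outliers, an assumption made for illustration.
tokens = [
    [0.10, -0.20, 0.05, 12.0],
    [-0.15, 0.25, -0.08, -9.5],
    [0.12, -0.18, 0.07, 11.0],
    [-0.09, 0.22, -0.04, -10.5],
]
channels = [list(col) for col in zip(*tokens)]
flat = [v for row in tokens for v in row]

# Per-tensor: a single scale, dominated by the outlier channel.
per_tensor_err = quant_error(flat, scale_for(flat))

# Per-channel: each column gets its own scale; average the error over all values.
per_channel_err = sum(quant_error(c, scale_for(c)) * len(c) for c in channels) / len(flat)

print(f"per-tensor  mean abs error: {per_tensor_err:.4f}")
print(f"per-channel mean abs error: {per_channel_err:.4f}")
```

With a single scale, the small-magnitude channels all round to zero; with one scale per channel they keep most of their precision, and only the outlier channel pays for its own range.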
When to skip cache quant
Skip cache quantization for short-context use cases (under 2K tokens) where the cache is small, for quality-critical tasks that cannot afford a 1% drop, and on memory-rich systems where compute, not memory, is the bottleneck. Quantize the cache when you are running long contexts or are memory-constrained.