Attention is the part of a transformer that resists simple quantization. KV cache is the largest memory cost; quantizing it is essential at scale. Attention scores need numerical care. The recipes that work are specific.
KV cache memory dominates
At long context, KV cache memory exceeds model weights. Llama 3 70B with 32K context, batch 8: ~40GB for KV cache, ~35GB for weights at FP16. KV cache quantization saves more than weight quantization at long context.
FP8 KV cache
Standard in vLLM 0.6+, TensorRT-LLM. Stores K and V in FP8. ~2x memory reduction vs FP16. Quality drop usually <1% on benchmarks. The single biggest production win on long-context serving.
INT4 KV cache
More aggressive. ~4x memory reduction. Quality drop 1-3% typically. Suitable for memory-constrained edge inference. Per-channel scaling helps quality recover.
Why attention scores can't go too low
Softmax over scores produces probabilities; small precision errors get amplified through exp(). Attention computation usually runs in BF16 or FP32 even when weights and activations are lower precision. Don't quantize the score computation.
Sliding window attention
Phi, Mistral: sliding window limits attention to last N tokens. KV cache for window-only positions; much smaller cache. Combines well with quantization for very long contexts on small hardware.