Quantization for Attention

Attention is the part of a transformer that resists simple quantization. The KV cache is the largest memory cost at long context, so quantizing it is essential at scale. Attention scores need numerical care. The recipes that work are specific.

KV cache memory dominates

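At long context, KV cache memory rivals and can exceed model weights. Llama 3 70B (80 layers, 8 KV heads, head dimension 128) with 32K context, batch 8: ~86GB for KV cache at FP16, against ~140GB for weights at FP16 or ~35GB for weights at 4-bit. Once weights are quantized, the cache is the larger cost, so KV cache quantization saves more than further weight quantization at long context.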
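The sizing arithmetic can be sketched in a few lines. This is a back-of-the-envelope estimate, not a profiler: the Llama 3 70B shape used here (80 layers, 8 KV heads through grouped-query attention, head dimension 128) and the nominal 70 billion parameter count are assumptions taken from the published model configuration, and the function names are illustrative.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, batch, bytes_per_elem):
    """Bytes needed to cache K and V for every layer, token, and sequence."""
    # Factor of 2: one K tensor and one V tensor per layer.
    return 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per_elem


def weight_bytes(params, bits_per_param):
    """Bytes needed to store the model weights at a given bit width."""
    return params * bits_per_param // 8


# Assumed Llama 3 70B shape: 80 layers, 8 KV heads, head dimension 128.
shape = dict(layers=80, kv_heads=8, head_dim=128, context_len=32 * 1024, batch=8)

kv_fp16 = kv_cache_bytes(**shape, bytes_per_elem=2)
kv_fp8 = kv_cache_bytes(**shape, bytes_per_elem=1)
w_fp16 = weight_bytes(70_000_000_000, 16)
w_int4 = weight_bytes(70_000_000_000, 4)

for name, size in [("KV FP16", kv_fp16), ("KV FP8", kv_fp8),
                   ("weights FP16", w_fp16), ("weights 4-bit", w_int4)]:
    print(f"{name:>14}: {size / 1e9:6.1f} GB")
```

The cache grows linearly with both context length and batch size while the weights stay fixed, which is why the balance tips toward the cache as either one increases.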

FP8 KV cache

A standard option in vLLM 0.6+ and TensorRT-LLM. Stores K and V in FP8 instead of FP16, for a ~2x memory reduction. Quality drop is usually <1% on benchmarks. The single biggest production win on long-context serving.
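To get a feel for how much precision an FP8 cache entry keeps, the sketch below simulates rounding to FP8 in pure Python. It assumes the E4M3 layout (4 exponent bits, 3 mantissa bits, largest finite value 448) and a single per-tensor scale. The function names and sample values are hypothetical, and this is not how any serving library implements its kernels.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the assumed FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    # Exponents below -6 fall in the subnormal range and share its step size.
    exponent = max(math.frexp(a)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits: 8 steps per binade
    return sign * round(a / step) * step


def fp8_roundtrip(values, scale):
    """Simulate writing values to an FP8 cache with one scale, then reading back."""
    return [round_to_e4m3(v / scale) * scale for v in values]


# Hypothetical key values for one head.
keys = [0.013, -0.42, 1.7, -3.1, 0.0009, 2.25]
scale = max(abs(v) for v in keys) / E4M3_MAX  # map the largest entry to 448
restored = fp8_roundtrip(keys, scale)

worst = max(abs(a - b) / abs(a) for a, b in zip(keys, restored))
print(f"worst relative error: {worst:.3%}")
```

With 3 mantissa bits, values in the normal range carry a relative rounding error of at most 1/16, and the exponent bits keep that bound roughly constant across magnitudes. That is one plausible reason the quality cost stays small.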

INT4 KV cache

More aggressive. ~4x memory reduction vs FP16. Quality drop 1-3% typically. Suitable for memory-constrained edge inference. Per-channel scaling (one scale per channel instead of one per tensor) helps quality recover.
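A small sketch shows why per-channel scaling matters at 4 bits. It uses symmetric INT4 with integer codes in [-8, 7] and a made-up block of key values in which one channel is much larger than the others. The data, function names, and the choice of symmetric quantization are all assumptions for illustration.

```python
def int4_roundtrip(values, scale):
    """Symmetric INT4: scale, round, clamp to [-8, 7], then dequantize."""
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def scale_for(values):
    """Pick the scale that maps the largest magnitude to the INT4 code 7."""
    amax = max(abs(v) for v in values)
    return amax / 7 if amax > 0 else 1.0


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical K block: rows are tokens, columns are channels.
# The last channel has far larger magnitudes than the other two.
k_block = [
    [0.10, -0.20, 8.0],
    [0.30, 0.10, -6.0],
    [-0.20, 0.25, 7.0],
    [0.15, -0.30, 5.0],
]
flat = [v for row in k_block for v in row]

# Per-tensor: one scale shared by every entry.
per_tensor = int4_roundtrip(flat, scale_for(flat))
err_tensor = mean_abs_error(flat, per_tensor)

# Per-channel: one scale per column, then transpose back to row order.
columns = list(zip(*k_block))
restored_cols = [int4_roundtrip(col, scale_for(col)) for col in columns]
per_channel = [v for row in zip(*restored_cols) for v in row]
err_channel = mean_abs_error(flat, per_channel)

print(f"per-tensor error:  {err_tensor:.4f}")
print(f"per-channel error: {err_channel:.4f}")
```

With a single shared scale, the outlier channel sets the step size and every entry in the two small channels rounds to zero. Giving each channel its own scale keeps those entries distinguishable at the cost of storing a few extra scale values.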

Why attention scores can't go too low

Softmax over scores produces probabilities, and small precision errors get amplified through exp(): an absolute error of δ in a score multiplies that token's unnormalized weight by e^δ. Attention computation usually runs in BF16 or FP32 even when weights and activations are lower precision. Don't quantize the score computation.
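The amplification through exp() is easy to demonstrate. The sketch below snaps a handful of made-up attention scores to a coarse grid and to a fine grid, then compares the resulting softmax outputs with the exact ones. Rounding to a uniform grid is only a stand-in for low-precision arithmetic, and the scores and step sizes are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def round_to_grid(scores, step):
    """Snap each score to the nearest multiple of `step`."""
    return [round(s / step) * step for s in scores]


def max_prob_shift(scores, step):
    """Largest change in any attention probability caused by the rounding."""
    exact = softmax(scores)
    approx = softmax(round_to_grid(scores, step))
    return max(abs(a - b) for a, b in zip(exact, approx))


# Hypothetical pre-softmax scores for one query over four keys.
scores = [4.2, 3.9, 1.1, -0.7]

coarse = max_prob_shift(scores, 1.0)     # low-precision stand-in
fine = max_prob_shift(scores, 1 / 64)    # higher-precision stand-in

print(f"coarse grid shifts a probability by up to {coarse:.4f}")
print(f"fine grid shifts a probability by up to   {fine:.4f}")
```

On the coarse grid the two leading scores collapse to the same value, so the model can no longer tell which key it preferred. That kind of error does not average out downstream, which is the argument for keeping the score path in higher precision.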

Sliding window attention

Phi, Mistral: sliding window attention limits each token to attending over the last N tokens. The KV cache only has to hold positions inside the window, so it is much smaller and stops growing with context length. Combines well with quantization for very long contexts on small hardware.
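One way to keep only the window in memory is a rolling buffer in which position i overwrites slot i mod N. The class below is a minimal sketch of that idea with placeholder strings in place of real K/V tensors. The class and method names are hypothetical and do not correspond to any particular model's implementation.

```python
class RollingKVCache:
    """Fixed-size KV cache that keeps only the last `window` positions."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window  # memory is bounded by the window size
        self.seen = 0                 # total number of tokens appended so far

    def append(self, kv):
        # Position i lands in slot i % window, overwriting the oldest entry.
        self.slots[self.seen % self.window] = kv
        self.seen += 1

    def visible(self):
        """Return cached entries for the last `window` tokens, oldest first."""
        start = max(0, self.seen - self.window)
        return [self.slots[pos % self.window] for pos in range(start, self.seen)]


cache = RollingKVCache(window=4)
for pos in range(10):
    cache.append(f"kv{pos}")

print(cache.visible())   # entries for positions 6 through 9
print(len(cache.slots))  # still 4 slots after 10 tokens
```

Because the buffer never grows past N entries, the per-sequence cache cost is fixed regardless of how long generation runs, and each slot can itself be stored in FP8 or INT4.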

In short: use FP8 KV cache as the default and INT4 when memory is tight. Keep the score computation in high precision. Use sliding window attention to cap cache size.
