Modern LLM inference is not 'one precision throughout'. Different layers and tensors use different precisions to balance speed, memory, and quality. Knowing what runs at what precision helps you reason about quality regressions and pick deployment configurations.
The typical mix
Weights: INT4 (most params). KV cache: FP8 (compressing the dominant memory cost). Activations: FP8 or BF16 (depending on stack). Output projection: BF16 (numerical stability). Logits: FP32 (for sampling stability). Five precisions in one forward pass; standard.
Why not all one precision
All-INT4 hurts quality on certain layers (especially output, certain attention paths). All-FP16 wastes memory on quantizable weights. The mixed precision is the cost-quality optimum; pure precisions are corners.
Stack support
vLLM 0.6+, SGLang, TensorRT-LLM all do mixed precision automatically per layer. You configure target weight precision; they pick the right activation precision per op. Manual fine-tuning rarely needed.
Hardware constraints
Hopper (H100/H200): FP8 tensor cores. INT4 via Marlin kernels. Ampere (A100): FP16/BF16 tensor cores, INT4 via dequantize-then-FP16. Blackwell (B100/B200): native FP4. Pick precisions your hardware accelerates; mismatch wastes performance.
Debugging
If a quantized model is producing garbage: bisect — which layer's precision change broke it. Tools (NVIDIA's nv-cusparselt, llm-compressor's per-layer eval) make this manageable. Don't blame the whole quantization recipe; one layer often is the culprit.