LLM quantization went mainstream when GPT-class models needed to fit on consumer GPUs. Choosing between INT8, INT4, and lower is no longer a 'research vs prod' question; it's the daily inference decision. The quality/size tradeoff is now well characterized.
INT8: nearly-free downsizing
Weights and activations both quantized to 8-bit. ~0.5% quality drop on benchmarks. ~2x smaller memory than FP16, ~1.5-2x faster inference on modern GPUs (Tensor Core INT8 paths). The default choice for production inference unless you're memory-constrained.
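To make the 8-bit mapping concrete, here is a minimal sketch of symmetric absmax quantization in plain Python. It assumes one scale per tensor; production kernels typically use per-channel or per-group scales, and the function names and sample weights are illustrative only, not taken from any library.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    absmax = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale works.
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [x * scale for x in q]


# Made-up weights, purely for illustration.
weights = [0.02, -0.51, 0.33, 1.27, -0.08]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# For values inside the clipping range, rounding error is at most scale / 2.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The largest-magnitude value sets the scale, so a single outlier stretches the grid and costs precision everywhere else. That is one reason activations are usually harder to quantize than weights.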
INT4: aggressive but viable for weights
Weights to 4-bit, activations stay at 8 or 16 bits (mixed precision). ~1-3% quality drop with GPTQ/AWQ. ~4x smaller weights than FP16. Right for fitting 70B models on 48GB GPUs.
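The 70B-on-48GB claim follows from simple arithmetic, sketched below. The helper is hypothetical and counts weights only: it ignores per-group scale and zero-point overhead, the KV cache, and activation memory, all of which need headroom on top.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_70B = 70e9

footprints = {bits: weight_memory_gb(PARAMS_70B, bits) for bits in (16, 8, 4)}
for bits, gb in footprints.items():
    print(f"{bits:>2}-bit weights: {gb:.0f} GB")
```

At 4 bits the weights come to about 35 GB, which leaves roughly 13 GB on a 48GB card for everything else. At 8 bits the weights alone (about 70 GB) no longer fit.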
INT3, INT2, binary: not free
Below 4-bit, quality drops sharply. INT2 with QuIP or similar can preserve ~80% of full-precision performance, but most workloads can't tolerate that drop. Right for research, not production.
Calibration matters more than algorithm
GPTQ, AWQ, SmoothQuant — algorithms differ at the margin. The bigger lever is calibration data: 1024-2048 samples representative of your inference distribution. Wrong calibration = big quality hit regardless of algorithm.
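A toy sketch of why calibration data matters: if the quantization range is estimated from samples that don't look like real traffic, values outside that range get clipped. This is a deliberately simplified absmax scheme, not how GPTQ or AWQ use calibration data internally, and all numbers and names below are invented for illustration.

```python
def calibrate_scale(calibration_activations, num_levels=127):
    """Pick a scale so the largest calibration value maps to the top integer level."""
    absmax = max(abs(x) for x in calibration_activations)
    return absmax / num_levels


def mean_quantization_error(activations, scale, num_levels=127):
    """Mean absolute error after quantize-then-dequantize with clipping."""
    total = 0.0
    for x in activations:
        q = max(-num_levels, min(num_levels, round(x / scale)))
        total += abs(x - q * scale)
    return total / len(activations)


# Hypothetical activations: generic calibration text has a much smaller range
# than the domain the model actually serves.
generic_calibration = [0.1, -0.4, 0.9, -1.0, 0.5]
domain_calibration = [0.3, -2.5, 6.0, -8.0, 4.2]
inference_activations = [0.2, -3.1, 5.5, -7.9, 4.0]

err_generic = mean_quantization_error(
    inference_activations, calibrate_scale(generic_calibration)
)
err_domain = mean_quantization_error(
    inference_activations, calibrate_scale(domain_calibration)
)
print(f"generic calibration error: {err_generic:.3f}")
print(f"domain calibration error:  {err_domain:.3f}")
```

With the mismatched calibration set, every inference value above 1.0 in magnitude is clipped, so the error is dominated by clipping rather than rounding. The quantizer is identical in both runs; only the calibration samples changed.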
Practical guidance
Production: INT8 if you have the GPU memory, INT4 if you don't. Use AWQ or GPTQ as the algorithm. Calibrate on your domain. Validate on your own evals, not generic benchmarks: quantization sometimes hurts long-context or reasoning tasks more than benchmark numbers show.
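One way to act on the validation advice is a per-task regression gate that compares the quantized model against the full-precision baseline. The sketch below assumes you already have scores from your own eval harness; the task names, scores, and 2% threshold are hypothetical placeholders.

```python
def find_regressions(baseline_scores, quantized_scores, max_relative_drop=0.02):
    """Return tasks whose relative score drop exceeds the tolerance.

    Assumes higher is better and every baseline score is positive.
    """
    regressions = {}
    for task, base in baseline_scores.items():
        drop = (base - quantized_scores[task]) / base
        if drop > max_relative_drop:
            regressions[task] = round(drop, 4)
    return regressions


# Made-up scores: the generic benchmark barely moves while the
# task-specific evals degrade noticeably.
baseline = {
    "generic_benchmark": 0.70,
    "long_context_qa": 0.60,
    "multi_step_reasoning": 0.50,
}
quantized = {
    "generic_benchmark": 0.695,
    "long_context_qa": 0.54,
    "multi_step_reasoning": 0.47,
}

regressions = find_regressions(baseline, quantized)
print(regressions)
```

Gating per task rather than on an average keeps a small aggregate drop from hiding a large regression on the workloads you care about.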