Three libraries dominate open-source INT4 LLM quantization in 2026: bitsandbytes (Tim Dettmers), AutoAWQ, and AutoGPTQ. They make different trade-offs in quality, speed, and deployment story. Knowing which one fits your stack avoids the 'we tried INT4 and it was worse' anti-result.
bitsandbytes — easy QLoRA
Used by Hugging Face Transformers via load_in_4bit. NF4 quantization, applied at load time with no calibration step. Custom CUDA kernels rather than native PyTorch ops; runs on Ampere+ and is not limited to it. Best for: QLoRA fine-tuning, exploratory work. Worst for: production serving (slower than AWQ/GPTQ at inference).
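A minimal sketch of the load_in_4bit path, assuming transformers, accelerate, bitsandbytes, and a CUDA GPU are available. The names `NF4_SETTINGS` and `load_nf4` are illustrative, and bf16 compute with double quantization is one common QLoRA-style choice rather than a requirement.

```python
# Settings passed to BitsAndBytesConfig, kept as plain data so they are easy to inspect.
NF4_SETTINGS = {
    "load_in_4bit": True,               # store weights in 4-bit
    "bnb_4bit_quant_type": "nf4",       # NormalFloat4 rather than plain FP4
    "bnb_4bit_use_double_quant": True,  # also quantize the quantization constants
}


def load_nf4(model_id: str):
    """Load a causal LM with NF4 weights (hypothetical helper, needs a CUDA GPU)."""
    # Imported lazily so the settings above can be read without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype used after dequantization
        **NF4_SETTINGS,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",  # relies on accelerate to place layers
    )
```

The returned model can then be wrapped with a LoRA adapter for QLoRA-style fine-tuning.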
AutoAWQ — production inference
AWQ algorithm with optimized kernels (Marlin on Ampere+). Best-in-class inference speed in 2026. Supported in vLLM, SGLang, TensorRT-LLM. Best for: production deployment. Slight quality lead over GPTQ on most benchmarks.
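A rough sketch of producing an AWQ checkpoint with AutoAWQ, assuming the autoawq and transformers packages and a GPU with enough memory for the full-precision model. `quantize_to_awq` and `AWQ_QUANT_CONFIG` are illustrative names; the settings shown (4-bit, group size 128, GEMM kernels) are a commonly used configuration, not the only valid one.

```python
# A commonly used 4-bit AWQ configuration; "GEMM" selects one of AutoAWQ's kernel versions.
AWQ_QUANT_CONFIG = {
    "w_bit": 4,
    "q_group_size": 128,
    "zero_point": True,
    "version": "GEMM",
}


def quantize_to_awq(model_path: str, output_dir: str) -> None:
    """Quantize a full-precision checkpoint to AWQ and save it (hypothetical helper)."""
    # Imported lazily: both packages are heavy and need a GPU-enabled install.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # With no calibration data passed, the library falls back to its default calibration set.
    model.quantize(tokenizer, quant_config=AWQ_QUANT_CONFIG)

    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
```

The saved directory is what an inference server such as vLLM would then be pointed at.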
AutoGPTQ — mature, broad model support
GPTQ algorithm. Slightly slower than AWQ at inference. Broader model architecture coverage (handles unusual architectures more reliably). Mature; first to support new model releases.
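For comparison, a sketch of the equivalent GPTQ flow with AutoGPTQ, assuming the auto-gptq and transformers packages. `quantize_to_gptq` and `GPTQ_SETTINGS` are illustrative names. Unlike the AWQ sketch's default, calibration text is passed in explicitly here, which makes the choice of calibration data visible.

```python
# Typical 4-bit GPTQ settings; desc_act=False trades a little accuracy for faster inference.
GPTQ_SETTINGS = {"bits": 4, "group_size": 128, "desc_act": False}


def quantize_to_gptq(model_path: str, output_dir: str, calibration_texts: list) -> None:
    """Quantize a checkpoint with GPTQ using caller-supplied calibration text (hypothetical helper)."""
    # Imported lazily: both packages are heavy and need a GPU-enabled install.
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # Each calibration example is a tokenized sample (input_ids plus attention_mask).
    examples = [tokenizer(text) for text in calibration_texts]

    model = AutoGPTQForCausalLM.from_pretrained(model_path, BaseQuantizeConfig(**GPTQ_SETTINGS))
    model.quantize(examples)

    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
```

In practice the calibration texts should resemble the traffic the model will serve; a single generic sentence is enough to run the code but not to get good quality.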
Picking
Fine-tuning: bitsandbytes (QLoRA recipe). Serving the result: AutoAWQ (merge the adapter into the full-precision base weights, then requantize to AWQ format for production). New model just released: AutoGPTQ first, AutoAWQ once it has caught up. Memory-only constraint: any of the three.
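The hand-off from fine-tuning to serving usually means folding the LoRA adapter into full-precision base weights before quantizing for production. A minimal sketch with PEFT follows, assuming the peft and transformers packages; `merge_adapter` and its arguments are hypothetical names, and fp16 is an assumed dtype.

```python
def merge_adapter(base_model_id: str, adapter_dir: str, output_dir: str) -> None:
    """Merge a LoRA adapter into full-precision base weights and save the result."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model unquantized: merging needs real (not 4-bit) weights.
    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float16)

    # Attach the adapter, fold its weights into the base layers, and drop the PEFT wrappers.
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

    merged.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)
```

The directory written here is an ordinary full-precision checkpoint, so it can be fed to whichever quantizer is used for serving.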
Common gotchas
Quantizing on one machine and serving on another with different hardware can break on kernel compatibility: Marlin kernels, for example, need Ampere or newer. For AWQ and GPTQ, calibration data choice changes quality more than the choice of algorithm. AWQ models can't be merged with adapters without dequantizing first.
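One cheap guard against the kernel-compatibility gotcha is to check the serving GPU's compute capability before deploying. The sketch below assumes Ampere corresponds to compute capability 8.0 and uses PyTorch only to read the local device; the function names are illustrative.

```python
# Compute capability of the first Ampere GPUs (major, minor).
AMPERE = (8, 0)


def supports_ampere_kernels(capability: tuple) -> bool:
    """Return True if a (major, minor) compute capability is Ampere or newer."""
    return tuple(capability) >= AMPERE


def local_gpu_capability():
    """Return the (major, minor) capability of GPU 0, or None if no CUDA device is visible."""
    import torch  # imported lazily so the pure check above works without PyTorch

    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)
```

For example, `supports_ampere_kernels((7, 5))` is false for a Turing card, which signals that a checkpoint relying on Ampere-only kernels needs a different kernel or format on that machine.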