bitsandbytes vs AutoAWQ vs AutoGPTQ

Three libraries dominate open-source INT4 LLM quantization in 2026: bitsandbytes (Tim Dettmers), AutoAWQ, and AutoGPTQ. They make different trade-offs in quality, speed, and deployment story. Knowing which one fits your stack avoids the 'we tried INT4 and it was worse' anti-result.


bitsandbytes — easy QLoRA

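Used by Hugging Face Transformers via load_in_4bit (set through BitsAndBytesConfig in current releases). NF4 quantization, applied at load time with no calibration step. Ships its own CUDA kernels behind drop-in PyTorch modules; works on Ampere+. Best for: QLoRA fine-tuning, exploratory work. Worst for: production serving (slower than AWQ/GPTQ at inference).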
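A minimal sketch of the load_in_4bit path, assuming recent transformers, bitsandbytes, accelerate, and torch are installed and a CUDA GPU is present. The helper names nf4_kwargs and load_nf4 are made up for illustration, and the model ID is left to the caller. The heavy imports sit inside the loader so the settings can be inspected on a machine without a GPU.

```python
def nf4_kwargs(double_quant: bool = True) -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig (4-bit NF4)."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": double_quant,
        "bnb_4bit_compute_dtype": "bfloat16",  # resolved to a torch dtype below
    }


def load_nf4(model_id: str):
    """Load a causal LM with NF4 weights (hypothetical helper, needs a GPU)."""
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = nf4_kwargs()
    # Turn the dtype name into the actual torch dtype object.
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(**kwargs),
        device_map="auto",  # relies on accelerate for placement
    )
```

Weights are quantized as they load, so nothing is written to disk in a new format; that convenience is what makes this path a good fit for exploratory work.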

AutoAWQ — production inference

AWQ (activation-aware weight quantization) algorithm with optimized kernels (Marlin on Ampere+). Best-in-class inference speed in 2026. Supported in vLLM, SGLang, TensorRT-LLM. Slight quality lead over GPTQ on most benchmarks. Best for: production deployment.


AutoGPTQ — mature, broad model support

GPTQ algorithm. Slightly slower than AWQ at inference. Broader model architecture coverage (handles unusual architectures more reliably). Mature, and usually the first of the two to support a new model release.
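For comparison, a GPTQ run could be sketched as follows, assuming the auto_gptq and transformers packages are installed. The helper name quantize_gptq, the paths, and the calibration texts are placeholders supplied by the caller; the settings are typical 4-bit values rather than tuned ones.

```python
# Typical 4-bit GPTQ settings; desc_act=False trades a little accuracy for speed.
GPTQ_SETTINGS = {"bits": 4, "group_size": 128, "desc_act": False}


def quantize_gptq(model_path: str, out_dir: str, calibration_texts: list[str]) -> None:
    """Quantize a checkpoint with GPTQ using caller-supplied calibration text."""
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # Each example is a tokenizer output carrying input_ids and attention_mask.
    examples = [tokenizer(text) for text in calibration_texts]

    model = AutoGPTQForCausalLM.from_pretrained(
        model_path, BaseQuantizeConfig(**GPTQ_SETTINGS)
    )
    model.quantize(examples)
    model.save_quantized(out_dir)
```

Unlike the bitsandbytes path, calibration examples are a required input here, so the text passed in directly shapes the quantized weights.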

Picking

Fine-tuning: bitsandbytes (QLoRA recipe). Serving the result: AutoAWQ (merge the adapter into the full-precision base weights, then re-quantize to AWQ format for production). New model just released: AutoGPTQ first, AutoAWQ once it catches up. Memory-only constraint: any of the three.
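The rules above are simple enough to write down as a lookup. The sketch below is only a restatement of them; the function name, the task labels, and the awq_supports_model flag are invented for illustration.

```python
def pick_library(task: str, awq_supports_model: bool = True) -> str:
    """Map a task to a quantization library, following the rules of thumb above.

    task: "finetune", "serve", or "fit-in-memory" (hypothetical labels).
    awq_supports_model: False for a freshly released architecture that
    AutoAWQ has not caught up with yet.
    """
    if task == "finetune":
        return "bitsandbytes"
    if task == "serve":
        return "AutoAWQ" if awq_supports_model else "AutoGPTQ"
    if task == "fit-in-memory":
        return "any of the three"
    raise ValueError(f"unknown task: {task!r}")


print(pick_library("finetune"))
print(pick_library("serve", awq_supports_model=False))
```

The only real branch is the serving case: the default is AutoAWQ, with AutoGPTQ as the fallback while support for a new model is still missing.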

Common gotchas

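Quantizing on one machine and serving on another with different hardware can run into kernel compatibility problems: confirm the serving GPU supports the kernels the checkpoint was packed for. Calibration data choice changes quality more than the choice of algorithm. AWQ models can't be merged with adapters without dequantizing first.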
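Since calibration data matters this much, it helps to build the calibration set deliberately from text that resembles production traffic. The helper below is a hypothetical, standard-library-only sketch; the defaults (128 samples, 200-character minimum) are arbitrary starting points, not recommendations from either library.

```python
import random


def build_calibration_set(
    texts: list[str],
    n_samples: int = 128,
    min_chars: int = 200,
    seed: int = 0,
) -> list[str]:
    """Pick a reproducible sample of in-domain texts for calibration."""
    seen: set[str] = set()
    pool: list[str] = []
    for text in texts:
        text = text.strip()
        # Skip fragments too short to exercise the model, and exact duplicates.
        if len(text) >= min_chars and text not in seen:
            seen.add(text)
            pool.append(text)

    if len(pool) <= n_samples:
        return pool
    # A fixed seed keeps the sample identical across quantization runs.
    return random.Random(seed).sample(pool, n_samples)
```

The resulting list of strings can be tokenized into GPTQ calibration examples, and AutoAWQ's quantize step is understood to accept such a list through its calib_data argument; check the installed version before relying on that. Keeping the seed fixed makes quality differences between runs attributable to the algorithm or settings rather than to a different sample.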

bitsandbytes for fine-tuning, AutoAWQ for serving, AutoGPTQ for broad model support. Calibration data is a bigger quality lever than the algorithm.