AWQ (Activation-aware Weight Quantization) and GPTQ are the two dominant algorithms for post-training 4-bit weight quantization of LLMs. Both work well, but they suit different workloads, and as of 2026 the trade-offs are clearer than when both were new.
GPTQ — column-wise greedy
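GPTQ quantizes each layer's weights column by column. After each column it adjusts the not-yet-quantized columns to compensate for the rounding error, using second-order (Hessian) statistics estimated from calibration data. A run takes roughly 1-2 hours for a 70B model. Quality is good, tooling support is broad, and the ecosystem is mature.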
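The error-propagation step can be sketched in plain Python. This is a toy illustration, not the reference implementation: it handles a single weight row, uses one fixed quantization scale, and updates the inverse Hessian explicitly, whereas production GPTQ code works on blocks of columns and uses a Cholesky factorization for numerical stability. The function names and the 1% damping value are assumptions made for this example.

```python
def invert(matrix):
    """Invert a small square matrix with Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def hessian(inputs, damp=0.01):
    """H = X^T X over calibration inputs, plus damping on the diagonal (assumed 1%)."""
    n = len(inputs[0])
    h = [[sum(x[i] * x[j] for x in inputs) for j in range(n)] for i in range(n)]
    mean_diag = sum(h[i][i] for i in range(n)) / n
    for i in range(n):
        h[i][i] += damp * mean_diag
    return h


def round_to_grid(value, scale, qmin=-8, qmax=7):
    """Round to the nearest point of a symmetric 4-bit grid."""
    return max(qmin, min(qmax, round(value / scale))) * scale


def quantize_rtn(weights, scale):
    """Baseline: round every weight independently."""
    return [round_to_grid(w, scale) for w in weights]


def quantize_gptq(weights, inputs, scale):
    """Quantize one row column by column, pushing each error onto later columns."""
    w = list(weights)
    hinv = invert(hessian(inputs))
    n = len(w)
    quantized = [0.0] * n
    for j in range(n):
        quantized[j] = round_to_grid(w[j], scale)
        err = (w[j] - quantized[j]) / hinv[j][j]
        # Compensate: shift the columns that are still unquantized.
        for k in range(j + 1, n):
            w[k] -= err * hinv[j][k]
        # Remove column j from the inverse Hessian (rank-one downdate).
        d = hinv[j][j]
        col = [hinv[r][j] for r in range(n)]
        row = list(hinv[j])
        for r in range(j + 1, n):
            for c in range(j + 1, n):
                hinv[r][c] -= col[r] * row[c] / d
    return quantized


def output_error(weights, quantized, inputs):
    """Squared error of the layer output on the calibration inputs."""
    return sum(sum((w - q) * xi for w, q, xi in zip(weights, quantized, x)) ** 2
               for x in inputs)


# Toy layer: two correlated input channels, one weight row, grid step 1.0.
inputs = [[1.0, 1.0], [1.0, 0.0]]
weights = [0.4, 0.3]
rtn = quantize_rtn(weights, scale=1.0)
gptq = quantize_gptq(weights, inputs, scale=1.0)
print(rtn, round(output_error(weights, rtn, inputs), 4))    # [0.0, 0.0], error about 0.65
print(gptq, round(output_error(weights, gptq, inputs), 4))  # [0.0, 1.0], error about 0.25
```

In this toy case independent rounding sends both weights to zero, while the column-wise pass rounds the second weight up to absorb the error left by the first, which lowers the output error on the calibration inputs.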
AWQ — protect important channels
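AWQ uses activation statistics to identify the small fraction of salient weight channels, then scales those channels up before quantization so they lose less precision to rounding. The inverse scale is folded into the preceding operation, so the layer's output is unchanged. Because it needs no backpropagation or reconstruction, it is faster to apply (~30 min on 70B). Quality is slightly better than GPTQ on most benchmarks.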
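A minimal sketch of the scaling idea follows, under simplifying assumptions: one quantization scale per weight row instead of group-wise scales, a per-channel scale of the form (mean |activation|) ** alpha, and a plain grid search over alpha that minimizes output error on the calibration inputs. The function names and the toy data are invented for illustration and do not come from any AWQ library.

```python
def quantize_row(row, qmax=7):
    """Symmetric round-to-nearest 4-bit quantization with one scale per row."""
    step = max(abs(v) for v in row) / qmax or 1.0
    return [max(-qmax - 1, min(qmax, round(v / step))) * step for v in row]


def awq_scales(inputs, alpha):
    """Per-input-channel scale: mean absolute activation raised to alpha."""
    n = len(inputs[0])
    mean_abs = [sum(abs(x[j]) for x in inputs) / len(inputs) for j in range(n)]
    return [max(m, 1e-8) ** alpha for m in mean_abs]


def output_error(weights, dequantized, inputs):
    """Squared error of the layer output on the calibration inputs."""
    err = 0.0
    for x in inputs:
        for w_row, q_row in zip(weights, dequantized):
            err += sum((w - q) * xi for w, q, xi in zip(w_row, q_row, x)) ** 2
    return err


def awq_quantize(weights, inputs, grid=20):
    """Search alpha in [0, 1]; alpha = 0 reduces to plain round-to-nearest."""
    best = None
    for step in range(grid + 1):
        alpha = step / grid
        s = awq_scales(inputs, alpha)
        # Scale salient channels up, quantize, then divide the scale back out.
        # In deployment the 1/s factor is fused into the preceding operation.
        scaled = [[w * sj for w, sj in zip(row, s)] for row in weights]
        deq = [[q / sj for q, sj in zip(quantize_row(row), s)] for row in scaled]
        err = output_error(weights, deq, inputs)
        if best is None or err < best[0]:
            best = (err, alpha, deq)
    return best


# Toy layer: channel 1 sees large activations but holds small weights,
# so rounding it to zero is costly for the layer output.
weights = [[0.9, -0.05, 0.3, 0.02], [-0.4, 0.07, 0.8, -0.03]]
inputs = [[0.1, 8.0, 0.2, 0.1], [-0.2, -6.0, 0.1, 0.3], [0.1, 7.0, -0.3, -0.2]]

best_err, best_alpha, dequantized = awq_quantize(weights, inputs)
rtn_err = output_error(weights, [quantize_row(r) for r in weights], inputs)
print("round-to-nearest error:", rtn_err)
print("AWQ-style error:", best_err, "at alpha =", best_alpha)
```

Because the search includes alpha = 0, the selected result can never be worse than plain round-to-nearest on the calibration inputs; the gain comes from channels whose activations are large relative to their weights.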
Quality comparison
On standard LLM benchmarks, AWQ scores about 0.5% higher than GPTQ on average, with larger gaps on some tasks (long-context, reasoning). Both stay within 2% of the FP16 baseline.
Inference speed
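AWQ checkpoints tend to get faster inference kernels: the format uses plain group-wise scales with no activation reordering, whereas GPTQ checkpoints quantized with act-order need an extra group-index lookup per input channel. On GPU, AWQ is typically 10-30% faster, though the gap depends on the kernel, the batch size, and the GPTQ settings. For latency-critical serving, AWQ is the better default in 2026.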
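One concrete source of kernel overhead is the group-index lookup. The sketch below dequantizes a single row of 4-bit weights in two ways: with contiguous groups, where the group of channel i is simply i // group_size, and with an explicit per-channel group index of the kind GPTQ checkpoints carry when activation reordering (act-order) is enabled. It is a scalar Python illustration of the memory access pattern, not a real GPU kernel, and the function and argument names are chosen for this example.

```python
def dequantize_row(qweights, scales, zeros, group_size, g_idx=None):
    """Dequantize one row of 4-bit integer weights.

    Without g_idx, channel i belongs to group i // group_size, so scales and
    zero points are read contiguously. With g_idx, every channel needs an
    extra indexed lookup to find its group.
    """
    out = []
    for i, q in enumerate(qweights):
        g = g_idx[i] if g_idx is not None else i // group_size
        out.append((q - zeros[g]) * scales[g])
    return out


qweights = [8, 9, 7, 15]   # 4-bit codes in [0, 15]
scales = [0.5, 2.0]        # one scale per group
zeros = [8, 8]             # one zero point per group

# Contiguous groups: channels 0-1 use group 0, channels 2-3 use group 1.
print(dequantize_row(qweights, scales, zeros, group_size=2))
# -> [0.0, 0.5, -2.0, 14.0]

# Reordered groups: the mapping is arbitrary and must be looked up per channel.
print(dequantize_row(qweights, scales, zeros, group_size=2, g_idx=[1, 0, 0, 1]))
# -> [0.0, 0.5, -0.5, 14.0]
```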
Tooling notes
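vLLM, TensorRT-LLM, and lmdeploy all support both formats. In the Hugging Face ecosystem, Transformers loads checkpoints in both formats, and Optimum provides the GPTQ quantizer. For calibration, use data from your own domain rather than a generic corpus such as C4; about 1000 samples is enough.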
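Assembling the calibration set might look like the sketch below. It assumes the domain corpus is already available as a list of strings; the helper name, the minimum-length filter, and the fixed seed are illustrative choices, not requirements of any quantization toolkit.

```python
import random


def build_calibration_set(documents, n_samples=1000, min_chars=200, seed=0):
    """Pick a reproducible random subset of domain documents for calibration.

    Very short documents are dropped because they contribute few tokens.
    """
    candidates = [d.strip() for d in documents if len(d.strip()) >= min_chars]
    if len(candidates) <= n_samples:
        return candidates
    return random.Random(seed).sample(candidates, n_samples)


# Hypothetical corpus: 5000 long domain documents plus one that is too short.
corpus = ["domain text " * 30 + str(i) for i in range(5000)] + ["too short"]
calibration = build_calibration_set(corpus)
print(len(calibration))  # 1000
```

The resulting texts would then be tokenized and passed to whichever quantization tool is in use; fixing the seed keeps the calibration set, and therefore the quantized checkpoint, reproducible across runs.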