INT4 quantization saves memory; INT4 inference saves time only if the GPU can multiply 4-bit operands efficiently. The kernel story has improved dramatically: the 2023 view that 'INT4 inference is just dequantize-then-FP16-matmul' is increasingly outdated, as faster kernels and native 4-bit matmul arrive on supported hardware.
Dequantize-on-load (the old way)
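Weights are stored as INT4, loaded into on-chip SRAM, dequantized to FP16, then multiplied with an ordinary FP16 matmul. This saves memory bandwidth (good) but not compute. ~2-3x speedup for memory-bound layers (e.g. small-batch decode); little speedup for compute-bound ones (e.g. large-batch prefill).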
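The storage-then-dequantize path described above can be sketched in plain Python. This is only an illustration under stated assumptions: real kernels do the unpacking in GPU registers or shared memory, and the function names, the low-nibble-first packing order, and the single per-row scale are all hypothetical choices, not taken from any particular library.

```python
# Sketch only: INT4 storage and dequantize-on-load for one weight row.
# Names and nibble order are illustrative assumptions, not a library API.

def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit integers."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values

def dequant_dot(packed_row, scale, activations):
    """Dequantize the row to floats, then do an ordinary dot product."""
    weights = [q * scale for q in unpack_int4(packed_row)]
    return sum(w * x for w, x in zip(weights, activations))

row = [-8, 7, 3, -1]
packed = pack_int4(row)          # 4 weights stored in 2 bytes
result = dequant_dot(packed, 0.5, [1.0, 2.0, 3.0, 4.0])
```

Note that the multiply-accumulate still runs on full-precision values; only the bytes moved from memory shrink, which is why the gain shows up mainly in memory-bound layers.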
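Native 4-bit tensor cores
NVIDIA Blackwell (B100/B200) has native 4-bit tensor cores, but the format is floating-point FP4 (NVFP4/MXFP4) rather than integer INT4. AMD's MI350 series likewise adds FP4 support. ~4x throughput vs FP16 on supported ops. For models quantized to these formats, the full 4-bit speed-up finally arrives in hardware.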
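Whether 4-bit storage or 4-bit math is the lever depends on whether a layer is memory-bound or compute-bound. A rough roofline-style estimate makes the distinction concrete. The peak-throughput and bandwidth figures below are placeholders chosen for illustration, not the specs of any real GPU, and the model ignores activation traffic, scale metadata, and dequantization overhead.

```python
# Sketch only: roofline-style estimate for one weight matmul
# (M tokens times a K x N weight matrix). Hardware numbers are hypothetical.

def matmul_times(m, k, n, weight_bits, peak_flops, mem_bw_bytes):
    """Return (compute_seconds, memory_seconds), counting weight traffic only."""
    flops = 2 * m * k * n
    weight_bytes = k * n * weight_bits / 8
    return flops / peak_flops, weight_bytes / mem_bw_bytes

def bound(m, k, n, weight_bits, peak_flops, mem_bw_bytes):
    """Classify the layer by whichever time dominates."""
    compute_t, memory_t = matmul_times(m, k, n, weight_bits, peak_flops, mem_bw_bytes)
    return "compute-bound" if compute_t > memory_t else "memory-bound"

PEAK_FP16 = 1.0e15  # assumption: 1 PFLOP/s of FP16 math
MEM_BW = 4.0e12     # assumption: 4 TB/s of memory bandwidth

decode = bound(1, 8192, 8192, 16, PEAK_FP16, MEM_BW)      # one token at a time
prefill = bound(4096, 8192, 8192, 16, PEAK_FP16, MEM_BW)  # long prompt

# Upper bound on the decode speedup from shrinking weights from 16 to 4 bits.
_, fp16_mem = matmul_times(1, 8192, 8192, 16, PEAK_FP16, MEM_BW)
_, int4_mem = matmul_times(1, 8192, 8192, 4, PEAK_FP16, MEM_BW)
max_decode_speedup = fp16_mem / int4_mem
```

Under these assumptions the single-token case is memory-bound and benefits from smaller weights alone, while the long-prompt case is compute-bound and only speeds up if the hardware can do the math itself in 4-bit. The 4x figure is an idealized ceiling; measured gains are lower.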
Marlin / GPTQ-Marlin kernels
Highly optimized open-source kernels for FP16 x INT4 matmul on Ampere/Hopper. They still dequantize on the fly, but are specialized at compile time for specific tile sizes. ~30-50% faster than naive dequantize-then-matmul on the same hardware. Standard in vLLM and SGLang.
Quantization format matters
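Per-channel scale: easy on hardware, slight quality cost. Group-wise (one scale per group of 128 weights along the input dimension): better quality, slightly more metadata and dequantization work. AWQ's activation-aware scaling plus group-wise quantization plus Marlin kernels = current quality/speed sweet spot.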
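The quality difference between one scale per channel and one scale per group can be seen on a toy example. The sketch below assumes symmetric INT4 quantization with no zero-point and FP16 scales; the helper names and the synthetic weight row (small values plus a single outlier) are made up for illustration and do not reproduce AWQ or GPTQ.

```python
# Sketch only: symmetric INT4 round-trip error, one scale per row vs. per group.
# Helper names and the toy data are illustrative assumptions.

def quantize(block):
    """Symmetric INT4: one scale per block, integers clamped to [-8, 7]."""
    max_abs = max(abs(w) for w in block)
    scale = max_abs / 7 if max_abs else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in block]
    return q, scale

def roundtrip_mse(row, group_size):
    """Mean squared error after quantize then dequantize with the given group size."""
    err = 0.0
    for start in range(0, len(row), group_size):
        block = row[start:start + group_size]
        q, scale = quantize(block)
        err += sum((w - qi * scale) ** 2 for w, qi in zip(block, q))
    return err / len(row)

# Toy weight row: small values everywhere, plus one large outlier.
row = [0.01 * ((i % 13) - 6) for i in range(512)]
row[5] = 2.0

per_channel_mse = roundtrip_mse(row, group_size=len(row))  # one scale for the row
group_wise_mse = roundtrip_mse(row, group_size=128)        # one scale per 128 weights

# Storage cost of group-wise scales, assuming one FP16 scale per group.
bits_per_weight = 4 + 16 / 128
```

With a single scale, the outlier stretches the quantization grid so far that every small weight rounds to zero. With groups of 128, only the group containing the outlier pays that price, at a cost of about 0.125 extra bits per weight under the FP16-scale assumption.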
What this means for serving
Pre-2024: INT4 was a memory-fit hack. 2025+: 4-bit is the cost-optimal inference path for many workloads. Re-evaluate your serving stack annually, because kernel and hardware improvements keep shifting the picture.