GPU Kernels for INT4 Inference

INT4 quantization saves memory; INT4 inference saves time only if the kernel turns those smaller weights into less memory traffic or faster math. The kernel story has improved dramatically: 2023's 'INT4 inference is just dequantize-then-FP16-matmul' is increasingly being replaced by fused 4-bit kernels and, on supported hardware, native 4-bit tensor-core math.

Dequantize-on-load (the old way)

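Weights are stored as INT4, loaded into on-chip memory (shared memory or registers), dequantized to FP16, and then multiplied with an ordinary FP16 matmul. This cuts weight memory traffic by roughly 4x but leaves the compute cost unchanged: expect ~2-3x speedup on memory-bound layers (small-batch decoding, for example) and little speedup on compute-bound ones (large-batch prefill).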
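The data path above can be sketched in a few lines of plain Python. This is an illustration only, not how a real GPU kernel is written: the nibble layout (low nibble first), the one-scale-per-row scheme, and the helper names `pack_int4`, `unpack_int4`, and `dequant_matvec` are all assumptions made for the example.

```python
def pack_int4(values):
    """Pack signed 4-bit ints (-8..7) two per byte, low nibble first (assumed layout)."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed):
    """Inverse of pack_int4: recover signed 4-bit ints from packed bytes."""
    vals = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            vals.append(nibble - 16 if nibble >= 8 else nibble)  # sign-extend
    return vals


def dequant_matvec(packed_rows, scales, x):
    """Compute y = W @ x where each row of W is packed INT4 plus one scale.

    The dequantized row exists only transiently, mirroring a fused kernel
    that keeps it on-chip instead of writing FP16 weights back to memory.
    """
    y = []
    for packed, scale in zip(packed_rows, scales):
        w = [q * scale for q in unpack_int4(packed)]      # dequantize
        y.append(sum(wi * xi for wi, xi in zip(w, x)))    # higher-precision dot product
    return y


rows = [[7, -8, 3, 0], [1, 2, -3, 4]]          # quantized weights, 4 bits each
packed_rows = [pack_int4(r) for r in rows]     # 2 bytes per row instead of 8 in FP16
scales = [0.5, 0.25]                           # one scale per row (assumption)
x = [1.0, 2.0, 3.0, 4.0]
y = dequant_matvec(packed_rows, scales, x)
```

The bytes read per row drop from 8 (FP16) to 2, while the multiply-adds are the same as before. That is why this approach helps when memory traffic is the bottleneck and not otherwise.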

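Native 4-bit tensor cores

NVIDIA Blackwell (B100/B200) adds native 4-bit tensor-core math, though in floating-point FP4 formats (such as NVFP4 and MXFP4) rather than integer INT4, so weights must be in an FP4 format to use it. AMD's MI350 series is similar. Peak throughput is ~4x FP16 on supported ops, so the full 4-bit speedup finally reaches compute-bound layers as well.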
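A simple roofline estimate shows why a higher compute peak matters where dequantize-on-load does not help. The sketch below takes a layer's time as the larger of its compute time and its memory time. The hardware numbers are hypothetical round figures rather than any vendor's spec, and `matmul_time_s` and `speedup` are names invented for this example.

```python
def matmul_time_s(m, n, k, weight_bits, peak_flops, mem_bw, act_bytes=2):
    """Roofline time for an (m x k) @ (k x n) matmul: max(compute time, memory time)."""
    flops = 2 * m * n * k
    weight_bytes = k * n * weight_bits / 8
    act_io_bytes = (m * k + m * n) * act_bytes   # read activations, write outputs
    return max(flops / peak_flops, (weight_bytes + act_io_bytes) / mem_bw)


# Hypothetical accelerator (assumed values, for illustration only).
PEAK_FP16 = 1e15            # FLOP/s
PEAK_4BIT = 4 * PEAK_FP16   # native 4-bit math at ~4x the FP16 rate
MEM_BW = 4e12               # bytes/s


def speedup(m, peak_for_4bit_path):
    """Speedup of 4-bit weights over FP16 weights for an 8192 x 8192 layer."""
    base = matmul_time_s(m, 8192, 8192, 16, PEAK_FP16, MEM_BW)
    quant = matmul_time_s(m, 8192, 8192, 4, peak_for_4bit_path, MEM_BW)
    return base / quant


decode_dequant = speedup(1, PEAK_FP16)      # memory-bound: close to 4x (ideal upper bound)
prefill_dequant = speedup(4096, PEAK_FP16)  # compute-bound: no gain from smaller weights
prefill_native = speedup(4096, PEAK_4BIT)   # compute-bound: gain comes from the higher peak
```

The model ignores dequantization overhead, scale loads, and kernel launch costs, which is why measured memory-bound speedups land below the ideal 4x.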

Marlin / GPTQ-Marlin kernels

Highly optimized open-source kernels for INT4-weight matmul on Ampere/Hopper. They are specialized for specific tile sizes, run ~30-50% faster than naive dequantize-then-matmul on the same hardware, and are standard in vLLM and SGLang.

Quantization format matters

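Per-channel scales (one per output channel) are easy on hardware but cost a little quality. Group-wise scales (for example, one per group of 128 weights along the input dimension) give better quality at the price of extra scale storage and compute. AWQ's activation-aware scaling plus group-wise scales plus Marlin is the current quality/speed sweet spot.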
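The quality/overhead trade-off between the two scale granularities can be seen on a toy example. The sketch below assumes symmetric absmax quantization to the range -8..7 and FP16 scales; real GPTQ and AWQ pipelines choose scales more carefully, and the function names here are invented for illustration.

```python
def quantize_dequantize(values, group_size):
    """Symmetric absmax INT4 round trip with one scale per `group_size` values."""
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        scale = max(abs(v) for v in group) / 7 or 1.0   # fall back to 1.0 for all-zero groups
        for v in group:
            q = max(-8, min(7, round(v / scale)))        # clamp to the signed 4-bit range
            out.append(q * scale)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def bits_per_weight(group_size, scale_bits=16):
    """Effective storage per weight: 4 bits plus the amortized scale."""
    return 4 + scale_bits / group_size


# One weight row whose second half has much larger magnitudes than the first.
row = [0.1, -0.2, 0.3, 0.05, 4.0, -3.0, 2.0, 1.0]

err_per_channel = mse(row, quantize_dequantize(row, group_size=len(row)))  # one scale for the row
err_group_wise = mse(row, quantize_dequantize(row, group_size=4))          # one scale per 4 weights
```

With a single scale, the large values set the step size and the small values in the first half are rounded coarsely. With a scale per group, the first half gets its own, finer step, so the error drops. The cost is the extra scale storage: at a group size of 128 with FP16 scales, `bits_per_weight(128)` comes to 4.125 bits.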

What this means for serving

Before 2024, INT4 was a memory-fit hack. From 2025 on, it is the cost-optimal inference path for many workloads. Re-evaluate your serving stack annually, because kernel improvements keep shifting the picture.
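For the memory-fit side of that decision, a back-of-the-envelope calculation is often enough. The sketch below counts weight storage only, ignoring KV cache, activations, and runtime overhead. The 70B parameter count and the 4.125 bits per weight (4-bit weights with one FP16 scale per 128 weights) are example assumptions, and `weight_gb` is a hypothetical helper.

```python
def weight_gb(num_params, bits_per_weight):
    """Weight storage in GB (10^9 bytes), excluding KV cache and activations."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 70e9                            # example model size (assumption)

fp16_gb = weight_gb(NUM_PARAMS, 16)          # FP16 baseline
int4_gb = weight_gb(NUM_PARAMS, 4.125)       # 4-bit weights + FP16 scale per 128 weights
ratio = fp16_gb / int4_gb                    # a little under 4x once scales are counted
```

Whether the smaller footprint also lowers cost per token depends on the kernels available for your hardware, which is the part worth re-checking each year.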

Use Marlin kernels on Ampere/Hopper and native 4-bit (FP4) tensor cores on Blackwell and later. 4-bit inference is no longer just memory savings; it's the speed default.