Quantization Layouts in Memory

Quantized weights are not just 'small floats'. They're packed INT4/INT8 values with a separate scale (and sometimes a zero-point) per group. The layout determines load speed, compute speed, and accuracy. Knowing the formats lets you pick the right quant for your hardware.

Per-tensor vs per-channel vs per-group

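Per-tensor: one scale for the whole tensor. Cheapest to store and apply, but a single outlier stretches the scale for every weight, so quality is lowest. Per-channel: one scale per output channel (one row of the weight matrix in an [out, in] layout). Most common for INT8. Per-group: one scale per N consecutive elements (e.g., 32, 64, 128). Smaller groups = better quality but more scale storage: an fp16 scale per 32 weights adds 0.5 bits/weight, while one per 128 weights adds 0.125.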
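To make the trade-off concrete, here is a minimal sketch that quantizes the same synthetic weight matrix to INT4 at the three granularities and compares the reconstruction error. The matrix, the helper names (`fake_quantize`, `mean_abs_error`), and the symmetric absmax rounding scheme are illustrative assumptions, not the scheme of any particular format.

```python
import random

def fake_quantize(values, group_size, bits=4):
    """Quantize then dequantize, using one symmetric absmax scale per group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / qmax if amax > 0 else 1.0
        for v in group:
            q = max(-qmax, min(qmax, round(v / scale)))
            out.append(q * scale)
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
rows, cols, group = 4, 128, 32

# Hypothetical weights, stored row-major: each row has a different magnitude,
# and magnitude also varies between the 32-wide groups inside a row.
weights = []
for r in range(rows):
    for c in range(cols):
        sigma = 0.01 * (4 ** r) * (c // group + 1)
        weights.append(random.gauss(0.0, sigma))

err_tensor = mean_abs_error(weights, fake_quantize(weights, rows * cols))
err_channel = mean_abs_error(weights, fake_quantize(weights, cols))
err_group = mean_abs_error(weights, fake_quantize(weights, group))

# Number of scales each scheme has to store for this matrix.
n_scales = {"per-tensor": 1, "per-channel": rows, "per-group": rows * cols // group}

for name, err in [("per-tensor", err_tensor),
                  ("per-channel", err_channel),
                  ("per-group", err_group)]:
    print(f"{name:12s} scales={n_scales[name]:3d} mean_abs_error={err:.5f}")
```

On data like this, the error should fall as the number of scales rises, which is the quality-versus-storage trade described above.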

Q4_K layout (GGUF)

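# Super-block of 256 weights:
#   8 sub-blocks of 32 weights each
#   1 6-bit scale + 1 6-bit min per sub-block (packed into 12 bytes)
#   2 fp16 super-block factors (d, dmin) that rescale the 6-bit scales and mins
#   weights packed as 4-bit unsigned integers
#
# Total: 256 * 4 bits + 8 * 12 bits + 2 * 16 bits
#      = 128 bytes + 12 bytes + 4 bytes = 144 bytes
#      = 4.5 bits/weight on average

Per-32 sub-block scaling captures local variation. K-quants (Q4_K, Q5_K, Q6_K) keep metadata cheap by quantizing the sub-block scales themselves. In Q4_K, the 6-bit scales and mins plus the two shared fp16 factors cost 16 bytes per block, where 8 fp16 scale/min pairs would cost 32. The final cost is 4.5 bits/weight, with quality typically close to FP16 on most tasks.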
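The 144-byte figure can be checked with a short sketch. The field breakdown assumes the `block_q4_K` struct as defined in llama.cpp (two fp16 super-block factors, 12 bytes of packed 6-bit sub-block scales and mins, 128 bytes of 4-bit quants). The nibble helpers are hypothetical and pack adjacent pairs for simplicity, which is not the exact element order llama.cpp uses.

```python
def pack_nibbles(q):
    """Pack 4-bit integers (0..15) two per byte, low nibble first."""
    if len(q) % 2 != 0 or any(not 0 <= v <= 15 for v in q):
        raise ValueError("need an even number of values in 0..15")
    return bytes(q[i] | (q[i + 1] << 4) for i in range(0, len(q), 2))

def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out

QK_K = 256                        # weights per super-block
SUB_BLOCKS = 8                    # sub-blocks of 32 weights

quant_bytes = QK_K * 4 // 8               # 4-bit quants
scale_bytes = SUB_BLOCKS * (6 + 6) // 8   # 6-bit scale + 6-bit min per sub-block
super_bytes = 2 * 2                       # fp16 d and fp16 dmin

block_bytes = quant_bytes + scale_bytes + super_bytes
bits_per_weight = block_bytes * 8 / QK_K

print(quant_bytes, scale_bytes, super_bytes, block_bytes, bits_per_weight)

# Round-trip one super-block worth of 4-bit values.
values = [i % 16 for i in range(QK_K)]
packed = pack_nibbles(values)
restored = unpack_nibbles(packed)
```

Packing two values per byte is what turns 256 weights into 128 bytes; the remaining 16 bytes are metadata.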

AWQ — activation-aware weight quantization

# 1. Identify 'salient' weight channels (input channels that multiply large activations)
# 2. Scale them up (lower relative quantization error after rounding)
# 3. Scale the corresponding activations down by the inverse factor (output unchanged)
# 4. Quantize the scaled weights to INT4

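A common production format for vLLM and TensorRT-LLM. It tends to have a slight quality edge over Q4_K. Checkpoints store INT4 weights with group-wise scales and zero-points (a group size of 128 is typical); the per-channel activation-aware scaling is applied before quantization. Dedicated kernels such as Marlin provide fast INT4·FP16 matmul.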
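The four steps can be illustrated with a toy sketch. It assumes a single salient input channel, a fixed hand-picked scaling factor of 2, and one absmax scale per weight row. Real AWQ searches for the factors on calibration data and usually folds the activation division into the preceding operation, so treat the names and numbers here as hypothetical.

```python
import random

def quantize_row(row, bits=4):
    """Symmetric absmax INT4 quantize-dequantize of one weight row."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(w) for w in row)
    scale = amax / qmax if amax > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
n_in, n_out = 32, 64
weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

# Step 1: one input channel sees a much larger activation than the rest.
x = [random.uniform(-1, 1) for _ in range(n_in)]
salient = 3
x[salient] = 50.0

# Steps 2 and 3: scale the salient weight channel up, the activation down.
s = [1.0] * n_in
s[salient] = 2.0  # assumed factor, chosen by hand for this sketch
x_scaled = [xi / si for xi, si in zip(x, s)]

err_plain = err_awq = 0.0
max_fp_diff = 0.0
for row in weights:
    ref = dot(row, x)
    scaled_row = [w * si for w, si in zip(row, s)]
    # Before quantization the rescaling leaves the output unchanged.
    max_fp_diff = max(max_fp_diff, abs(dot(scaled_row, x_scaled) - ref))
    # Step 4: quantize, with and without the rescaling.
    err_plain += abs(dot(quantize_row(row), x) - ref)
    err_awq += abs(dot(quantize_row(scaled_row), x_scaled) - ref)

err_plain /= n_out
err_awq /= n_out
print(f"plain INT4 output error:    {err_plain:.4f}")
print(f"rescaled INT4 output error: {err_awq:.4f}")
```

The rescaling costs nothing in full precision, but after rounding, the channel that dominates the output is represented more accurately.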

GPTQ — second-order quantization

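Quantizes weights one column at a time and uses inverse-Hessian information (estimated from calibration activations) to adjust the not-yet-quantized weights so they compensate for the error just introduced. Heavier calibration than AWQ for roughly equivalent quality. Less common in 2026 inference, where AWQ has largely won.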

Loading and compute layout

# Stored: packed INT4 with separate scales
# At inference time:
#   Option A: dequantize to FP16, then matmul (simpler)
#   Option B: fused INT4·FP16 matmul kernel (faster)
#
# Marlin (Ampere GPU): option B, ~2x faster
# llama.cpp (CPU): option B, activations quantized to INT8 on the fly,
#   integer dot products via SIMD (e.g., AVX2, AVX-512)

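Option A wastes memory bandwidth: the dequantized FP16 weights are written out and read back, which gives up most of the benefit of storing 4-bit weights. Option B unpacks the weights inside the matmul itself and is what real inference engines use. Writing those kernels is the kernel authors' job; you just pick a model in the format your engine supports.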
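The difference in data flow between the two options can be sketched in plain Python. This shows only where the unpacking happens, not the speed difference, which depends on doing the work in registers or SIMD lanes. The group size of 32, the offset-by-8 nibble encoding, and the function names are assumptions made for the sketch.

```python
import random

GROUP = 32  # weights per scale (assumed)

def quantize_row(row):
    """Return (packed INT4 bytes, per-group scales) for one weight row."""
    packed, scales = bytearray(), []
    for g in range(0, len(row), GROUP):
        group = row[g:g + GROUP]
        amax = max(abs(w) for w in group)
        scale = amax / 7 if amax > 0 else 1.0
        # Signed values -7..7 are stored with an offset of 8 so they fit in a nibble.
        q = [max(-7, min(7, round(w / scale))) + 8 for w in group]
        packed.extend(q[i] | (q[i + 1] << 4) for i in range(0, GROUP, 2))
        scales.append(scale)
    return bytes(packed), scales

def dot_dequant_first(packed, scales, x):
    """Option A: materialize the full float weight row, then multiply."""
    weights = []
    for i, byte in enumerate(packed):
        scale = scales[(2 * i) // GROUP]
        weights.append(((byte & 0x0F) - 8) * scale)
        weights.append(((byte >> 4) - 8) * scale)
    return sum(w * xi for w, xi in zip(weights, x))

def dot_fused(packed, scales, x):
    """Option B: unpack nibbles inside the accumulation loop, scale once per group."""
    total = 0.0
    bytes_per_group = GROUP // 2
    for g, scale in enumerate(scales):
        acc = 0.0
        for i in range(g * bytes_per_group, (g + 1) * bytes_per_group):
            byte = packed[i]
            acc += ((byte & 0x0F) - 8) * x[2 * i]
            acc += ((byte >> 4) - 8) * x[2 * i + 1]
        total += acc * scale
    return total

random.seed(0)
row = [random.uniform(-1, 1) for _ in range(64)]
x = [random.uniform(-1, 1) for _ in range(64)]

packed, scales = quantize_row(row)
result_a = dot_dequant_first(packed, scales, x)
result_b = dot_fused(packed, scales, x)
print(result_a, result_b)
```

Both paths compute the same dot product up to floating-point rounding. The fused version never builds the intermediate float weight list, which is the part that costs memory traffic in option A.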

Q4_K = 256-weight blocks with per-sub-block scales, ~4.5 bits/weight. AWQ rescales salient weight channels before INT4 quantization. Fused INT4 kernels avoid the dequantization cost.