SmoothQuant is one of those papers where understanding the trick saves you from misunderstanding ten other quantization papers. The core insight is an equivalence: dividing activations by a per-channel scale and multiplying the weights by the same scale leaves the layer's output unchanged. That lets you turn 'impossible-to-quantize' activations into 'easy-to-quantize' ones.

The problem with activation outliers

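LLM activations have a small number of channels whose magnitudes are far larger than the rest (often around 100x). With per-tensor INT8 quantization, a single scale has to cover those outliers, so every other channel is squeezed into a handful of quantization levels and quality collapses; clipping the outliers instead throws away large values the model relies on. Per-channel activation quantization would fix the range problem, but the channel dimension is the one the matmul sums over, so per-channel scales cannot be factored out of an INT8 GEMM. That makes it computationally awkward at runtime.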
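To make the range problem concrete, here is a minimal NumPy sketch. The tensor shape, the choice of channel 3 as the outlier, and the helper name `quantize_int8_per_tensor` are all made up for illustration; it uses simple symmetric "fake quantization" (quantize, then dequantize back to float) rather than any particular library's scheme.

```python
import numpy as np

def quantize_int8_per_tensor(x):
    """Symmetric per-tensor INT8: one scale for the whole tensor, dequantized back to float."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))        # 64 tokens, 16 channels (toy sizes)
x_outlier = x.copy()
x_outlier[:, 3] *= 100.0             # hypothetical outlier channel, ~100x larger

# Measure the error only on the "normal" channels, with and without the outlier present.
normal = [c for c in range(16) if c != 3]
err_plain = np.abs(quantize_int8_per_tensor(x) - x)[:, normal].mean()
err_outlier = np.abs(quantize_int8_per_tensor(x_outlier) - x_outlier)[:, normal].mean()

print(f"mean abs error on normal channels, no outlier:   {err_plain:.4f}")
print(f"mean abs error on normal channels, with outlier: {err_outlier:.4f}")
```

The normal channels did not change between the two runs; only the shared scale did. Once the outlier stretches that scale, most normal values round to zero.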

The mathematical identity

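In the matrix multiplication Y = X·W, you can divide each input channel (column) j of X by a per-channel scale S_j and multiply the matching row j of W by the same S_j without changing the result: Y = (X · diag(S)^-1) · (diag(S) · W). The trick: pick S so the scaled activations have smaller outliers (easier to quantize), even though the scaled weights have a wider range than before. Weights start out fairly uniform and are quantized once, offline, so they can absorb the extra range and are still easy to quantize.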
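The identity is easy to check numerically. The sketch below uses random toy matrices and an arbitrary positive scale vector (none of these values come from a real model); the only point is that the two products agree up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))          # (tokens, C_in)
W = rng.normal(size=(8, 5))          # (C_in, C_out)
S = rng.uniform(0.5, 4.0, size=8)    # one positive scale per input channel

Y = X @ W
# Divide column j of X by S[j], multiply row j of W by S[j].
Y_smoothed = (X / S) @ (S[:, None] * W)

print(np.allclose(Y, Y_smoothed))
```

Broadcasting does the work here: `X / S` scales columns of X, and `S[:, None] * W` scales rows of W, which is exactly the diag(S) form of the identity.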

How to pick S

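S_j = max(|X_j|)^α / max(|W_j|)^(1-α), where X_j is activation channel j and W_j is the matching row (input channel j) of W. α is the migration strength: α = 0.5 is the typical choice and splits the difficulty evenly between activations and weights, while a larger α pushes more of it onto the weights. Calibrate with ~512 samples to find the max absolute activation per channel.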
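A minimal sketch of that formula in NumPy, assuming weights are stored as (C_in, C_out) and calibration activations arrive as batches of shape (tokens, C_in). The function names `calibrate_absmax` and `smoothing_scales`, and the small `eps` guard against division by zero, are illustrative choices rather than any library's API.

```python
import numpy as np

def calibrate_absmax(batches):
    """Running per-channel max |x| over calibration batches of shape (tokens, C_in)."""
    absmax = None
    for x in batches:
        cur = np.abs(x).max(axis=0)
        absmax = cur if absmax is None else np.maximum(absmax, cur)
    return absmax

def smoothing_scales(act_absmax, W, alpha=0.5, eps=1e-5):
    """S_j = max|X_j|^alpha / max|W_j|^(1 - alpha), with W_j the j-th row of W."""
    w_absmax = np.abs(W).max(axis=1)             # per input channel
    return (np.maximum(act_absmax, eps) ** alpha
            / np.maximum(w_absmax, eps) ** (1 - alpha))

# Toy example: channel 0 is a 100x activation outlier, both weight rows peak at 1.
batches = [np.array([[100.0, 1.0], [-40.0, 0.5]]), np.array([[60.0, -1.0]])]
W = np.array([[1.0, -0.5], [0.25, 1.0]])

act_absmax = calibrate_absmax(batches)           # per-channel maxima: 100 and 1
S = smoothing_scales(act_absmax, W)              # scales: 10 and 1
print(S)
```

With α = 0.5 the smoothed activation maximum (100 / 10) and the smoothed weight maximum (1 × 10) land on the same value for the outlier channel, which is what "splitting the difficulty" means in practice.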

Why this preserves quality

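Mathematical equivalence: in floating point the results are identical up to rounding. Quantization error is reduced because both the scaled X and the scaled W now have well-bounded magnitudes, so neither tensor wastes most of its INT8 range on a few channels. Per-tensor INT8 quantization becomes viable, and it runs on standard INT8 matmul kernels; no special hardware needed.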
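A rough way to see the error reduction is to fake-quantize both operands per-tensor, with and without smoothing, and compare against the float result. Everything below is a toy setup under stated assumptions: random data, three hand-picked outlier channels at 100x, α = 0.5, and calibration statistics taken from the same X that is evaluated. It is a sketch, not a benchmark.

```python
import numpy as np

def fake_quant_int8(t):
    """Symmetric per-tensor INT8 quantize-dequantize."""
    scale = np.abs(t).max() / 127.0
    return np.clip(np.round(t / scale), -127, 127) * scale

def rel_err(Y, Y_ref):
    return np.linalg.norm(Y - Y_ref) / np.linalg.norm(Y_ref)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))                  # (tokens, C_in)
X[:, [5, 20, 41]] *= 100.0                      # hypothetical outlier channels
W = 0.1 * rng.normal(size=(64, 32))             # (C_in, C_out)
Y_ref = X @ W

# Baseline: quantize X and W directly.
err_plain = rel_err(fake_quant_int8(X) @ fake_quant_int8(W), Y_ref)

# SmoothQuant-style: rescale first, then quantize both operands.
alpha = 0.5
S = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
err_smooth = rel_err(fake_quant_int8(X / S) @ fake_quant_int8(S[:, None] * W), Y_ref)

print(f"relative error without smoothing: {err_plain:.4f}")
print(f"relative error with smoothing:    {err_smooth:.4f}")
```

The exact numbers depend on the random seed and the toy sizes, but the smoothed version should come out clearly lower, because neither tensor has to stretch its single scale over a 100x outlier.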

Where SmoothQuant fits

The weight scaling is applied once, offline, at quantization time. The matching activation scaling (the division by S) is folded into the preceding LayerNorm's scale and bias parameters, so there is no runtime overhead. It is a standard step in INT8 quantization pipelines such as LLM-Compressor and optimum-intel.
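The folding step can be sketched in plain NumPy. This assumes the linear layer is fed directly by a LayerNorm with learnable scale `gamma` and bias `beta`; the `layer_norm` helper and all values are illustrative, and real pipelines do the equivalent on the model's own parameter tensors.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Plain LayerNorm over the last axis with learnable scale and bias."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))                 # hidden states entering LayerNorm
gamma = rng.uniform(0.5, 2.0, size=8)
beta = rng.normal(size=8)
W = rng.normal(size=(8, 5))                 # (C_in, C_out) of the following linear layer
S = rng.uniform(0.5, 4.0, size=8)           # smoothing scales, one per channel

y_ref = layer_norm(h, gamma, beta) @ W

# One-time, offline rewrite: LayerNorm now emits X / S directly, W absorbs S.
gamma_folded, beta_folded = gamma / S, beta / S
W_folded = S[:, None] * W
y_folded = layer_norm(h, gamma_folded, beta_folded) @ W_folded

print(np.allclose(y_ref, y_folded))
```

Because the division by S lives inside parameters that are already applied at inference, the smoothed model runs the same operations as the original one.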

Scale activations down, scale weights up: it's a math identity. Both become quantizable. No runtime cost. Standard preprocessing for INT8.