SmoothQuant is one of those papers where understanding the trick saves you from misunderstanding ten other quantization papers. The core insight is that scaling activations down and scaling weights up by the same per-channel factor leaves a linear layer's output unchanged, which lets you turn 'impossible-to-quantize' activations into 'easy-to-quantize' ones.
The problem with activation outliers
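LLM activations have a small number of channels with much larger magnitudes than the rest (often 100x). With per-tensor INT8 quantization, those outliers set the quantization range (or get clipped if the range is tightened), so the remaining channels are squeezed into a handful of integer levels and quality collapses. Per-channel activation quantization would avoid this, but the per-channel scales sit on the inner (summed) dimension of the matmul and cannot be factored out of an INT8 GEMM, which makes it computationally awkward at runtime.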
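To make the outlier problem concrete, here is a minimal NumPy sketch. The data is synthetic and the shapes, the outlier channel index, and the 100x factor are illustrative assumptions, not measurements from a real model. It quantizes a toy activation tensor with one per-tensor scale and counts how many distinct INT8 levels the ordinary channels end up using.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations (assumed shapes): 1024 tokens, 16 channels,
# with channel 3 made ~100x larger than the rest.
X = rng.normal(size=(1024, 16))
OUTLIER = 3
X[:, OUTLIER] *= 100.0

# Symmetric per-tensor INT8: one scale for the whole tensor,
# set by the largest absolute value.
scale = np.abs(X).max() / 127.0
X_int = np.clip(np.round(X / scale), -127, 127).astype(np.int8)

# Count how many distinct integer levels each kind of channel actually uses.
typical = np.delete(X_int, OUTLIER, axis=1)
levels_typical = np.unique(typical).size
levels_outlier = np.unique(X_int[:, OUTLIER]).size

print(f"step size: {scale:.3f}")
print(f"levels used by typical channels: {levels_typical}")
print(f"levels used by outlier channel:  {levels_outlier}")
```

The ordinary channels collapse onto a few integers around zero while the outlier channel gets most of the INT8 range, which is the resolution loss described above.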
The mathematical identity
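In a matrix multiplication Y = X·W, with X of shape (tokens × input channels) and W of shape (input channels × output channels), you can divide each input channel j of X by a scale S_j and multiply the matching row of W by S_j without changing the result: Y = (X · diag(S)^-1) · (diag(S) · W). The trick: pick S so the scaled activations have smaller outliers (easier to quantize), even though the scaled weights have larger magnitudes than before. Weights start out with a much flatter distribution than activations, and they are quantized once, offline, where per-output-channel scales are also an option, so they can absorb the extra range.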
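The identity is easy to check numerically. The sketch below uses random data and arbitrary positive scales; the shapes and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: 4 tokens, 8 input channels, 3 output channels.
X = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 3))

# One positive scale per input channel (arbitrary values here).
S = rng.uniform(0.5, 4.0, size=8)

X_smooth = X / S            # broadcasting divides column j of X by S[j]
W_smooth = W * S[:, None]   # multiplies row j of W by S[j]

Y_ref = X @ W
Y_smooth = X_smooth @ W_smooth

# The two products agree up to floating-point rounding.
identity_holds = np.allclose(Y_ref, Y_smooth)
print(identity_holds)
```

Any positive S works for the identity itself; the choice of S only matters once quantization enters the picture.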
How to pick S
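S_j = max(|X_j|)^α / max(|W_j|)^(1-α), where max(|X_j|) is the largest absolute activation seen in input channel j over the calibration tokens, and max(|W_j|) is the largest absolute weight that multiplies channel j. α controls how much quantization difficulty migrates from activations to weights: α = 0.5 is the typical default and splits it evenly, while a larger α pushes more onto the weights. Calibrate with ~512 samples to find the per-channel activation maxima.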
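A minimal sketch of the calibration and scale computation, assuming weights are stored as (input channels × output channels) and activations arrive in batches whose last axis is the channel axis. The helper names `calibrate_absmax` and `smoothing_scales` and the small `eps` clamp are hypothetical choices for this example, not any library's API.

```python
import numpy as np

def calibrate_absmax(batches):
    """Running per-channel max |X| over calibration batches (last axis = channels)."""
    absmax = None
    for X in batches:
        cur = np.abs(X).reshape(-1, X.shape[-1]).max(axis=0)
        absmax = cur if absmax is None else np.maximum(absmax, cur)
    return absmax

def smoothing_scales(act_absmax, W, alpha=0.5, eps=1e-5):
    """S_j = max|X_j|^alpha / max|W_j|^(1 - alpha), with W of shape (C_in, C_out)."""
    w_absmax = np.abs(W).max(axis=1)      # max over output channels, per input channel
    act = np.maximum(act_absmax, eps)     # eps guards against dead (all-zero) channels
    w = np.maximum(w_absmax, eps)
    return act**alpha / w ** (1.0 - alpha)

# Usage on synthetic data: 3 batches of (batch, seq, channels).
rng = np.random.default_rng(0)
batches = [rng.normal(size=(2, 5, 8)) for _ in range(3)]
W = rng.normal(size=(8, 3))

act_absmax = calibrate_absmax(batches)
S = smoothing_scales(act_absmax, W, alpha=0.5)
print(S)
```

With α = 0.5 the formula reduces to S_j = sqrt(max|X_j| / max|W_j|), so after smoothing the activation maximum and the weight maximum for each channel are both sqrt(max|X_j| · max|W_j|). That is the sense in which the difficulty is split evenly.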
Why this preserves quality
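Mathematical equivalence: in floating point the smoothed layer gives identical results (up to rounding). Quantization error is reduced because the outlier channels no longer dominate the activation range, so both the smoothed activations and the smoothed weights have magnitudes that a single per-tensor scale can cover. Per-tensor INT8 quantization becomes viable, and it runs on standard INT8 GEMM kernels; no special hardware needed.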
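The effect on quantization error can be simulated with fake quantization (quantize, then dequantize, all in floating point). This is a sketch on synthetic data with one injected outlier channel. The shapes, the outlier factor, and the helper names are assumptions, and for brevity the same tensor is used for calibration and evaluation.

```python
import numpy as np

def fake_quant_int8(T):
    """Symmetric per-tensor INT8 quantize-dequantize, simulated in floating point."""
    scale = np.abs(T).max() / 127.0
    return np.clip(np.round(T / scale), -127, 127) * scale

def rel_error(Y_hat, Y):
    """Relative Frobenius-norm error of Y_hat against the reference Y."""
    return np.linalg.norm(Y_hat - Y) / np.linalg.norm(Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))
X[:, 7] *= 100.0                           # inject one outlier channel
W = rng.normal(scale=0.1, size=(64, 32))
Y = X @ W                                  # floating-point reference

# Naive: per-tensor INT8 on the raw activations and weights.
err_naive = rel_error(fake_quant_int8(X) @ fake_quant_int8(W), Y)

# Smoothed: migrate difficulty with alpha = 0.5, then quantize the same way.
alpha = 0.5
S = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
Xs, Ws = X / S, W * S[:, None]
err_smooth = rel_error(fake_quant_int8(Xs) @ fake_quant_int8(Ws), Y)

print(f"relative error, naive per-tensor INT8:    {err_naive:.4f}")
print(f"relative error, smoothed per-tensor INT8: {err_smooth:.4f}")
```

On this toy setup the smoothed variant should show a clearly lower error. The exact numbers depend on the synthetic data and say nothing about any particular model.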
Where SmoothQuant fits
The weight scaling is applied once, offline, at quantization time. The activation scaling is fused into the preceding LayerNorm's affine parameters (its scale, and its bias if present), so there is no runtime overhead. SmoothQuant is a standard step in INT8 quantization pipelines such as LLM-Compressor and optimum-intel.
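The LayerNorm fusion can be sketched in a few lines. This assumes a plain LayerNorm with elementwise scale `gamma` and bias `beta` feeding directly into the linear layer; the function and variable names are illustrative, not taken from any specific pipeline.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last axis with elementwise affine parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))               # hidden states entering the LayerNorm
gamma = rng.uniform(0.5, 1.5, size=8)
beta = rng.normal(size=8)
W = rng.normal(size=(8, 3))
S = rng.uniform(0.5, 4.0, size=8)         # stand-in smoothing scales

# Original block: LayerNorm followed by the linear layer.
Y_ref = layer_norm(H, gamma, beta) @ W

# Folded block: divide gamma and beta by S, multiply the rows of W by S.
# The division by S is now baked into parameters, so no extra op runs at inference.
gamma_f, beta_f = gamma / S, beta / S
W_f = W * S[:, None]
Y_folded = layer_norm(H, gamma_f, beta_f) @ W_f

folding_is_exact = np.allclose(Y_ref, Y_folded)
print(folding_is_exact)
```

Because the LayerNorm output is `norm(x) * gamma + beta`, dividing it by S is the same as dividing both `gamma` and `beta` by S, which is why the bias has to be rescaled along with the scale parameter.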