INT8 weight quantization is nearly free. INT8 activation quantization is where it gets interesting: activations have outliers that wreck naive quantization quality. SmoothQuant largely solved this for 8-bit weights and activations (W8A8); the technique is now standard, but still worth understanding.
The outlier problem
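LLM activations have a small number of channels whose magnitudes are much larger than the rest (often around 100x), and the same channels tend to be the outliers across tokens. With a single per-tensor INT8 scale you have two bad options: let the outliers set the range, which leaves the normal channels only a handful of effective quantization levels, or clip the outliers. Both hurt quality badly. Per-channel activation quantization would fix the range problem, but the scale would then vary along the matmul's inner (reduction) dimension, so it cannot be factored out of an INT8 GEMM and applied after integer accumulation.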
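To make the outlier problem concrete, here is a minimal NumPy sketch. The tensor shape, the outlier channel index (7), and the 100x factor are illustrative assumptions, not measurements from a real model. It uses absmax scaling, so the outlier is not clipped; instead it sets the scale for every other channel.

```python
import numpy as np

OUTLIER = 7  # hypothetical outlier channel index, chosen for illustration


def quantize_per_tensor_int8(x):
    """Symmetric absmax INT8 quantization with one scale for the whole tensor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def normal_channel_rmse(x):
    """RMS round-trip error measured on every channel except the outlier one."""
    q, scale = quantize_per_tensor_int8(x)
    err = q.astype(np.float64) * scale - x
    mask = np.ones(x.shape[1], dtype=bool)
    mask[OUTLIER] = False
    return float(np.sqrt(np.mean(err[:, mask] ** 2)))


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 512))  # 64 tokens, 512 channels

x_outlier = x.copy()
x_outlier[:, OUTLIER] *= 100.0  # one channel is ~100x larger than the rest

err_clean = normal_channel_rmse(x)
err_outlier = normal_channel_rmse(x_outlier)
print(f"RMSE on normal channels, no outlier:   {err_clean:.4f}")
print(f"RMSE on normal channels, with outlier: {err_outlier:.4f}")
```

The normal channels hold exactly the same values in both runs; only the shared scale changes, and their error grows by more than an order of magnitude.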
SmoothQuant's trick
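Migrate the difficulty from activations (hard) to weights (easy) via a mathematical equivalence: divide each activation input channel j by a factor s_j and multiply the matching weight row by s_j, so that Y = (X diag(s)^-1)(diag(s) W) = XW. The factors are computed offline as s_j = max|X_j|^alpha / max|W_j|^(1-alpha), with alpha = 0.5 a common default, and the division is usually folded into the preceding LayerNorm or linear layer so it adds no runtime cost. Activation outliers are smoothed, and weight quantization handles the new magnitudes fine. The transform itself is exact in floating point; the INT8 quantization that follows still costs a little accuracy, but usually a negligible amount.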
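The trick fits in a few lines of NumPy. This is a sketch under assumptions: synthetic data, a single hypothetical outlier channel, alpha = 0.5, and "fake" quantization (quantize then dequantize in float) rather than a real INT8 kernel. The smoothing factor follows the formula from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1-alpha).

```python
import numpy as np


def fake_quant_int8(x, axis=None):
    """Symmetric absmax INT8 quantize-then-dequantize (result stays in float)."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale


def smoothing_factors(x_absmax, w, alpha=0.5):
    """Per-input-channel factors s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    w_absmax = np.abs(w).max(axis=1)  # one value per input channel (row of W)
    s = x_absmax ** alpha / w_absmax ** (1.0 - alpha)
    return np.maximum(s, 1e-5)  # guard against division by ~0


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 512))          # activations: (tokens, in_channels)
x[:, 7] *= 100.0                            # hypothetical outlier channel
w = rng.standard_normal((512, 256)) * 0.02  # weights: (in_channels, out_channels)
y_ref = x @ w

s = smoothing_factors(np.abs(x).max(axis=0), w)
x_s = x / s           # activations scaled down per input channel
w_s = w * s[:, None]  # matching weight rows scaled up

# The transform is exact in floating point (up to rounding).
equivalent = np.allclose(x_s @ w_s, y_ref)


def w8a8_rel_error(xa, wa):
    """Relative output error: per-tensor activations, per-output-channel weights."""
    y = fake_quant_int8(xa) @ fake_quant_int8(wa, axis=0)
    return float(np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref))


err_naive = w8a8_rel_error(x, w)
err_smooth = w8a8_rel_error(x_s, w_s)
print(f"equivalent={equivalent}  naive={err_naive:.4f}  smoothed={err_smooth:.4f}")
```

With alpha = 0.5 the outlier channel is not flattened completely; its range is split evenly (in log terms) between activations and weights, which is the point: neither side ends up with a range that is too hard to quantize.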
Calibration data choice
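Use ~1000 samples representative of your inference workload. Random web text is an acceptable starting baseline; domain data matters for domain-fine-tuned models. Calibration only needs forward passes to record per-channel activation maxima, with no gradients or weight updates, so it is fast (seconds to minutes, depending on model size) and quick to iterate on.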
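What calibration actually collects is small: a running per-channel max(|activation|) for the input of each linear layer. A minimal sketch follows, assuming a hypothetical `ChannelAbsMax` helper and a synthetic batch generator standing in for real layer inputs. In practice you would capture those inputs from forward passes over your calibration samples, for example with your framework's hook mechanism.

```python
import numpy as np


class ChannelAbsMax:
    """Running per-channel max(|activation|) over calibration batches.

    Hypothetical helper for illustration; not a library API.
    """

    def __init__(self, num_channels):
        self.absmax = np.zeros(num_channels)
        self.tokens_seen = 0

    def update(self, activations):
        # Flatten any leading (batch, sequence) dims down to (tokens, channels).
        flat = activations.reshape(-1, activations.shape[-1])
        self.absmax = np.maximum(self.absmax, np.abs(flat).max(axis=0))
        self.tokens_seen += flat.shape[0]


def synthetic_batches(num_batches, rng):
    """Stand-in for real layer inputs: 32 tokens x 512 channels, one outlier channel."""
    for _ in range(num_batches):
        x = rng.standard_normal((32, 512))
        x[:, 7] *= 100.0  # hypothetical outlier channel
        yield x


rng = np.random.default_rng(0)
stats = ChannelAbsMax(num_channels=512)
for batch in synthetic_batches(8, rng):
    stats.update(batch)

print("tokens seen:", stats.tokens_seen)
print("largest channel:", int(np.argmax(stats.absmax)))
```

Because only a max is tracked, unrepresentative calibration data fails in a specific way: a channel that spikes only on your real workload gets a smoothing factor that is too small.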
Where it sits in the stack
Post-training: applied to an already-trained model, with no retraining. Implemented in optimum-intel, vLLM, and TensorRT-LLM. Activations: INT8 at inference. Weights: INT8, often alongside INT4 or mixed-precision variants.
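At inference time, "INT8 activations, INT8 weights" means the matmul itself runs on integers and the scales are applied afterwards. The sketch below emulates that in NumPy, assuming per-token activation scales and per-output-channel weight scales (one common W8A8 layout; real kernels differ in detail). Both scales sit on the outer dimensions of the matmul, which is why they can be applied after integer accumulation.

```python
import numpy as np


def quantize_rows(x):
    """Per-row (per-token) symmetric INT8 quantization."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # shape (tokens, 1)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def quantize_cols(w):
    """Per-column (per-output-channel) symmetric INT8 quantization."""
    scale = np.abs(w).max(axis=0, keepdims=True) / 127.0  # shape (1, out_channels)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def w8a8_matmul(x, w):
    """Emulated INT8 GEMM: integer accumulate in int32, then rescale to float."""
    xq, sx = quantize_rows(x)
    wq, sw = quantize_cols(w)
    acc = xq.astype(np.int32) @ wq.astype(np.int32)  # pure integer arithmetic
    return acc.astype(np.float64) * sx * sw          # broadcast (T,1) * (1,C_out)


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 128))
w = rng.standard_normal((128, 64)) * 0.02

y = w8a8_matmul(x, w)
y_ref = x @ w
rel_err = float(np.linalg.norm(y - y_ref) / np.linalg.norm(y_ref))
print(f"relative error vs float matmul: {rel_err:.4f}")
```

A scale per activation input channel would sit on the reduction dimension instead and could not be pulled out of the integer sum like this.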
Limits
Doesn't help with quantizing attention scores (which has its own outlier story; an FP8 K/V cache is the modern answer). Doesn't help with very low bit-width activations either: INT4 activations are still hard.