RMSNorm — The Modern Default

RMSNorm is a simplification of LayerNorm: skip the mean subtraction and divide only by the root mean square (RMS) of the activations. It is cheaper to compute, matches LayerNorm quality in practice, and is used by most modern open LLMs (Llama, Mistral, recent Phi models, Qwen). The reduction in math is straightforward.

Forward pass

# Drop the mean centering:
RMS    = sqrt((1/d) * sum(x²) + ε)
output = γ * x / RMS

# γ ∈ ℝ^d is learnable; no β

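Just one statistic (RMS) instead of two (mean + variance), and one scale parameter instead of scale + shift, so there is less arithmetic per element. Roughly 10-15% faster is a reasonable ballpark, though the actual gain depends on hardware, the vector width d, and whether the kernel is fused.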
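The forward pass above can be sketched in a few lines of plain Python. This is a minimal illustration for a single vector, not a production kernel; the function name `rms_norm` and the default `eps=1e-6` are assumptions chosen for the example.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm over one d-dimensional vector, given as plain Python lists."""
    d = len(x)
    # The single statistic: root mean square, with eps inside the square root.
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    # Per-element learnable scale gamma; there is no shift (beta) term.
    return [g * v / rms for g, v in zip(gamma, x)]

x = [1.0, -2.0, 3.0, -4.0]
y = rms_norm(x, gamma=[1.0] * len(x))
print(y)
```

With γ set to all ones, the output keeps the sign pattern of the input and its mean square is very close to 1.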

Why drop the mean?

Zhang & Sennrich (2019) showed empirically that the mean subtraction in LayerNorm contributes little to model quality on transformers; the rescaling by the activation magnitude does the heavy lifting. Dropping the mean gives simpler kernels, faster training and inference, and comparable loss curves.
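To make concrete what dropping the mean changes, the sketch below compares the two norms on one vector. With the same ε, RMSNorm applied to an already mean-centered vector matches LayerNorm with β = 0, so the only difference between them is the centering step. The helper names and the sample values are assumptions for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-6):
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for g, v, b in zip(gamma, x, beta)]

def rms_norm(x, gamma, eps=1e-6):
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

x = [0.5, 2.0, -1.0, 4.5]
ones, zeros = [1.0] * len(x), [0.0] * len(x)
mean = sum(x) / len(x)
centered = [v - mean for v in x]

ln = layer_norm(x, ones, zeros)          # centers, then rescales
rn_centered = rms_norm(centered, ones)   # same result once the input is centered
rn_raw = rms_norm(x, ones)               # differs: the mean is left in place

print(ln)
print(rn_centered)
print(rn_raw)
```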

Backward pass

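# Cleaner than LayerNorm (sums run over the d elements of the vector):
dx = (1 / RMS) * (γ * dy - x * sum(γ * dy * x) / (d * RMS²))
dγ = dy * x / RMS    # accumulated over all tokens in the batch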

Only one reduction term to backprop through (the RMS) vs two for LayerNorm. Slightly cheaper gradient compute.
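A finite-difference check is a cheap way to confirm a hand-derived gradient. The sketch below implements the backward pass for a single vector, with γ applied to the upstream gradient before the reduction, and compares it against central differences. Function names, sample values, and the step size `h` are assumptions for the example.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

def rms_norm_backward(x, gamma, dy, eps=1e-6):
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    g_dy = [g * u for g, u in zip(gamma, dy)]       # gamma * dy, elementwise
    dot = sum(a * v for a, v in zip(g_dy, x))       # the single reduction term
    dx = [(a - v * dot / (d * rms * rms)) / rms for a, v in zip(g_dy, x)]
    dgamma = [u * v / rms for u, v in zip(dy, x)]   # per-token contribution
    return dx, dgamma

def loss(x, gamma, dy):
    # Scalar whose gradient w.r.t. the output y is exactly dy.
    return sum(u * y for u, y in zip(dy, rms_norm(x, gamma)))

x = [1.0, -2.0, 3.0, -4.0]
gamma = [0.5, 1.5, -1.0, 2.0]
dy = [0.1, -0.2, 0.3, 0.4]
dx, dgamma = rms_norm_backward(x, gamma, dy)

h = 1e-5
numeric_dx = []
for j in range(len(x)):
    xp, xm = list(x), list(x)
    xp[j] += h
    xm[j] -= h
    numeric_dx.append((loss(xp, gamma, dy) - loss(xm, gamma, dy)) / (2 * h))

print(max(abs(a - b) for a, b in zip(dx, numeric_dx)))
```

Using a non-uniform γ in the check matters: a formula that pulls γ outside the reduction would still pass when all γ values are equal.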

Adoption picture

Llama (all sizes), Mistral, Phi-3/4, Qwen 2/2.5, Gemma, and DeepSeek all use RMSNorm. BERT and GPT-2 era models used LayerNorm, as do some later models such as Phi-2 and MPT. Newer code defaults to RMSNorm, but a loader that supports older checkpoints needs both code paths.
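A loader that keeps both code paths might dispatch on a field in the model config. The sketch below is hypothetical: the config keys `norm_type` and `norm_eps` and the parameter names `weight` and `bias` are assumptions for illustration, not the schema of any particular checkpoint format.

```python
import math

def rms_norm(x, weight, eps):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def layer_norm(x, weight, bias, eps):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [w * (v - mean) * inv_std + b for w, v, b in zip(weight, x, bias)]

def build_norm(config, params):
    """Return a callable norm; config keys and param names are hypothetical."""
    kind = config.get("norm_type", "rmsnorm")   # newer default
    eps = config.get("norm_eps", 1e-6)
    if kind == "rmsnorm":
        weight = params["weight"]               # scale only, no bias expected
        return lambda x: rms_norm(x, weight, eps)
    if kind == "layernorm":
        weight, bias = params["weight"], params["bias"]
        return lambda x: layer_norm(x, weight, bias, eps)
    raise ValueError(f"unknown norm_type: {kind!r}")

new_style = build_norm({"norm_type": "rmsnorm"}, {"weight": [1.0, 1.0, 1.0]})
old_style = build_norm({"norm_type": "layernorm", "norm_eps": 1e-5},
                       {"weight": [1.0, 1.0, 1.0], "bias": [0.0, 0.0, 0.0]})
print(new_style([1.0, 2.0, 3.0]))
print(old_style([1.0, 2.0, 3.0]))
```

Failing loudly on an unknown norm type is a deliberate choice here: silently falling back to the wrong normalization would load without error and produce degraded output.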

CPU implementation

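A straightforward CPU implementation makes one pass over the d-dimensional vector to accumulate the sum of squares, then a second pass to divide and scale. In a fused kernel the vector stays in cache or registers between the two passes, so activations are read from memory only once. This is critical for SLM inference latency because the norm typically runs twice per transformer block (before attention and before the MLP), in every layer.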
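The two-pass structure can be sketched with explicit loops. Python is used here only to show the shape of the kernel; a real CPU implementation would be written in a compiled language with SIMD. The function name `rms_norm_inplace` and the in-place buffer convention are assumptions for the example.

```python
import math

def rms_norm_inplace(buf, gamma, eps=1e-6):
    """Normalize buf in place; buf and gamma are lists of equal length."""
    d = len(buf)
    # Pass 1: a single reduction over the vector.
    sum_sq = 0.0
    for v in buf:
        sum_sq += v * v
    # One sqrt and one division per vector, so pass 2 only multiplies.
    inv_rms = 1.0 / math.sqrt(sum_sq / d + eps)
    # Pass 2: scale each element and write it back to the same buffer.
    for i in range(d):
        buf[i] = buf[i] * inv_rms * gamma[i]
    return buf

x = [1.0, -2.0, 3.0, -4.0]
gamma = [0.5, 1.5, -1.0, 2.0]
out = rms_norm_inplace(list(x), gamma)
print(out)
```

Precomputing the inverse RMS replaces d divisions with d multiplications, and writing in place avoids allocating a second buffer for the output.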

RMSNorm = LayerNorm without mean centering. Roughly 10-15% faster, same quality in practice, and the modern default. Critical to fuse on CPU.