RMSNorm is a simplification of LayerNorm: skip the mean subtraction and divide only by the root mean square (RMS). It is faster, matches LayerNorm quality in practice, and is used by most modern open LLMs (Llama, Mistral, Phi-3, Qwen). The reduction from LayerNorm is straightforward.
Forward pass
# Drop the mean centering:
RMS = sqrt((1/d) * sum(x²) + ε)
output = γ * x / RMS
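# γ ∈ ℝ^d is learnable; there is no β
Just one statistic (RMS) instead of two (mean and variance), and one scale parameter instead of scale plus shift. That means less arithmetic per element. The speedup is often quoted at roughly 10-15%, though the actual gain depends on the hardware and on how well the kernel is fused.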
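The forward pass above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names `rms_norm` and `layer_norm` and the default `eps=1e-6` are assumptions made for this example. LayerNorm is included only to show the extra statistic and the extra parameter it needs.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm over the last axis: gamma * x / sqrt(mean(x^2) + eps)."""
    mean_sq = np.mean(np.square(x), axis=-1, keepdims=True)
    return gamma * x / np.sqrt(mean_sq + eps)

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm for comparison: two statistics (mean, variance) and a shift beta."""
    mu = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
gamma = np.ones(4)
y = rms_norm(x, gamma)

# With gamma = 1, the output has a mean square very close to 1 (exactly 1 if eps were 0).
mean_sq_out = float(np.mean(y ** 2))

# On an input whose mean is already zero, RMSNorm and LayerNorm (beta = 0) coincide.
x0 = np.array([[-1.0, 1.0, -2.0, 2.0]])
same_on_centered = np.allclose(rms_norm(x0, gamma), layer_norm(x0, gamma, np.zeros(4)))
```

The second check is one way to see what the simplification amounts to: the two layers differ only when the input has a nonzero mean.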
Why drop the mean?
Zhang & Sennrich (2019) showed empirically that the mean subtraction (re-centering) in LayerNorm contributes little to model quality on transformers. The rescaling by the activation magnitude does the heavy lifting. Dropping the mean gives simpler kernels and faster training and inference, with essentially the same loss curves.
Backward pass
# Cleaner than LayerNorm:
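g = γ * dy    # elementwise; γ must be applied before the reduction
dx = (1 / RMS) * (g - x * sum(g * x) / (d * RMS²))
dγ = dy * x / RMS    # summed over batch and sequence positions
Only one reduction term to backprop through (the RMS) versus two for LayerNorm (mean and variance), so the gradient is slightly cheaper to compute.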
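A quick way to sanity-check a backward formula like this is to compare it against central finite differences. The sketch below does that in NumPy for the input gradient of y = γ · x / RMS with a per-element γ. The function names, shapes, and step size `h` are assumptions chosen for this example.

```python
import numpy as np

def rms_norm_forward(x, gamma, eps=1e-6):
    """Return the output and the per-row RMS (kept for the backward pass)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * x / rms, rms

def rms_norm_backward(dy, x, gamma, rms):
    """Analytic gradients for a 2D input x of shape (n, d)."""
    d = x.shape[-1]
    g = gamma * dy                                   # gradient w.r.t. x / rms
    dot = np.sum(g * x, axis=-1, keepdims=True)      # the single reduction term
    dx = (g - x * dot / (d * rms ** 2)) / rms
    dgamma = np.sum(dy * x / rms, axis=0)            # summed over rows
    return dx, dgamma

def numerical_dx(x, gamma, dy, h=1e-5):
    """Central-difference estimate of dL/dx for L = sum(y * dy)."""
    dx = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        xp = x.copy(); xp[idx] += h
        xm = x.copy(); xm[idx] -= h
        yp, _ = rms_norm_forward(xp, gamma)
        ym, _ = rms_norm_forward(xm, gamma)
        dx[idx] = np.sum((yp - ym) * dy) / (2 * h)
    return dx

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
gamma = rng.normal(size=8)
dy = rng.normal(size=(3, 8))

_, rms = rms_norm_forward(x, gamma)
dx, dgamma = rms_norm_backward(dy, x, gamma, rms)
max_err = float(np.max(np.abs(dx - numerical_dx(x, gamma, dy))))
```

Using a random, non-constant `gamma` matters here: a formula that applies γ outside the reduction would still pass the check when all γ entries are equal.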
Adoption picture
Llama (all sizes), Mistral, Phi-3/4, Qwen 2/2.5, Gemma, and DeepSeek all use RMSNorm. BERT- and GPT-2-era models used LayerNorm, as do Phi-2 and MPT. Newer code defaults to RMSNorm; loaders that support older checkpoints need both code paths.
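Supporting both code paths in a loader can be as simple as a small dispatch on the checkpoint's configuration. The sketch below is hypothetical throughout: the config keys `norm_type` and `norm_eps`, the `params` dictionary layout, and the function `build_norm` are invented for illustration and do not correspond to any particular library's format.

```python
import numpy as np

def rms_norm(x, weight, eps):
    return weight * x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def layer_norm(x, weight, bias, eps):
    mu = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    return weight * (x - mu) / np.sqrt(var + eps) + bias

def build_norm(config, params):
    """Pick the normalization from a (hypothetical) config dict.

    RMSNorm checkpoints carry only a weight; LayerNorm checkpoints
    also carry a bias, so the two branches read different parameters.
    """
    kind = config.get("norm_type", "rmsnorm")   # hypothetical key and default
    eps = config.get("norm_eps", 1e-6)          # hypothetical key and default
    if kind == "rmsnorm":
        return lambda x: rms_norm(x, params["weight"], eps)
    if kind == "layernorm":
        return lambda x: layer_norm(x, params["weight"], params["bias"], eps)
    raise ValueError(f"unknown norm_type: {kind!r}")

x = np.array([[1.0, 2.0, 3.0, 4.0]])
new_style = build_norm({"norm_type": "rmsnorm"}, {"weight": np.ones(4)})
old_style = build_norm({"norm_type": "layernorm"},
                       {"weight": np.ones(4), "bias": np.zeros(4)})
y_new, y_old = new_style(x), old_style(x)
```

Failing loudly on an unknown `norm_type` is a deliberate choice in this sketch: silently falling back to the wrong normalization would load without error and produce degraded outputs.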
CPU implementation
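One pass over the d-dimensional vector computes the sum of squares; a second pass divides by the RMS and applies the scale. In a fused kernel the vector stays in cache or registers between the two passes, so activations are read from memory only once. This matters for small language model (SLM) inference latency, because the normalization typically runs twice per transformer block (before attention and before the MLP) in every layer.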
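The two-pass structure can be written out explicitly. The sketch below uses plain Python loops so that each pass is visible; it illustrates the access pattern only and is not a performance-oriented kernel. The name `rms_norm_two_pass` and the default `eps` are assumptions made for this example.

```python
import math

def rms_norm_two_pass(x, gamma, eps=1e-6):
    """RMSNorm of one d-dimensional vector using two passes over x."""
    d = len(x)

    # Pass 1: reduce to a single scalar, the sum of squares.
    sum_sq = 0.0
    for v in x:
        sum_sq += v * v

    # Compute the reciprocal once so pass 2 needs only multiplications.
    inv_rms = 1.0 / math.sqrt(sum_sq / d + eps)

    # Pass 2: normalize and apply the learned scale.
    out = [0.0] * d
    for i in range(d):
        out[i] = gamma[i] * x[i] * inv_rms
    return out

out = rms_norm_two_pass([3.0, 4.0], [1.0, 1.0], eps=0.0)
```

For the input `[3.0, 4.0]` the mean square is 12.5, so each element is divided by sqrt(12.5). In a compiled kernel both loops would usually be vectorized, with the vector small enough to stay in cache between them.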