LayerNorm: (x - mean) / std. RMSNorm: x / RMS, where RMS = sqrt(mean(x^2)). One statistic instead of two.
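The two formulas can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: the function names, the `eps` stabilizer, and the sample vector are assumptions added here, and the learned gain and bias that real layers usually apply afterwards are left out.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Two statistics over the feature (last) axis: mean and variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-6):
    # One statistic over the feature axis: the mean of squares.
    mean_sq = np.mean(x * x, axis=-1, keepdims=True)
    return x / np.sqrt(mean_sq + eps)

x = np.array([[1.0, 2.0, 3.0, 6.0]])  # arbitrary example input
print("LayerNorm:", layer_norm(x))
print("RMSNorm:  ", rms_norm(x))
```

When the input already has zero mean, the variance equals the mean of squares, so the two functions return the same values.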
What you're seeing
Empirically, dropping mean centering barely affects quality but saves arithmetic. Many recent open LLMs (including Llama, Mistral, and Phi) use RMSNorm.
★ KEY TAKEAWAY
RMSNorm = LayerNorm minus mean centering. ~10–15% faster, near-identical quality, modern default for Llama/Mistral/Phi.
▶ WHAT TO TRY
- Click Resample on inputs with non-zero mean.
- LayerNorm centers them to mean=0. RMSNorm only rescales, so the output mean stays non-zero (the input mean divided by the RMS).
- Empirically, this difference doesn't hurt quality on transformer-shaped models.
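The same experiment can be reproduced offline. The sketch below assumes that "Resample" draws Gaussian inputs; the offset of 3.0, the sample size, the seed, and `eps` are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
x = rng.normal(loc=3.0, scale=1.0, size=8)     # inputs with a non-zero mean

eps = 1e-6
ln_out = (x - x.mean()) / np.sqrt(x.var() + eps)   # LayerNorm: center, then scale
rms = np.sqrt(np.mean(x ** 2) + eps)
rms_out = x / rms                                  # RMSNorm: scale only

print("input mean    :", x.mean())
print("LayerNorm mean:", ln_out.mean())    # numerically zero
print("RMSNorm mean  :", rms_out.mean())   # input mean divided by the RMS
```

Resampling with `loc=0.0` brings the two outputs close together, since the only term RMSNorm skips is the mean subtraction.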