LayerNorm — Math and Backward

Layer Normalization normalizes each token's activations to zero mean and unit variance, then applies a learnable scale and shift. It's the variance-control mechanism that makes deep transformers trainable. Knowing its forward and backward forms helps debug numerical issues and explains why RMSNorm, which skips the mean subtraction, is faster.

Forward pass

# For each token's vector x ∈ ℝ^d:
μ      = (1/d) * sum(x)
σ²     = (1/d) * sum((x - μ)²)
x_norm = (x - μ) / sqrt(σ² + ε)
output = γ * x_norm + β

# γ, β ∈ ℝ^d are learnable

Compute statistics ALONG the feature dim, INDEPENDENTLY for each token. ε ≈ 1e-5 prevents divide-by-zero. γ (scale) and β (shift) let the model 'undo' normalization where useful.
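The forward pass above can be sketched in plain Python. This is a minimal illustration using lists and the standard library only, not how a framework implements it; the function name `layer_norm` and the sample `tokens` are made up for this example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's feature vector x (a list of d floats)."""
    d = len(x)
    mu = sum(x) / d                               # mean over the feature dim
    var = sum((xi - mu) ** 2 for xi in x) / d     # biased variance (1/d)
    inv_std = 1.0 / math.sqrt(var + eps)
    x_norm = [(xi - mu) * inv_std for xi in x]
    return [g * xn + b for g, xn, b in zip(gamma, x_norm, beta)]

# Two hypothetical tokens with d = 4; each row is normalized on its own.
tokens = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 50.0]]
gamma, beta = [1.0] * 4, [0.0] * 4    # identity scale and zero shift
out = [layer_norm(t, gamma, beta) for t in tokens]
```

With γ = 1 and β = 0, every row of `out` has mean 0 and variance just under 1 (the small gap comes from ε in the denominator).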

Per-token = no batch dependence

Unlike BatchNorm, LayerNorm doesn't compute statistics across the batch, only across the feature dim of a single token. The forward pass is therefore identical at train and inference time, with no running averages to track. Critical for autoregressive generation, where the batch size is often 1.
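To make the contrast concrete, the sketch below normalizes the same token inside two different batches. The `batch_norm` helper is a simplified training-mode BatchNorm (no affine parameters, no running averages) written only for this comparison; all names and values are illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    """Statistics over the features of ONE token."""
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

def batch_norm(batch, eps=1e-5):
    """Simplified training-mode BatchNorm: statistics per feature, ACROSS rows."""
    n, d = len(batch), len(batch[0])
    mu = [sum(row[j] for row in batch) / n for j in range(d)]
    var = [sum((row[j] - mu[j]) ** 2 for row in batch) / n for j in range(d)]
    return [[(row[j] - mu[j]) / math.sqrt(var[j] + eps) for j in range(d)]
            for row in batch]

token = [0.5, -1.0, 2.0, 3.5]
batch_a = [token, [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
batch_b = [token, [9.0, -9.0, 9.0, -9.0], [0.0, 0.0, 0.0, 0.0]]

# LayerNorm: the token's output ignores its neighbours.
ln_a = [layer_norm(row) for row in batch_a][0]
ln_b = [layer_norm(row) for row in batch_b][0]

# BatchNorm: the same token gets a different output in each batch.
bn_a = batch_norm(batch_a)[0]
bn_b = batch_norm(batch_b)[0]

print(ln_a == ln_b)   # True
print(bn_a == bn_b)   # False
```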

Backward pass

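# dL/dx is non-trivial because μ and σ² both depend on every x[i].
# With dout = dL/d(output) and dx_norm = dout * γ:
dx = (1/d) * (1/sqrt(σ²+ε)) *
     (d * dx_norm - sum(dx_norm) - x_norm * sum(dx_norm * x_norm))
# Parameter gradients, summed over all tokens:
dγ = sum_tokens(dout * x_norm);  dβ = sum_tokens(dout)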

There are three terms because changing one x[i] contributes directly (d * dx_norm), shifts the mean (sum(dx_norm)), and changes the variance (x_norm * sum(dx_norm * x_norm)); the last two affect every position. PyTorch's autograd handles this for you, but the formula is worth knowing when debugging exploding gradients near norm layers.
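One way to build confidence in the dx formula is to check it against finite differences. The sketch below does that for a single token in plain Python; the helper names (`ln_forward`, `ln_backward`, `loss`) and the test vectors are invented for this example, and the loss is an arbitrary weighted sum of the outputs.

```python
import math

def ln_forward(x, gamma, beta, eps=1e-5):
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    x_norm = [(xi - mu) * inv_std for xi in x]
    out = [g * xn + b for g, xn, b in zip(gamma, x_norm, beta)]
    return out, x_norm, inv_std

def ln_backward(dout, x_norm, inv_std, gamma):
    d = len(dout)
    dx_norm = [do * g for do, g in zip(dout, gamma)]    # gradient w.r.t. x_norm
    s1 = sum(dx_norm)                                   # mean path
    s2 = sum(a * b for a, b in zip(dx_norm, x_norm))    # variance path
    dx = [(inv_std / d) * (d * dxn - s1 - xn * s2)
          for dxn, xn in zip(dx_norm, x_norm)]
    dgamma = [do * xn for do, xn in zip(dout, x_norm)]  # single token, so no sum
    dbeta = list(dout)
    return dx, dgamma, dbeta

# Arbitrary test values (hypothetical).
x = [0.3, -1.2, 2.5, 0.7]
gamma = [1.5, 0.5, -1.0, 2.0]
beta = [0.1, 0.0, -0.2, 0.3]
w = [0.2, -0.4, 1.0, 0.6]   # upstream gradient dL/d(output)

def loss(xv):
    out, _, _ = ln_forward(xv, gamma, beta)
    return sum(wi * oi for wi, oi in zip(w, out))

_, x_norm, inv_std = ln_forward(x, gamma, beta)
dx, dgamma, dbeta = ln_backward(w, x_norm, inv_std, gamma)

# Central finite differences, one coordinate at a time.
h = 1e-5
dx_numeric = []
for i in range(len(x)):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    dx_numeric.append((loss(xp) - loss(xm)) / (2 * h))

max_err = max(abs(a - b) for a, b in zip(dx, dx_numeric))
```

Here `max_err` should be tiny, and `sum(dx)` should be near zero: adding a constant to every x[i] leaves the output unchanged, so the gradient has no component in that direction.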

Parameter count is tiny

γ and β are vectors of size d, so each LN has 2·d params. For d=2048 and L=24 layers with 2 norms per block, that is 2·2048·2·24 = 196,608 ≈ 200K params for ALL norms in the model. Negligible vs attention/FFN params (~1B). Compute is also trivial: O(d) per token, versus O(d²) for the matmuls.
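The arithmetic can be reproduced in a few lines. The attention/FFN estimate assumes a standard block with 4·d² attention weights (Q, K, V, and output projections) and an FFN with hidden size 4·d (8·d² weights), ignoring biases and embeddings; that layout is an assumption of this sketch, not something the text specifies.

```python
d_model = 2048
n_layers = 24
norms_per_block = 2                  # two norms per block, as in the text

params_per_norm = 2 * d_model        # gamma + beta
norm_params = params_per_norm * norms_per_block * n_layers

# Assumed layout: 4*d^2 (attention) + 8*d^2 (FFN with hidden size 4*d).
block_matmul_params = 12 * d_model ** 2
matmul_params = block_matmul_params * n_layers

print(norm_params)                   # 196608
print(matmul_params)                 # 1207959552
print(norm_params / matmul_params)   # roughly 1.6e-4
```

Under these assumptions the ratio simplifies to 1/(3·d), so the norms' share of parameters shrinks further as the model gets wider.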

Storage and CPU performance

LN reads N·d activations (N tokens), computes 2 reductions per token (mean, variance), and writes N·d values back. With only a few FLOPs per element, it is memory-bandwidth bound. On CPU, streaming the data once is ideal: fuse with the next matmul if possible. Compilers and runtimes (e.g. tinygrad, OpenVINO) often fuse LN with adjacent ops.
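The text does not say how the two reductions are implemented. One well-known option for getting both statistics from a single read of the data is Welford's online algorithm, sketched below in plain Python. This illustrates the access pattern only; pure Python says nothing about real CPU throughput, and the function names are invented for this example.

```python
import math

def layer_norm_one_pass(x, gamma, beta, eps=1e-5):
    """Mean and variance from ONE read of x (Welford), then one write pass."""
    mean, m2 = 0.0, 0.0
    for k, xi in enumerate(x, start=1):
        delta = xi - mean
        mean += delta / k
        m2 += delta * (xi - mean)     # running sum of squared deviations
    var = m2 / len(x)                 # biased variance, matching the 1/d form
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (xi - mean) * inv_std + b for xi, g, b in zip(x, gamma, beta)]

def layer_norm_two_pass(x, gamma, beta, eps=1e-5):
    """Reference: separate passes for the mean and the variance."""
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (xi - mu) * inv_std + b for xi, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
gamma, beta = [1.0, 2.0, 0.5, 1.0], [0.0, 0.1, 0.0, -0.1]
a = layer_norm_one_pass(x, gamma, beta)
b = layer_norm_two_pass(x, gamma, beta)
max_diff = max(abs(p - q) for p, q in zip(a, b))
```

The two versions agree up to floating-point rounding; the one-pass form trades a division per element for one fewer sweep over memory.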

LayerNorm: per-token, mean+var normalize, learnable scale+shift. Batch-independent. Trivial param count, BW-bound on CPU.