Layer Normalization normalizes each token's activation vector to zero mean and unit variance, then applies a learnable scale and shift. It's the variance-control mechanism that helps make deep transformers trainable. Knowing its forward and backward forms helps debug numerical issues and explains why RMSNorm, which drops the mean subtraction and the shift β, is faster.
Forward pass
# For each token's vector x ∈ ℝ^d:
μ = (1/d) * sum(x)
σ² = (1/d) * sum((x - μ)²)
x_norm = (x - μ) / sqrt(σ² + ε)
output = γ * x_norm + β
# γ, β ∈ ℝ^d are learnable
Compute statistics ALONG the feature dim, INDEPENDENTLY for each token. ε ≈ 1e-5 prevents divide-by-zero. γ (scale) and β (shift) let the model 'undo' normalization where useful.
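The forward pass above can be sketched in plain Python for a single token vector. This is a minimal illustration, not a production kernel: the function name `layer_norm` and the sample values are assumptions made for the example, and lists stand in for tensors so the snippet runs without dependencies.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer-normalize one token vector x (a list of d floats)."""
    d = len(x)
    mu = sum(x) / d                                # mean over the feature dim
    var = sum((xi - mu) ** 2 for xi in x) / d      # biased variance (divide by d)
    inv_std = 1.0 / math.sqrt(var + eps)           # eps keeps this finite when var is ~0
    x_norm = [(xi - mu) * inv_std for xi in x]
    # Learnable per-feature scale (gamma) and shift (beta)
    return [g * xn + b for g, xn, b in zip(gamma, x_norm, beta)]

x = [1.0, 2.0, 3.0, 4.0]                           # hypothetical token vector, d = 4
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

y_mean = sum(y) / len(y)
y_var = sum((yi - y_mean) ** 2 for yi in y) / len(y)
print(y_mean, y_var)                               # close to 0 and 1 respectively
```

With γ = 1 and β = 0 the output variance is slightly below 1, because ε is added to the variance before the square root.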
Per-token = no batch dependence
Unlike BatchNorm, LayerNorm doesn't compute statistics across the batch, only across the feature dimension of a single token. The forward pass is identical at training and inference time, so there are no running averages to track. This matters for autoregressive generation, where the batch size is often 1.
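To make the batch independence concrete, the sketch below normalizes the same token alone and inside a larger batch. The `batch_norm` helper is a simplified training-mode BatchNorm without affine parameters or running averages, written only for contrast; all names and values here are illustrative assumptions.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token over its own features (gamma=1, beta=0 for brevity)."""
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

def batch_norm(batch, eps=1e-5):
    """Simplified training-mode BatchNorm: statistics per feature, across the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mu = sum(col) / n
        var = sum((c - mu) ** 2 for c in col) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mu) / math.sqrt(var + eps)
    return out

token = [0.5, -1.0, 2.0]
batch = [[10.0, 20.0, 30.0], token, [-5.0, 0.0, 5.0]]

ln_alone = layer_norm(token)
ln_in_batch = [layer_norm(t) for t in batch][1]
bn_alone = batch_norm([token])[0]       # batch of 1: every feature equals its own mean
bn_in_batch = batch_norm(batch)[1]

ln_same = ln_alone == ln_in_batch       # LayerNorm ignores the other tokens
bn_same = bn_alone == bn_in_batch       # BatchNorm output depends on them
print(ln_same, bn_same)
```

Note that the batch-of-one BatchNorm output collapses to zeros, which is the failure mode that running averages exist to work around at inference time.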
Backward pass
# d_out / d_x is non-trivial due to the mean and variance terms.
# dx_norm = γ * d_out is the gradient w.r.t. x_norm; sums run over the feature dim.
dx = (1/d) * (1/sqrt(σ²+ε)) *
     (d * dx_norm - sum(dx_norm) - x_norm * sum(dx_norm * x_norm))
Three terms because changing one x[i] contributes directly, shifts the mean (which affects all positions), and changes the variance (which also affects all positions). PyTorch's autograd handles this for you, but the form is worth knowing when debugging exploding gradients near norm layers.
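One way to gain confidence in the backward formula is to compare it against a finite-difference gradient. The sketch below does that for a single token in plain Python; the function names, the test vectors, and the linear loss `sum(w * out)` are assumptions chosen for the check, and `dx_norm` is taken to be `gamma * d_out`.

```python
import math

def ln_forward(x, gamma, beta, eps=1e-5):
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    x_norm = [(xi - mu) * inv_std for xi in x]
    out = [g * xn + b for g, xn, b in zip(gamma, x_norm, beta)]
    return out, x_norm, inv_std

def ln_backward(d_out, x_norm, inv_std, gamma):
    """Gradient of the loss w.r.t. x, given d_out = dLoss/d_output."""
    d = len(d_out)
    dx_norm = [g * do for g, do in zip(gamma, d_out)]     # back through the scale
    s1 = sum(dx_norm)                                     # mean path
    s2 = sum(a * b for a, b in zip(dx_norm, x_norm))      # variance path
    return [inv_std / d * (d * dxn - s1 - xn * s2)
            for dxn, xn in zip(dx_norm, x_norm)]

# Hypothetical inputs for the check
x = [0.3, -1.2, 2.5, 0.7]
gamma = [1.5, 0.5, -1.0, 2.0]
beta = [0.1, 0.0, -0.2, 0.3]
w = [1.0, -2.0, 0.5, 3.0]          # loss = sum(w * out), so d_out = w

def loss(xv):
    out, _, _ = ln_forward(xv, gamma, beta)
    return sum(wi * oi for wi, oi in zip(w, out))

_, x_norm, inv_std = ln_forward(x, gamma, beta)
analytic = ln_backward(w, x_norm, inv_std, gamma)

# Central finite differences, one coordinate at a time
h = 1e-6
numeric = []
for i in range(len(x)):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    numeric.append((loss(xp) - loss(xm)) / (2 * h))

max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_err)                      # should be tiny if the formula is right
```

A useful sanity property: the entries of `dx` sum to roughly zero, since adding a constant to every x[i] leaves the output unchanged.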
Parameter count is tiny
γ and β are vectors of size d, so each LN has 2·d parameters. For d=2048 and L=24 layers with 2 norms per block, that is 2 · 2048 · 2 · 24 = 196,608 ≈ 200K parameters for ALL norms in the model. Negligible next to the attention/FFN parameters (~1B). Compute is also trivial.
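The arithmetic can be reproduced in a few lines. The 12·d² per-block figure for attention plus FFN weights is a common rough estimate and an assumption here (4·d² for the attention projections, 8·d² for an FFN with 4x expansion, biases and embeddings ignored); the helper name is hypothetical.

```python
def layer_norm_params(d, n_blocks, norms_per_block=2):
    """gamma and beta are each of size d, so every LN has 2*d parameters."""
    return 2 * d * norms_per_block * n_blocks

d, n_blocks = 2048, 24
ln_params = layer_norm_params(d, n_blocks)        # 196608

# Rough dense-weight estimate (assumption): 4*d^2 attention + 8*d^2 FFN per block
dense_params = 12 * d * d * n_blocks              # 1207959552

ln_fraction = ln_params / dense_params
print(ln_params, dense_params, f"{ln_fraction:.4%}")
```

Under these assumptions the norms account for well under a tenth of a percent of the block weights.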
Storage and CPU performance
For N tokens, LN reads N·d activations, computes 2 reductions (mean, var), and writes N·d values back. It does very little arithmetic per byte moved, so it is memory-bandwidth bound. On CPU, streaming the data once is ideal: fuse LN with the next matmul if possible. Frameworks and compilers (e.g. tinygrad, OpenVINO) often fuse LN with adjacent ops.
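As an illustration of reading the input only once for the statistics, the sketch below computes the mean and variance in a single pass using Welford's online algorithm. The source does not name a particular method, so this is a swapped-in technique chosen for the example, and the function names are hypothetical. A second pass over x is still needed to write the normalized values, which is the step a fused kernel would merge with the following op.

```python
import math

def layer_norm_one_pass(x, eps=1e-5):
    """LayerNorm (gamma=1, beta=0) with mean and variance from one read of x."""
    mean = 0.0
    m2 = 0.0                                   # running sum of squared deviations
    for n, xi in enumerate(x, start=1):
        delta = xi - mean
        mean += delta / n
        m2 += delta * (xi - mean)              # Welford update
    var = m2 / len(x)                          # biased variance, as in the forward pass
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(xi - mean) * inv_std for xi in x]

def layer_norm_two_pass(x, eps=1e-5):
    """Reference: separate reductions for the mean and the variance."""
    d = len(x)
    mu = sum(x) / d
    var = sum((xi - mu) ** 2 for xi in x) / d
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

x = [1.0, 2.0, 3.0, 4.0]
one_pass = layer_norm_one_pass(x)
two_pass = layer_norm_two_pass(x)
max_diff = max(abs(a - b) for a, b in zip(one_pass, two_pass))
print(max_diff)                                # the two versions should agree closely
```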