Wrong init = no training. Right init = stable, fast convergence. The math behind Xavier/Kaiming/Truncated-normal init isn't deep but matters enormously. Transformers have specific conventions.

Zero init: dead

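If all weights start at 0, every hidden unit computes the same value and receives the same gradient. In a multi-layer network with zero-centered activations (tanh, ReLU) it is worse: hidden activations are zero, so the last layer's weight gradients are zero, and the zero weights block the backward signal to earlier layers. Weight updates are zero and the network never trains. Trivial but instructive: initialization can't be neutral.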
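The dead-network argument can be checked numerically. This is a minimal sketch, assuming a bias-free two-layer tanh MLP with a squared-error loss; the helper `mlp_grads` and the layer sizes are illustrative choices, not something the recipe above prescribes.

```python
import numpy as np

def mlp_grads(W1, W2, x, y):
    """Weight gradients of 0.5 * mean squared error for y_hat = tanh(x @ W1) @ W2."""
    h = np.tanh(x @ W1)                        # hidden activations
    y_hat = h @ W2                             # network output
    err = (y_hat - y) / len(x)                 # dLoss/dy_hat
    grad_W2 = h.T @ err                        # zero whenever h is zero
    grad_h = err @ W2.T                        # zero whenever W2 is zero
    grad_W1 = x.T @ (grad_h * (1.0 - h ** 2))  # tanh'(z) = 1 - tanh(z)^2
    return grad_W1, grad_W2

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
y = rng.normal(size=(32, 1))

# All-zero weights: both weight gradients vanish exactly.
g1, g2 = mlp_grads(np.zeros((8, 16)), np.zeros((16, 1)), x, y)
zero_init_stuck = bool(not g1.any() and not g2.any())

# Small random weights: both gradients are nonzero, so training can start.
W1 = rng.normal(scale=0.1, size=(8, 16))
W2 = rng.normal(scale=0.1, size=(16, 1))
r1, r2 = mlp_grads(W1, W2, x, y)
random_init_trains = bool(r1.any() and r2.any())

print(zero_init_stuck, random_init_trains)
```

With biases, the output bias would still receive a gradient, but the hidden weights stay stuck for the same reason.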

Constant variance principle

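If the output of each layer keeps variance ~1 in the forward pass, and the gradient at each layer keeps roughly the same variance in the backward pass, then the entire network is well-conditioned. Otherwise the per-layer scale factor compounds with depth: activations and gradients vanish when it is below 1 and explode when it is above 1.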
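A quick way to see the compounding is to push unit-variance data through a stack of purely linear layers. This is a sketch under simplifying assumptions (no activation function, square layers, Gaussian weights); `final_variance` and its `gain` parameter are hypothetical names, with var(W) = gain / width so that each layer multiplies the variance by roughly `gain`.

```python
import numpy as np

def final_variance(gain, depth=20, width=256, batch=512, seed=0):
    """Activation variance after `depth` linear layers with var(W) = gain / width."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))  # unit-variance input
    for _ in range(depth):
        W = rng.normal(scale=np.sqrt(gain / width), size=(width, width))
        h = h @ W                        # variance scales by about `gain` per layer
    return float(h.var())

# The result is roughly gain ** depth: vanishing, stable, exploding.
for gain in (0.5, 1.0, 2.0):
    print(gain, final_variance(gain))
```

Only gain = 1 keeps the signal at a usable scale after 20 layers; the other two drift by about six orders of magnitude in opposite directions.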

Xavier (Glorot) — for sigmoid/tanh

# For W ∈ ℝ^(d_in × d_out):
var(W) = 1 / d_in   # forward-pass: maintain variance
# or
var(W) = 2 / (d_in + d_out)   # average forward and backward

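Suited to zero-centered, symmetric activations such as tanh, which are close to linear near 0. With var(W) = 1 / d_in the variance of the pre-activation matches the variance of the input, so activations neither collapse toward zero nor saturate as depth grows.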
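Both variance formulas fit in a few lines of NumPy. This is a sketch, not a library API: `xavier_normal` and its `average` flag are illustrative names, and the Gaussian sampling is an assumption (a uniform distribution with the same variance is also common).

```python
import numpy as np

def xavier_normal(d_in, d_out, rng, average=True):
    """Sample W of shape (d_in, d_out) with Xavier/Glorot variance."""
    var = 2.0 / (d_in + d_out) if average else 1.0 / d_in
    return rng.normal(scale=np.sqrt(var), size=(d_in, d_out))

rng = np.random.default_rng(0)
d_in, d_out = 512, 256
x = rng.normal(size=(1024, d_in))                     # unit-variance input

# Forward-pass variant: pre-activation variance matches input variance.
W_fwd = xavier_normal(d_in, d_out, rng, average=False)
z = x @ W_fwd
print(x.var(), z.var())                               # both close to 1

# Averaged variant: sampled weight variance matches 2 / (d_in + d_out).
W_avg = xavier_normal(d_in, d_out, rng, average=True)
print(W_avg.var(), 2.0 / (d_in + d_out))
```

When d_in and d_out differ, the averaged variant is a compromise: it no longer preserves the forward variance exactly, but it is closer to preserving the backward one.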

Kaiming (He) — for ReLU

# ReLU zeros out half the inputs:
var(W) = 2 / d_in   # compensate for the dead half

ReLU sets negative pre-activations to zero, which halves the second moment of the signal passed to the next layer. Kaiming init doubles the weight variance (2 / d_in instead of Xavier's 1 / d_in) to compensate. Critical for deep ReLU networks; less critical for SwiGLU/GELU but still good practice.
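The factor of 2 is easy to verify on a deep ReLU stack. This is a minimal sketch assuming square layers, Gaussian weights, and no biases; `relu_signal` and its parameter `c` (with var(W) = c / d_in) are hypothetical names used only for this comparison.

```python
import numpy as np

def relu_signal(c, depth=20, width=512, batch=256, seed=0):
    """Mean squared activation after `depth` ReLU layers with var(W) = c / width."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))   # input with mean square ~1
    for _ in range(depth):
        W = rng.normal(scale=np.sqrt(c / width), size=(width, width))
        h = np.maximum(h @ W, 0.0)        # ReLU keeps about half the second moment
    return float(np.mean(h ** 2))

xavier_like = relu_signal(c=1.0)   # halves per layer, roughly 0.5 ** depth
kaiming = relu_signal(c=2.0)       # stays on the order of 1
print(xavier_like, kaiming)
```

With c = 1 the signal shrinks by about half per layer and is effectively gone after 20 layers; with c = 2 it stays at a usable scale, up to random fluctuation.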

Transformer specifics

# GPT / Llama init recipe:
# - Embedding: N(0, std=0.02)
# - Linear weights: N(0, std=0.02)
# - LayerNorm γ: 1, β: 0
# - Final layer of each residual branch (attention and MLP output projections): divide the std by sqrt(2 * n_layers) to keep the residual stream stable

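Standard for transformers. The final-layer scaling applies to the last linear layer of each residual branch (the attention and MLP output projections), not to the model's output head. Each of the n_layers blocks adds two branches to the residual stream, so shrinking those projections by 1/sqrt(2 * n_layers) keeps the stream's variance from growing with depth, which helps deep transformers train stably. GPT-2 uses it, and variants appear in later GPT-style models such as Llama and Phi.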
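The recipe can be sketched as a plain parameter dictionary. Everything about the layout here is an assumption for illustration: the function `init_gpt_params`, the key names, the 4x MLP width, and the fused QKV matrix are hypothetical and not taken from any specific codebase.

```python
import numpy as np

def init_gpt_params(n_layers, d_model, vocab_size, std=0.02, seed=0):
    """Toy GPT-style init: N(0, std) everywhere, scaled-down residual projections."""
    rng = np.random.default_rng(seed)
    resid_std = std / np.sqrt(2 * n_layers)   # two residual branches per block
    params = {"tok_emb": rng.normal(scale=std, size=(vocab_size, d_model))}
    for i in range(n_layers):
        params[f"block{i}"] = {
            "ln1_gamma": np.ones(d_model),
            "ln1_beta": np.zeros(d_model),
            "attn_qkv": rng.normal(scale=std, size=(d_model, 3 * d_model)),
            "attn_out": rng.normal(scale=resid_std, size=(d_model, d_model)),
            "ln2_gamma": np.ones(d_model),
            "ln2_beta": np.zeros(d_model),
            "mlp_in": rng.normal(scale=std, size=(d_model, 4 * d_model)),
            "mlp_out": rng.normal(scale=resid_std, size=(4 * d_model, d_model)),
        }
    return params

params = init_gpt_params(n_layers=12, d_model=128, vocab_size=1000)
blk = params["block0"]
print(blk["mlp_in"].std(), blk["mlp_out"].std())   # about 0.02 vs 0.02 / sqrt(24)
```

Only `attn_out` and `mlp_out` write into the residual stream, which is why they are the only matrices that get the smaller standard deviation.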

Init weights so var stays ~1. Kaiming for ReLU. Special transformer rule: scale the final layer of each residual branch by 1/sqrt(2L), where L is the number of layers.