The right init keeps activation variance ~1 through the network.
What you're seeing
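Zero init kills training: every activation is zero, so the weights receive no useful gradient. The right init sets var(W) from fan_in (roughly 1/fan_in), so each layer's output variance stays close to its input variance.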
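The fan_in rule can be checked numerically. The following is a minimal sketch, not the page's own code: it assumes a single linear layer y = x @ W with unit-variance Gaussian inputs, and the helper name `layer_output_variance`, the layer sizes, and the seed are all hypothetical choices for illustration.

```python
import numpy as np

def layer_output_variance(fan_in, weight_std, n_samples=2000, fan_out=256, seed=0):
    """Empirical variance of y = x @ W for unit-variance inputs x."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, fan_in))              # var(x) = 1
    W = rng.standard_normal((fan_in, fan_out)) * weight_std   # var(W) = weight_std**2
    y = x @ W
    return float(y.var())

fan_in = 512
# var(y) is approximately fan_in * var(W) * var(x)
unscaled = layer_output_variance(fan_in, weight_std=1.0)            # should land near fan_in
scaled = layer_output_variance(fan_in, weight_std=fan_in ** -0.5)   # should land near 1
print(f"unscaled: {unscaled:.1f}  scaled: {scaled:.3f}")
```

Each output sums fan_in independent products, so its variance grows with fan_in unless var(W) shrinks to compensate.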
★ KEY TAKEAWAY
The right init keeps activation variance ~1 through the network. Use Xavier for tanh, Kaiming for ReLU, and N(0, 0.02) for transformers, where 0.02 is the standard deviation, not the variance.
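As a rough illustration of why the scheme should match the activation, the sketch below writes out the standard deviations used by the normal variants of Xavier (var(W) = 2 / (fan_in + fan_out)) and Kaiming (var(W) = 2 / fan_in), then pushes a batch through a ReLU stack. The function names, the width of 512, and the depth of 10 are assumptions made for this example.

```python
import numpy as np

def xavier_std(fan_in, fan_out):
    # Xavier (Glorot) normal: var(W) = 2 / (fan_in + fan_out)
    return np.sqrt(2.0 / (fan_in + fan_out))

def kaiming_std(fan_in):
    # Kaiming (He) normal for ReLU: var(W) = 2 / fan_in
    return np.sqrt(2.0 / fan_in)

def relu_stack_second_moment(weight_std, width=512, depth=10, batch=1000, seed=0):
    """Mean squared activation after `depth` ReLU layers with N(0, weight_std**2) weights."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std
        h = np.maximum(h @ W, 0.0)  # ReLU
    return float(np.mean(h ** 2))

width = 512
with_kaiming = relu_stack_second_moment(kaiming_std(width), width)
with_xavier = relu_stack_second_moment(xavier_std(width, width), width)
print(f"Kaiming: {with_kaiming:.3f}  Xavier: {with_xavier:.5f}")
```

With fan_in equal to fan_out, Xavier gives var(W) = 1/fan_in, and ReLU zeroes the negative half of each pre-activation. The signal therefore shrinks by roughly a factor of 2 per layer in this sketch, while the extra factor of 2 in Kaiming compensates.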
▶ WHAT TO TRY
- Try Zero init: broken (all activations are zero, so nothing trains).
- Try Uniform: the range is far too wide, so training is unstable.
- Try Normal(0, 0.02): the transformer default (0.02 is the standard deviation).
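The three options above can also be compared offline. This is a sketch under stated assumptions, not a reproduction of the demo: it assumes a 5-layer tanh network of width 256 and takes "Uniform" to mean U(-1, 1), since the page gives neither the architecture nor the range. The scheme labels and helper names are hypothetical.

```python
import numpy as np

def make_weights(scheme, fan_in, fan_out, rng):
    if scheme == "zero":
        return np.zeros((fan_in, fan_out))
    if scheme == "uniform":       # assumed range U(-1, 1); the page does not state it
        return rng.uniform(-1.0, 1.0, size=(fan_in, fan_out))
    if scheme == "normal_0.02":   # 0.02 is the standard deviation
        return rng.standard_normal((fan_in, fan_out)) * 0.02
    raise ValueError(f"unknown scheme: {scheme}")

def last_layer_activations(scheme, width=256, depth=5, batch=512, seed=0):
    """Activations of the final tanh layer for a unit-variance Gaussian input batch."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))
    for _ in range(depth):
        h = np.tanh(h @ make_weights(scheme, width, width, rng))
    return h

stats = {}
for scheme in ("zero", "uniform", "normal_0.02"):
    a = last_layer_activations(scheme)
    # (activation std, fraction of units pinned near tanh's limits of -1 and 1)
    stats[scheme] = (float(a.std()), float(np.mean(np.abs(a) > 0.99)))
    print(f"{scheme:12s} std={stats[scheme][0]:.4f} saturated={stats[scheme][1]:.2f}")
```

In this plain tanh stack, zero init gives exactly zero activations and the wide uniform range saturates most units. A standard deviation of 0.02 is below 1/sqrt(256) ≈ 0.0625, so the signal shrinks with depth here. Transformers typically pair that default with normalization layers and residual connections, which this sketch does not model.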