Without residuals: gradient norm ≈ ‖∂f‖^L, which vanishes for deep nets when ‖∂f‖ < 1. With residuals: ≈ 1 (direct identity path).
What you're seeing
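The plot shows ‖∂ℒ/∂x_0‖, the norm of the loss gradient at the input, as a function of depth L. Without skip connections, the gradient is a product of L layer Jacobians ∂f, so it shrinks exponentially when each Jacobian has norm below 1. With skip connections, each layer contributes I + ∂f instead, and the identity path keeps the gradient near 1.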
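The two cases can be written out with the chain rule. This is a sketch under assumed notation that is not from the widget itself: layer l maps x_{l-1} to x_l, J_l = ∂f_l/∂x_{l-1} is that layer's Jacobian, ℒ is the loss, L is the depth, and c is a hypothetical bound on the Jacobian norms.

```latex
% No skip connections: x_l = f_l(x_{l-1})
\frac{\partial \mathcal{L}}{\partial x_0}
  = J_1^{\top} J_2^{\top} \cdots J_L^{\top} \, \frac{\partial \mathcal{L}}{\partial x_L},
\qquad
\left\| \frac{\partial \mathcal{L}}{\partial x_0} \right\|
  \le c^{L} \left\| \frac{\partial \mathcal{L}}{\partial x_L} \right\|
  \quad \text{if } \|J_l\| \le c < 1 .

% With skip connections: x_l = x_{l-1} + f_l(x_{l-1})
\frac{\partial x_L}{\partial x_0}
  = (I + J_L)(I + J_{L-1}) \cdots (I + J_1)
  = I + \sum_{l=1}^{L} J_l + \text{(products of two or more } J_l\text{)} .
```

The leading I in the second expansion is the direct path: it carries ∂ℒ/∂x_L back to x_0 unchanged whatever the J_l are. The remaining terms can add to or subtract from it, so "≈ 1" holds most cleanly when the J_l are small.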
★ KEY TAKEAWAY
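Residual connections give gradients a direct identity path, so they stay usable even in very deep stacks. Without them, gradients shrink exponentially with layer count whenever the per-layer Jacobian norm is below 1.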
▶ WHAT TO TRY
- Increase Depth L to 48 and watch the red (no-residual) curve crash to 10⁻¹⁰.
- The green (with-residual) curve stays near 1 regardless of depth, which is a key reason 100-layer transformers are trainable.
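The experiment above can be approximated offline with a small NumPy sketch. It is not the widget's implementation: the function name `input_gradient_norms`, the random linear layers, and the parameters `width`, `scale`, and `seed` are all assumptions chosen for illustration. Each layer Jacobian is rescaled to spectral norm `scale` (0.5 by default), so the no-residual gradient is guaranteed to shrink at least as fast as 0.5^depth.

```python
import numpy as np


def input_gradient_norms(depth, width=64, scale=0.5, seed=0):
    """Return (plain, residual) norms of dLoss/dx_0 for a stack of linear layers.

    Assumption: each layer Jacobian is a random Gaussian matrix rescaled so
    its spectral norm equals `scale`. With scale < 1 the plain product shrinks.
    """
    rng = np.random.default_rng(seed)
    g_plain = np.ones(width) / np.sqrt(width)  # unit-norm gradient at the output
    g_res = g_plain.copy()
    for _ in range(depth):
        J = rng.standard_normal((width, width))
        J *= scale / np.linalg.norm(J, 2)  # ord=2 on a matrix is the spectral norm
        g_plain = J.T @ g_plain            # no skip: backprop multiplies by J^T
        g_res = g_res + J.T @ g_res        # skip: backprop multiplies by (I + J)^T
    return np.linalg.norm(g_plain), np.linalg.norm(g_res)


if __name__ == "__main__":
    for depth in (6, 12, 24, 48):
        plain, res = input_gradient_norms(depth)
        print(f"depth={depth:3d}  no-residual={plain:.3e}  with-residual={res:.3e}")
```

In this toy setup the with-residual norm is not pinned to exactly 1, since the J terms also contribute, but it stays within a modest factor of its starting value while the no-residual norm collapses. Changing `scale` changes how quickly the no-residual curve falls.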