Residual Gradient Flow — Belgavi.AI Lab

Controls: Depth L = 24 · Sublayer ‖∂f‖ = 0.50

Without residuals: gradient ≈ ‖∂f‖^L, which vanishes for deep nets whenever ‖∂f‖ < 1. With residuals: ≈ 1, because the identity (skip) path carries the gradient directly.

What you're seeing

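The plot shows the gradient norm at the input, ‖∂ℒ/∂x_0‖ (ℒ is the loss, L is the depth), as a function of depth. Without skip connections, the gradient is a product of L sublayer Jacobians ∂f, so it shrinks exponentially when ‖∂f‖ < 1. With skips, each layer contributes I + ∂f, and the identity term gives the gradient a direct path that keeps its norm near 1.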
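The two curves can be approximated with a small NumPy experiment. This is a minimal sketch, not the code behind the plot: the function name `gradient_norms`, the width `dim=64`, and the use of random Gaussian Jacobians rescaled to spectral norm ‖∂f‖ are all assumptions made for illustration. In this toy setup the residual gradient stays within a small factor of 1 rather than exactly 1, while the plain gradient is bounded above by ‖∂f‖^L.

```python
import numpy as np

def gradient_norms(depth=24, sublayer_norm=0.5, dim=64, seed=0):
    """Backpropagate a unit gradient through `depth` layers.

    Returns (plain, resid): lists of gradient norms after each layer,
    without and with skip connections.
    """
    rng = np.random.default_rng(seed)
    g_plain = rng.standard_normal(dim)
    g_plain /= np.linalg.norm(g_plain)  # unit-norm gradient at the output
    g_resid = g_plain.copy()
    plain, resid = [], []
    for _ in range(depth):
        # Hypothetical sublayer Jacobian (assumption: random Gaussian),
        # rescaled so that its spectral norm equals sublayer_norm.
        J = rng.standard_normal((dim, dim))
        J *= sublayer_norm / np.linalg.norm(J, 2)
        g_resid = g_resid + J.T @ g_resid  # residual block x + f(x): (I + J)^T g
        g_plain = J.T @ g_plain            # plain block f(x): J^T g
        plain.append(float(np.linalg.norm(g_plain)))
        resid.append(float(np.linalg.norm(g_resid)))
    return plain, resid

plain, resid = gradient_norms()
print(f"no residual:   {plain[-1]:.2e}")
print(f"with residual: {resid[-1]:.2e}")
```

Random Jacobians rarely hit the worst case, so the no-residual value printed here typically falls well below the ‖∂f‖^L bound.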

★ KEY TAKEAWAY

Residual connections keep gradients alive at any depth. Without them, gradients vanish exponentially in the layer count whenever the per-layer gain ‖∂f‖ is below 1.

▶ WHAT TO TRY

Increase Depth L to 48 and watch the red (no-residual) curve crash below 10⁻¹⁰ (at ‖∂f‖ = 0.50, 0.5⁴⁸ ≈ 3.6 × 10⁻¹⁵).
The green (with-residual) curve stays near 1 regardless of depth, which is a key reason 100-layer transformers are trainable.
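To check the slider values by hand, the no-residual estimate ‖∂f‖^L can be evaluated directly. The helper names `plain_bound` and `depth_to_reach` below are hypothetical and only illustrate the arithmetic, assuming every sublayer has the same gain.

```python
import math

def plain_bound(sublayer_norm, depth):
    """No-residual gradient estimate: sublayer_norm ** depth."""
    return sublayer_norm ** depth

def depth_to_reach(sublayer_norm, threshold):
    """Smallest depth at which the estimate drops to or below threshold."""
    return math.ceil(math.log(threshold) / math.log(sublayer_norm))

for depth in (24, 48, 100):
    print(depth, f"{plain_bound(0.5, depth):.1e}")
# 24 6.0e-08
# 48 3.6e-15
# 100 7.9e-31

print(depth_to_reach(0.5, 1e-10))  # 34
```

At the default gain of 0.50, the red curve therefore passes 10⁻¹⁰ at about depth 34, well before the slider reaches 48.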