Residual Connections and Gradient Flow

Residual (skip) connections add each sub-block's input directly to its output. Without them, deep transformers are very hard to train, because gradients tend to vanish before they reach the early layers. A little calculus shows why.

The pattern in transformer blocks

x_out = x_in + sublayer(x_in)

# in a transformer block (pre-norm):
x = x + Attention(Norm(x))
x = x + FFN(Norm(x))

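Each block has two residuals: one around attention, one around the FFN. Pre-norm applies the norm inside the residual branch, so the skip path carries x through unchanged. Post-norm applies it after the addition, x = Norm(x + sublayer(x)), which puts a norm on the skip path too. Pre-norm is generally easier to train for deep networks and is the usual default in modern models.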
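
The difference between the two placements can be sketched in a few lines of plain Python. This is a minimal illustration, not a real transformer block: `layer_norm` omits the learned scale and shift, and `toy_sublayer` is a hypothetical stand-in for attention or an FFN.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_residual(x, sublayer):
    # x + sublayer(Norm(x)): the skip path carries x through untouched.
    out = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, out)]

def post_norm_residual(x, sublayer):
    # Norm(x + sublayer(x)): the norm also sits on the skip path.
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)])

def toy_sublayer(h):
    # Hypothetical stand-in for attention or an FFN: scales its input by 0.5.
    return [0.5 * v for v in h]

x = [1.0, 2.0, 3.0, 4.0]
pre = pre_norm_residual(x, toy_sublayer)
post = post_norm_residual(x, toy_sublayer)
```

With a sublayer that outputs zeros, `pre_norm_residual` returns x exactly, while `post_norm_residual` still renormalizes it. That is the sense in which only pre-norm keeps a clean identity path.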

Gradient flow through residual

y = x + f(x)
∂y/∂x = I + ∂f/∂x

∂L/∂x = ∂L/∂y · (I + ∂f/∂x)
      = ∂L/∂y + ∂L/∂y · ∂f/∂x

The gradient gets a direct path back (the I from the identity branch) plus an indirect path through the sublayer. If the sublayer's Jacobian ∂f/∂x is small, the direct path still delivers ∂L/∂y unchanged, so the gradient does not vanish.
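
A quick numerical check of ∂y/∂x = I + ∂f/∂x in the scalar case, where I is just 1. The function `f` below is a hypothetical sublayer chosen to have a deliberately tiny gradient. The derivative is estimated with central finite differences rather than an autodiff library.

```python
import math

def f(x):
    # Hypothetical sublayer with a deliberately tiny gradient.
    return 0.01 * math.tanh(x)

def numeric_grad(fn, x, h=1e-6):
    # Central finite-difference estimate of d(fn)/dx at x.
    return (fn(x + h) - fn(x - h)) / (2 * h)

x = 0.7
grad_plain = numeric_grad(f, x)                      # df/dx only
grad_residual = numeric_grad(lambda v: v + f(v), x)  # 1 + df/dx

print(f"sublayer alone: {grad_plain:.5f}")
print(f"with residual:  {grad_residual:.5f}")
```

The sublayer alone passes back well under 1% of the upstream gradient. With the residual, the factor stays just above 1.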

Depth-wise stacking math

After L layers:
  ∂L/∂x_0 = ∂L/∂x_L · (I + ∂f_L/∂x_(L-1)) · ... · (I + ∂f_1/∂x_0)

Expanding the product gives 2^L terms: one is I (the pure identity path), and every other term is a product that includes at least one ∂f_k. The pure-identity term contributes ∂L/∂x_L to ∂L/∂x_0 directly, so some gradient signal reaches the early layers regardless of depth.
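
The expansion can be checked by brute force in the scalar case, where each Jacobian is a single number a_k and the product is ∏ (1 + a_k). The values in `a` are hypothetical. Real Jacobians are matrices and do not commute, so this sketch only illustrates the counting of paths, not the ordering of factors.

```python
from itertools import combinations
from math import prod

a = [0.2, -0.1, 0.05]  # hypothetical scalar stand-ins for the per-layer df_k/dx_(k-1)

closed_form = prod(1 + ak for ak in a)

# Expand into 2^L terms, one per subset of layers whose sublayer path is taken.
terms = {}
for r in range(len(a) + 1):
    for subset in combinations(range(len(a)), r):
        terms[subset] = prod(a[i] for i in subset)  # empty product is 1

expanded = sum(terms.values())
identity_term = terms[()]  # the path that skips every sublayer
```

Here `terms` has 2^3 = 8 entries, their sum matches `closed_form`, and `identity_term` is 1 no matter how small the a_k are.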

Without residuals: vanishing gradients

Without skip connections, ∂L/∂x_0 = ∂L/∂x_L · ∏ ∂f_k/∂x_(k-1). If each Jacobian shrinks vectors (largest singular value below 1), the product shrinks exponentially with depth. For example, a per-layer factor of 0.4 over 24 layers gives 0.4^24 ≈ 3×10^-10, and training stalls. ResNet (2015) showed that skip connections make very deep networks trainable; transformers inherited the fix.
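
A toy scalar comparison of the two products over 24 layers. The per-layer derivatives are hypothetical, drawn at random with magnitude at most 0.4. This sketches the scaling argument only and does not simulate a real network.

```python
import random
from math import prod

rng = random.Random(0)  # fixed seed so the run is repeatable
depth = 24
# Hypothetical per-layer sublayer derivatives, each with magnitude at most 0.4.
a = [rng.uniform(-0.4, 0.4) for _ in range(depth)]

plain = prod(a)                      # no skip: product of a_k
residual = prod(1 + ak for ak in a)  # with skip: product of (1 + a_k)

print(f"plain:    {abs(plain):.3e}")
print(f"residual: {abs(residual):.3e}")
```

Every factor in the plain product has magnitude at most 0.4, so the result cannot exceed 0.4^24. Every factor in the residual product lies between 0.6 and 1.4, so the result stays many orders of magnitude larger.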

Storage layout — residual addition is cheap

A residual add is one element-wise sum: O(N·d) per layer for N tokens of width d. That is trivial next to the matmuls, which cost O(N·d²). On a CPU the add is memory-bandwidth-bound: stream the two inputs through cache and write the sum. It also fuses naturally with the next op through kernel fusion (e.g., in vLLM).
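
A back-of-the-envelope cost model makes the gap concrete. The sizes below and the 2-byte element width are illustrative assumptions, not figures from any specific model. The byte count assumes an unfused add that reads both inputs from memory and writes the result.

```python
def residual_add_cost(n_tokens, d_model, bytes_per_elem=2):
    """FLOPs and bytes moved for out = x + y on an (n_tokens, d_model) tensor."""
    elems = n_tokens * d_model
    flops = elems                             # one add per element
    bytes_moved = 3 * elems * bytes_per_elem  # read x, read y, write out
    return flops, bytes_moved

def projection_flops(n_tokens, d_model):
    """FLOPs for one (n_tokens, d_model) @ (d_model, d_model) matmul."""
    return 2 * n_tokens * d_model * d_model   # one multiply and one add per term

N, d = 2048, 4096  # illustrative sizes, not taken from any specific model
add_flops, add_bytes = residual_add_cost(N, d)
ratio = projection_flops(N, d) // add_flops   # works out to 2 * d
intensity = add_flops / add_bytes             # FLOPs per byte moved
```

Under these assumptions a single d×d projection costs 2·d times the FLOPs of the add. The add performs only 1/6 of a FLOP per byte moved, which is why memory traffic rather than arithmetic sets its cost.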

Residual = identity + sublayer. The gradient gets a direct path back, which counters the vanishing-gradient problem. Modern blocks are pre-norm + residual.