Residual (skip) connections add each sub-block's input directly to its output. Without them, transformers deeper than a handful of layers are very hard to train. A little calculus shows why.
The pattern in transformer blocks
x_out = x_in + sublayer(x_in)
# in a transformer block (pre-norm):
x = x + Attention(Norm(x))
x = x + FFN(Norm(x))
Each block has two residuals: one around attention, one around the FFN. Pre-norm applies the norm inside the residual branch, so the skip path itself is never normalized. Post-norm applies it after the addition, x = Norm(x + sublayer(x)). Pre-norm is much easier to train in deep networks and is the modern default.
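The difference between the two placements can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not a real transformer: `layer_norm` has no learned scale or shift, the input is a single token vector, and `zero_sublayer` is a hypothetical stand-in for attention and the FFN that outputs all zeros. With zero sublayers, the pre-norm block is exactly the identity, while the post-norm block still rewrites the skip path.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise residual addition."""
    return [u + v for u, v in zip(a, b)]

def pre_norm_block(x, attention, ffn):
    # Norm sits inside each residual branch; the skip path is untouched.
    x = add(x, attention(layer_norm(x)))
    x = add(x, ffn(layer_norm(x)))
    return x

def post_norm_block(x, attention, ffn):
    # Norm is applied after each addition, so it also rescales the skip path.
    x = layer_norm(add(x, attention(x)))
    x = layer_norm(add(x, ffn(x)))
    return x

def zero_sublayer(v):
    """Hypothetical sublayer that contributes nothing (for illustration only)."""
    return [0.0] * len(v)

x = [1.0, 2.0, 4.0]
pre_out = pre_norm_block(x, zero_sublayer, zero_sublayer)
post_out = post_norm_block(x, zero_sublayer, zero_sublayer)
print(pre_out)   # identical to x: the identity path is exact
print(post_out)  # normalized, so no longer equal to x
```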
Gradient flow through residual
y = x + f(x)
∂y/∂x = I + ∂f/∂x
∂L/∂x = ∂L/∂y · (I + ∂f/∂x)
      = ∂L/∂y + ∂L/∂y · ∂f/∂x
The gradient has a direct path back (the I from the identity branch) plus an indirect path through the sublayer. If the sublayer's Jacobian is small, the direct path still carries ∂L/∂y through unchanged, so the gradient does not vanish.
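A scalar version of this identity is easy to check numerically. The sketch below assumes a made-up sublayer f(x) = 0.01·tanh(x), chosen only because its derivative is tiny, and compares the analytic derivative 1 + f'(x) with a central finite difference.

```python
import math

def f(x):
    """Hypothetical sublayer with a deliberately small derivative."""
    return 0.01 * math.tanh(x)

def df(x):
    """Analytic derivative of f."""
    return 0.01 * (1.0 - math.tanh(x) ** 2)

def residual(x):
    """y = x + f(x)"""
    return x + f(x)

def central_diff(fn, x, h=1e-6):
    """Central finite-difference estimate of fn'(x)."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)

x0 = 0.7
analytic = 1.0 + df(x0)                 # the "I + df/dx" formula, in 1-D
numeric = central_diff(residual, x0)    # derivative through the residual
sublayer_only = central_diff(f, x0)     # derivative through f alone

print(analytic, numeric, sublayer_only)
```

The sublayer alone passes back less than 1% of the upstream gradient, while the residual version passes back slightly more than 100% of it.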
Depth-wise stacking math
After L layers (the subscript L in x_L counts layers; the L in ∂L is the loss):
∂L/∂x_0 = ∂L/∂x_L · (I + ∂f_L/∂x_(L-1)) · ... · (I + ∂f_1/∂x_0)
Expanding the product gives 2^L terms: one is the pure identity path I, and every other term is a product containing at least one ∂f_k. The pure-identity term passes ∂L/∂x_L to ∂L/∂x_0 unchanged, so a gradient signal reaches the earliest layers at any depth.
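The expansion can be verified in the scalar case, where each Jacobian is just a number. The `slopes` values below are arbitrary stand-ins for ∂f_k/∂x_(k-1), not measurements from any model; with real matrices the factors do not commute, but the term count and the lone identity term are the same.

```python
import math
from itertools import combinations

# Hypothetical scalar stand-ins for the per-layer Jacobians.
slopes = [0.3, -0.2, 0.05, 0.1]

# Left-hand side: the product of (1 + slope) over all layers.
direct = math.prod(1.0 + a for a in slopes)

# Right-hand side: one term per subset of layers whose sublayer path is taken.
terms = []
for r in range(len(slopes) + 1):
    for subset in combinations(slopes, r):
        terms.append(math.prod(subset))  # the empty subset gives the identity term, 1

expanded = sum(terms)
print(len(terms))        # 2**4 = 16 paths
print(terms[0])          # the pure-identity path
print(direct, expanded)  # agree up to rounding
```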
Without residuals: vanishing gradients
Without skip connections, ∂L/∂x_0 = ∂L/∂x_L · ∏ ∂f_k/∂x_(k-1). When each Jacobian's spectral norm is below 1, which is typical, the product shrinks exponentially with depth. For instance, a per-layer factor of 0.4 gives 0.4^24 ≈ 3·10^-10 after 24 layers, and training stalls. ResNet (2015) showed that skip connections make very deep networks trainable; transformers inherited the fix.
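A back-of-the-envelope check of the depth effect, again with scalars. The per-layer slope of 0.4 is an assumption picked so that 24 layers land near the 10^-10 scale; real Jacobians are matrices and vary by layer.

```python
depth = 24
slope = 0.4  # assumed per-layer gradient scaling, for illustration only

# Plain stack: the gradient is multiplied by the slope at every layer.
plain_chain = slope ** depth

# Residual stack: each layer multiplies by (1 + slope) instead.
residual_chain = (1.0 + slope) ** depth

print(f"plain:    {plain_chain:.2e}")   # about 2.8e-10
print(f"residual: {residual_chain:.2e}")
```

In this toy the residual product actually grows, because every slope is positive. The only point here is that it does not collapse toward zero the way the plain product does.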
Storage layout — residual addition is cheap
A residual add is one element-wise sum: O(N·d) per add, for N tokens of width d. That is trivial next to the matmuls at O(N·d²). On CPU it is memory-bandwidth-bound: it streams the two inputs through cache and writes the sum. It also fuses naturally with the next op via kernel fusion (e.g., in vLLM).
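To put rough numbers on the gap, the sketch below counts operations for one residual add and one d×d projection. The sizes (2048 tokens, width 4096) are illustrative assumptions, and a multiply-add is counted as 2 FLOPs.

```python
def residual_add_flops(n_tokens, d_model):
    """One addition per element of an N x d activation."""
    return n_tokens * d_model

def projection_matmul_flops(n_tokens, d_model):
    """(N x d) @ (d x d): N*d*d multiply-adds, counted as 2 FLOPs each."""
    return 2 * n_tokens * d_model * d_model

n_tokens, d_model = 2048, 4096  # assumed sizes, for illustration only
add_cost = residual_add_flops(n_tokens, d_model)
matmul_cost = projection_matmul_flops(n_tokens, d_model)
ratio = matmul_cost // add_cost

print(ratio)  # 2 * d_model = 8192
```

A single projection costs 2·d times as much arithmetic as the add, and a block contains several such projections, so the add's cost is dominated by memory traffic rather than compute.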