Backpropagation is reverse-mode automatic differentiation. For a transformer, gradients flow back from the loss through softmax, FFN, attention, embeddings. The chain rule mechanizes this; PyTorch's autograd does it automatically. Knowing the structure helps debug exploding/vanishing gradients.

Advertisement

The chain rule, vector form

If y = f(x), and L is a scalar loss:
  ∂L/∂x = ∂L/∂y · ∂y/∂x        (Jacobian product)

For matrix operations, this becomes Jacobian-vector products. We never materialize huge Jacobians; we compute the product directly using the structure of each operation. Each layer defines its own backward pass.

Backprop through matmul

Forward:   Y = X · W
Backward:
  dL/dX = dL/dY · Wᵀ
  dL/dW = Xᵀ · dL/dY

Two more matmuls in the backward pass — one of the reasons training is 2-3× slower than inference for the same model. Each parameter's gradient is computed once per backward pass; an optimizer step then updates it.

Advertisement

Backprop through residual connection

Forward:   y = x + sublayer(x)
Backward:  dL/dx = dL/dy + dL/d(sublayer) · ∂sublayer/∂x

Residual connections add the input directly to the output. Their gradient passes through unchanged (the '1' from the identity branch) plus the sublayer's gradient. This is why deep networks with residuals don't have vanishing gradients — there's always a direct path back.

Backprop through layer norm

LayerNorm subtracts mean, divides by std, then applies learnable scale/shift. Its gradient involves the mean and variance of the input batch — slightly involved but standard. The key property: gradients are well-conditioned because LN normalizes activations layer-by-layer.

Backprop through attention

Softmax · V is straightforward. The dot-product scores need the gradient of softmax (from earlier article) and a transpose. The K and V matrices get gradients from each Q position they attended to. PyTorch's scaled_dot_product_attention implements this efficiently — don't roll your own.

Chain rule + a few standard backward rules = autograd. Residuals keep gradients flowing. Norm keeps them well-scaled.