Each transformer block has an attention sub-block and an MLP (feed-forward) sub-block. The MLP is two linear layers with a nonlinearity between them. Despite its simplicity, it holds the majority of the model's parameters and accounts for much of the compute.
Standard MLP
FFN(x) = W_2 · activation(W_1 · x + b_1) + b_2
W_1 ∈ ℝ^(d × d_ff)
W_2 ∈ ℝ^(d_ff × d)
d_ff usually = 4 * d
Project to a hidden dimension d_ff (typically 4 × d, where d = d_model), apply the nonlinearity, and project back: two matmuls plus an activation. The size of the hidden dimension determines how expressive the per-token feature mixing can be.
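The project-up, activate, project-down sequence can be sketched in a few lines. This is a minimal illustration, not a production implementation: it uses plain Python lists so it runs without dependencies, picks ReLU as the activation, and treats x as a row vector (computing x · W_1) so that the shapes d × d_ff and d_ff × d match the ones listed above. The helper names `vecmat` and `rand_matrix`, the tiny sizes, and the random weights are all assumptions made for the example.

```python
import random

def vecmat(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns) -> length m."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def ffn(x, W_1, b_1, W_2, b_2):
    """Project up to d_ff, apply ReLU, project back down to d."""
    hidden = [h + b for h, b in zip(vecmat(x, W_1), b_1)]     # length d_ff
    hidden = [max(0.0, h) for h in hidden]                     # ReLU
    return [o + b for o, b in zip(vecmat(hidden, W_2), b_2)]   # length d

def rand_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical helper: small random weights, for illustration only."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)        # fixed seed so the run is repeatable
d = 4
d_ff = 4 * d                  # the usual 4x expansion
W_1, b_1 = rand_matrix(d, d_ff, rng), [0.0] * d_ff   # d x d_ff
W_2, b_2 = rand_matrix(d_ff, d, rng), [0.0] * d      # d_ff x d

x = [rng.gauss(0.0, 1.0) for _ in range(d)]
y = ffn(x, W_1, b_1, W_2, b_2)
print(len(x), len(y))         # input and output both have length d
```

A real model would do the same thing with a tensor library, where each `vecmat` call becomes a single matrix multiply.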
Activation choices
ReLU(x) = max(0, x)
GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715*x³)))
SwiGLU(x, gate) = Swish(gate) * x where Swish(x) = x * sigmoid(x)
ReLU: the original choice. GELU: smoother, used in GPT-2, BERT, and GPT-3. SwiGLU: a gated activation and the common choice in recent models, used in Llama, Phi, and Mistral. Empirically, SwiGLU gives roughly 1% better perplexity at the same compute.
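The three formulas above translate directly into scalar Python functions. The sketch below is only meant to make the shapes of the curves concrete. The function names are chosen for the example, and the sigmoid is split into two branches purely to avoid overflow for large negative inputs.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu_tanh(x):
    """Tanh approximation of GELU, as written in the formula above."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    """Logistic function, computed in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swish(x):
    """Swish (also known as SiLU): x * sigmoid(x)."""
    return x * sigmoid(x)

def swiglu(x, gate):
    """Gated activation: the gate path goes through Swish, then scales x."""
    return swish(gate) * x

for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{v:+.1f}  relu={relu(v):.4f}  gelu={gelu_tanh(v):.4f}  swish={swish(v):.4f}")
```

Unlike ReLU, both GELU and Swish dip slightly below zero for small negative inputs instead of clamping them to exactly zero.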
SwiGLU full formula
FFN_SwiGLU(x) = W_2 · (Swish(W_gate · x) * (W_up · x))
Three linear layers instead of two:
W_gate ∈ ℝ^(d × d_ff)
W_up ∈ ℝ^(d × d_ff)
W_2 ∈ ℝ^(d_ff × d)
SwiGLU has three projections (gate, up, down). To match the parameter budget of the standard FFN, d_ff is reduced to about 2.67 × d_model (that is, 8/3 × d). The extra projection costs a little efficiency in exchange for a modest quality gain.
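As a rough sketch of the gated variant, the code below runs one vector through the three projections. It assumes the same row-vector convention as the shapes listed above (x · W_gate, x · W_up, then hidden · W_2), omits biases as the formula does, and uses plain Python lists with hypothetical helper names (`vecmat`, `rand_matrix`) and random weights. The size d = 6 is chosen only because 8/3 × 6 is a whole number.

```python
import math
import random

def vecmat(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns) -> length m."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swish(x):
    """x * sigmoid(x), with the sigmoid computed in an overflow-safe way."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

def ffn_swiglu(x, W_gate, W_up, W_2):
    """Gate and up projections in parallel, elementwise product, then down projection."""
    gate = [swish(g) for g in vecmat(x, W_gate)]   # length d_ff
    up = vecmat(x, W_up)                            # length d_ff
    hidden = [g * u for g, u in zip(gate, up)]      # elementwise gating
    return vecmat(hidden, W_2)                      # back to length d

def rand_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical helper: small random weights, for illustration only."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d = 6
d_ff = (8 * d) // 3            # 16, i.e. about 2.67 * d
W_gate = rand_matrix(d, d_ff, rng)
W_up = rand_matrix(d, d_ff, rng)
W_2 = rand_matrix(d_ff, d, rng)

x = [rng.gauss(0.0, 1.0) for _ in range(d)]
y = ffn_swiglu(x, W_gate, W_up, W_2)
print(d_ff, len(y))            # hidden width and output length
```

With d_ff set to 8/3 × d, the three matrices together hold 3 × d × d_ff = 8 × d² weights, the same as a standard FFN with d_ff = 4d.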
Parameter count
Standard FFN: 2 * d * d_ff = 8 * d² (with d_ff = 4d)
SwiGLU: 3 * d * d_ff ≈ 8 * d² (with d_ff ≈ 2.67d)
For d = 2048: about 33.6M parameters per FFN block. With L = 24 layers: about 805M parameters for the FFNs alone. The FFN dominates total parameters in most transformers. It is also often the slowest part of the block at inference, because at small batch sizes the time is spent reading the weights from memory (a memory-bandwidth bound).
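The arithmetic behind these figures is easy to check. The short sketch below counts weights only (biases are ignored, as in the formulas above), and rounding d_ff to the nearest integer for the SwiGLU case is an assumption made for the example; real models usually round to a hardware-friendly multiple instead.

```python
def ffn_params(d, d_ff, n_proj):
    """Weight count for an FFN with n_proj projections of shape d x d_ff (no biases)."""
    return n_proj * d * d_ff

d, n_layers = 2048, 24

standard = ffn_params(d, 4 * d, n_proj=2)                    # 2 * d * 4d = 8 * d^2
swiglu_params = ffn_params(d, round(8 * d / 3), n_proj=3)    # 3 * d * (8/3)d, about 8 * d^2

print(standard)              # 33554432
print(swiglu_params)         # 33552384
print(standard * n_layers)   # 805306368
```

The two per-block counts differ by only a few thousand weights, which is the sense in which the reduced d_ff keeps the parameter budget matched.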
Per-token, per-position
The FFN operates token by token: each token's vector goes through the same FFN independently, with no cross-token mixing. This makes it easy to parallelize across the sequence dimension. In practice, the token vectors are stacked into a matrix and the whole sequence goes through the FFN as one batched matmul.
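The independence claim can be checked directly: changing one token's input should leave every other token's output untouched, and reordering the tokens should simply reorder the outputs. The sketch below is illustrative only, with a bias-free ReLU FFN, plain Python lists, hypothetical helper names, and random weights and inputs.

```python
import random

def vecmat(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns) -> length m."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def ffn(x, W_1, W_2):
    """Bias-free two-layer FFN with ReLU, applied to a single token vector."""
    hidden = [max(0.0, h) for h in vecmat(x, W_1)]
    return vecmat(hidden, W_2)

def ffn_sequence(X, W_1, W_2):
    """Apply the same FFN to every token; no token sees any other token."""
    return [ffn(x, W_1, W_2) for x in X]

rng = random.Random(0)
d, d_ff, seq_len = 4, 16, 5
W_1 = [[rng.gauss(0.0, 0.1) for _ in range(d_ff)] for _ in range(d)]
W_2 = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(d_ff)]
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(seq_len)]

Y = ffn_sequence(X, W_1, W_2)

# Perturb only token 2 and recompute the whole sequence.
X_mod = [row[:] for row in X]
X_mod[2] = [v + 1.0 for v in X_mod[2]]
Y_mod = ffn_sequence(X_mod, W_1, W_2)

others_unchanged = all(Y_mod[t] == Y[t] for t in range(seq_len) if t != 2)
order_follows_input = ffn_sequence(X[::-1], W_1, W_2) == Y[::-1]
print(others_unchanged, order_follows_input)   # True True
```

Because each row is processed with the same weights and no shared state, the Python loop in `ffn_sequence` is equivalent to one matrix product over the stacked tokens, which is how tensor libraries compute it.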