FFN: expand from d to d_ff, apply an activation, project back to d. Holds roughly 2/3 of a transformer layer's params (embeddings excluded).
What you're seeing
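Standard FFN: x → linear(d, d_ff) → activation → linear(d_ff, d). Params = 2·d·d_ff (ignoring biases), typically with d_ff = 4·d.
SwiGLU: 3 projections (gate, up, down), so Params = 3·d·d_ff. d_ff is reduced from 4·d to ~2.67·d (8/3·d) to match the standard FFN's param budget.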
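The budget matching above can be checked with a few lines of arithmetic. This is a minimal sketch assuming bias-free projections; the function names and the width d = 768 are illustrative choices, not taken from any library or model.

```python
def ffn_params(d: int, d_ff: int) -> int:
    """Weights in a standard FFN: up (d x d_ff) plus down (d_ff x d), biases ignored."""
    return 2 * d * d_ff


def swiglu_params(d: int, d_ff: int) -> int:
    """Weights in a SwiGLU FFN: gate and up (d x d_ff each) plus down (d_ff x d)."""
    return 3 * d * d_ff


d = 768                                 # illustrative model width (assumption)
standard = ffn_params(d, 4 * d)         # d_ff = 4 * d
gated = swiglu_params(d, 8 * d // 3)    # d_ff = 8/3 * d, about 2.67 * d

print(standard, gated)  # 4718592 4718592
```

Both layouts come to 8·d² weights, which is why the SwiGLU multiplier is 8/3 rather than 4.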
★ KEY TAKEAWAY
FFN expands hidden_dim (d) by ~4× (or ~2.67× for SwiGLU), applies an activation, and projects back. It holds about 2/3 of a transformer layer's params.
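The 2/3 figure follows from a simple per-layer count. The sketch below assumes attention uses four d × d projections (Q, K, V, output), ignores biases, layer norms, and embeddings, and uses a hypothetical helper name `ffn_fraction`.

```python
from fractions import Fraction


def ffn_fraction(d: int, multiplier: int = 4) -> Fraction:
    """Share of one layer's weights held by the FFN, under the stated assumptions."""
    attn = 4 * d * d                  # Q, K, V, and output projections
    ffn = 2 * d * (multiplier * d)    # up and down projections
    return Fraction(ffn, attn + ffn)


print(ffn_fraction(512))  # 2/3
```

With a 4× multiplier the FFN has 8·d² weights against 4·d² for attention, so the ratio is 8/12 regardless of d. Embedding tables lower the share for the model as a whole.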
▶ WHAT TO TRY
- Toggle between ReLU / GELU / SwiGLU to see how the activation curve changes.
- Increase the d_ff multiplier to see how FFN params scale linearly with d_ff.
- Note that SwiGLU adds a 3rd projection (gate); at a matched param budget it is reported to work slightly better empirically.
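For readers without the interactive view, the curves can be approximated in plain Python. This is a sketch, not the widget's code: it uses the exact (erf-based) GELU, and it treats SwiGLU per hidden unit as a SiLU-activated gate multiplied by the up-projection value, so SwiGLU has no single scalar curve of its own. All function names are illustrative.

```python
import math


def relu(x: float) -> float:
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x: float) -> float:
    """SiLU (Swish with beta = 1): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_unit(gate: float, up: float) -> float:
    """One hidden unit of SwiGLU: the SiLU-activated gate scales the up value."""
    return silu(gate) * up


# Tabulate the three scalar curves at a few sample points.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"{x:+.1f}  relu={relu(x):.3f}  gelu={gelu(x):.3f}  silu={silu(x):.3f}")
```

GELU and SiLU dip slightly below zero for negative inputs, whereas ReLU clamps to exactly zero.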