FFN Expansion + Activation



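FFN: expand each token's hidden vector from d to d_ff, apply an activation, project back to d. With d_ff = 4d, the FFN holds ~2/3 of a standard transformer block's weights (attention projections hold the rest; embeddings not counted).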
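The 2/3 figure can be checked with a quick count. This is a minimal sketch that assumes one block with four d×d attention projections (Q, K, V, output) and a two-matrix FFN, ignoring biases, layer norms, and embeddings. The function name `ffn_param_share` is made up for illustration.

```python
def ffn_param_share(d: int, multiplier: float = 4.0) -> float:
    """Fraction of one block's weight-matrix parameters that sit in the FFN."""
    d_ff = int(multiplier * d)
    attn_params = 4 * d * d      # Q, K, V, output projections (assumed d x d each)
    ffn_params = 2 * d * d_ff    # up-projection and down-projection
    return ffn_params / (attn_params + ffn_params)

share = ffn_param_share(d=512)
print(f"{share:.3f}")  # 0.667
```

Under these assumptions the share is 8d² / 12d², so it does not depend on d. A different multiplier or attention layout changes the ratio.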

Standard FFN: x → linear(d, d_ff) → activation → linear(d_ff, d). Params = 2·d·d_ff (ignoring biases).
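The pipeline above can be sketched in plain Python for a single token vector. The activation here is the tanh approximation of GELU, which is an assumption since the text does not name one. The helpers `linear`, `init`, and `ffn` are illustrative, not taken from any library.

```python
import math
import random

def gelu(v: float) -> float:
    # tanh approximation of GELU (assumed activation; any elementwise one works)
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def linear(x, w):
    """x: vector of length n_in; w: n_in rows by n_out columns, no bias."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def init(n_in, n_out, rng):
    # small random weights, only for demonstration
    return [[rng.gauss(0.0, 0.02) for _ in range(n_out)] for _ in range(n_in)]

def ffn(x, w_up, w_down):
    h = linear(x, w_up)           # expand: d -> d_ff
    h = [gelu(v) for v in h]      # elementwise activation
    return linear(h, w_down)      # project back: d_ff -> d

rng = random.Random(0)
d, d_ff = 8, 32                   # 4x expansion
w_up, w_down = init(d, d_ff, rng), init(d_ff, d, rng)
y = ffn([rng.gauss(0.0, 1.0) for _ in range(d)], w_up, w_down)
n_params = sum(len(r) for r in w_up) + sum(len(r) for r in w_down)
print(len(y), n_params)  # 8 512
```

The output has the same width as the input, and the weight count equals 2·d·d_ff = 2·8·32 = 512.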

SwiGLU: 3 projections (gate, up, down), so params = 3·d·d_ff. d_ff is reduced from 4d to ~2.67d (8/3·d) so that 3·d·d_ff matches the standard FFN's 2·d·4d budget.
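A rough sketch of the gated variant and the budget matching, assuming the common formulation down(SiLU(gate(x)) * up(x)) with no biases. The names `matched_d_ff` and `swiglu_ffn` are hypothetical, and real models often round d_ff to a hardware-friendly multiple, which is not done here.

```python
import math
import random

def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))   # SiLU / swish: v * sigmoid(v)

def linear(x, w):
    """x: vector of length n_in; w: n_in rows by n_out columns, no bias."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def matched_d_ff(d: int, standard_multiplier: float = 4.0) -> int:
    # Solve 3*d*d_ff = 2*d*(standard_multiplier*d) for d_ff.
    return round(2 * standard_multiplier * d / 3)

def swiglu_ffn(x, w_gate, w_up, w_down):
    gate = [silu(v) for v in linear(x, w_gate)]    # d -> d_ff, activated
    up = linear(x, w_up)                           # d -> d_ff, linear
    hidden = [g * u for g, u in zip(gate, up)]     # elementwise gating
    return linear(hidden, w_down)                  # d_ff -> d

rng = random.Random(0)
d = 6
d_ff = matched_d_ff(d)                             # 16, i.e. 8/3 * 6
def init(n_in, n_out):
    return [[rng.gauss(0.0, 0.02) for _ in range(n_out)] for _ in range(n_in)]
w_gate, w_up, w_down = init(d, d_ff), init(d, d_ff), init(d_ff, d)
out = swiglu_ffn([rng.gauss(0.0, 1.0) for _ in range(d)], w_gate, w_up, w_down)
print(d_ff, f"{d_ff / d:.2f}", 3 * d * d_ff, 2 * d * 4 * d)  # 16 2.67 288 288
```

When d is not divisible by 3, the rounded d_ff makes the two budgets only approximately equal.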

★ KEY TAKEAWAY

The FFN expands hidden_dim by ~4× (or ~2.67× for SwiGLU), applies an activation, and projects back. It holds about 2/3 of a standard transformer block's parameters.

▶ WHAT TO TRY