▶ Interactive Lab

Feed-Forward Layer (MLP)

Hidden dimension expansion + activation + projection back.

MLP: hidden → hidden*4 → activation → hidden. Roughly two-thirds of each block's parameters live here (about 8·hidden² in the MLP versus about 4·hidden² in attention).

What you're seeing

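The MLP block typically expands hidden_dim by 4× (or about 2.67×, i.e. 8/3, for SwiGLU, which has 3 projections instead of 2), applies an activation, and projects back to hidden_dim. The computation is per-token: each position is transformed independently, with no cross-token interaction.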
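The expand, activate, project-back pattern can be sketched in a few lines of NumPy. This is a minimal illustration, not the lab's actual implementation: the function names, the tiny sizes, the random weights, and the choice of the tanh approximation of GELU are all assumptions made for the example. The checks at the end show that the block treats every token independently.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumed choice; ReLU would also work here)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w_up, b_up, w_down, b_down):
    """x: (seq_len, hidden_dim) -> (seq_len, hidden_dim)."""
    h = gelu(x @ w_up + b_up)    # expand: hidden_dim -> mult * hidden_dim, then activate
    return h @ w_down + b_down   # project back: mult * hidden_dim -> hidden_dim

# Hypothetical toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
hidden_dim, mult, seq_len = 8, 4, 5
inner = hidden_dim * mult

w_up = rng.normal(scale=0.02, size=(hidden_dim, inner))
b_up = np.zeros(inner)
w_down = rng.normal(scale=0.02, size=(inner, hidden_dim))
b_down = np.zeros(hidden_dim)

x = rng.normal(size=(seq_len, hidden_dim))
y = mlp(x, w_up, b_up, w_down, b_down)

# Per-token computation: running each token on its own gives the same rows.
y_rows = np.stack([mlp(x[i:i + 1], w_up, b_up, w_down, b_down)[0]
                   for i in range(seq_len)])
assert np.allclose(y, y_rows)

# No cross-token interaction: changing token 0 leaves the other outputs unchanged.
x_changed = x.copy()
x_changed[0] += 1.0
y_changed = mlp(x_changed, w_up, b_up, w_down, b_down)
assert np.allclose(y[1:], y_changed[1:])
assert not np.allclose(y[0], y_changed[0])
```

Mixing information across positions is left to attention; the MLP only reshapes each token's own vector.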

SwiGLU (Llama, PaLM): combines the Swish (SiLU) activation with a gated linear unit, so one projection gates another elementwise. A subtle quality win that stuck; it has replaced ReLU/GELU in many modern LLMs.
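A rough NumPy sketch of a SwiGLU feed-forward layer follows. It assumes no bias terms and an inner width of exactly 8/3 × hidden_dim; real models often round that width, and the names `w_gate`, `w_up`, and `w_down` are illustrative rather than taken from any particular codebase. The parameter count shows why the smaller multiplier is used: three projections at 8/3× cost the same as two projections at 4×.

```python
import numpy as np

def silu(x):
    # Swish / SiLU: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """x: (seq_len, hidden_dim) -> (seq_len, hidden_dim), bias-free."""
    gate = silu(x @ w_gate)               # gating branch, passed through SiLU
    return (gate * (x @ w_up)) @ w_down   # elementwise gate, then project back

hidden_dim = 768                          # hypothetical size for the example
inner_plain = 4 * hidden_dim              # 4x expansion, two projections
inner_swiglu = 8 * hidden_dim // 3        # ~2.67x expansion, three projections

# Weight counts, ignoring biases.
params_plain = 2 * hidden_dim * inner_plain
params_swiglu = 3 * hidden_dim * inner_swiglu
assert params_plain == params_swiglu      # same parameter budget

rng = np.random.default_rng(0)
w_gate = rng.normal(scale=0.02, size=(hidden_dim, inner_swiglu))
w_up = rng.normal(scale=0.02, size=(hidden_dim, inner_swiglu))
w_down = rng.normal(scale=0.02, size=(inner_swiglu, hidden_dim))

x = rng.normal(size=(4, hidden_dim))
y = swiglu_mlp(x, w_gate, w_up, w_down)
assert y.shape == x.shape
```

Unlike a plain activation, the gate depends on the input through its own weight matrix, so the layer learns which inner channels to pass for each token.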

★ KEY TAKEAWAY
FFN expands by 4× (or 2.67× with SwiGLU), applies an activation, and projects back. It holds roughly 2/3 of the parameters in each transformer block.
▶ WHAT TO TRY
  • Switch activations: see SwiGLU's distinctive gate.
  • Slide Hidden mult: a bigger multiplier means more capacity and more compute.