Feed-Forward Layer (MLP) — Belgavi.AI Lab

Controls: Activation, Hidden mult (4×)

MLP: hidden → hidden*4 → activation → hidden. Roughly two-thirds of a transformer block's parameters live here.

What you're seeing

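The MLP block typically expands hidden_dim by 4× (or about 2.67×, i.e. 8/3, for SwiGLU, which has 3 projections instead of 2 and uses the smaller width to keep the parameter count the same), applies an activation, and projects back to hidden_dim. The computation is per-token: each position is processed independently, with no cross-token interaction.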
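The expand, activate, project-back pattern can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular model's implementation: the helper names (`matvec`, `random_matrix`, `mlp_token`, `mlp_sequence`), the tiny sizes, the tanh approximation of GELU, and the omission of biases are all assumptions made for the example.

```python
import math
import random

def gelu(x):
    # tanh approximation of GELU (assumed here; exact GELU uses erf)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def random_matrix(rows, cols, rng):
    # small Gaussian init, purely illustrative
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def mlp_token(x, W_up, W_down):
    h = matvec(W_up, x)           # hidden -> hidden * mult
    h = [gelu(v) for v in h]      # elementwise activation
    return matvec(W_down, h)      # hidden * mult -> hidden

def mlp_sequence(tokens, W_up, W_down):
    # each token goes through the same weights independently
    return [mlp_token(t, W_up, W_down) for t in tokens]

rng = random.Random(0)
hidden, mult = 8, 4               # hypothetical toy sizes
W_up = random_matrix(hidden * mult, hidden, rng)
W_down = random_matrix(hidden, hidden * mult, rng)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(3)]

out = mlp_sequence(tokens, W_up, W_down)

# Perturb only the middle token: the other outputs are unaffected,
# which is what "no cross-token interaction" means in practice.
perturbed = [tokens[0], [v + 1.0 for v in tokens[1]], tokens[2]]
out_perturbed = mlp_sequence(perturbed, W_up, W_down)
unchanged = out_perturbed[0] == out[0] and out_perturbed[2] == out[2]
```

Here `unchanged` compares the first and last token outputs before and after the perturbation; mixing information across positions is left to attention.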

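SwiGLU (Llama, PaLM) combines the Swish activation with a gated linear unit: one projection passes through Swish and multiplies a second projection elementwise before the down-projection. It is a subtle quality win that stuck, and it replaces ReLU/GELU in many modern LLMs.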
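A rough sketch of the gated variant for a single token follows. The names `W_gate`, `W_up`, `W_down`, and `swiglu_token` are hypothetical, biases are omitted, and the identity matrices are chosen only so the arithmetic is easy to follow; Swish with beta = 1 (also called SiLU) is assumed.

```python
import math

def silu(x):
    # Swish with beta = 1: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_token(x, W_gate, W_up, W_down):
    gate = [silu(v) for v in matvec(W_gate, x)]  # activated gate branch
    up = matvec(W_up, x)                         # linear "value" branch
    h = [g * u for g, u in zip(gate, up)]        # elementwise gating
    return matvec(W_down, h)                     # project back to hidden

# Toy 2-dimensional example with identity weights (illustrative only)
identity = [[1.0, 0.0], [0.0, 1.0]]
y = swiglu_token([0.0, 1.0], identity, identity, identity)
```

With identity weights each output is just `silu(x_i) * x_i`, which makes the role of the gate visible: an input of 0 is zeroed out, while larger positive inputs pass through almost unchanged.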

★ KEY TAKEAWAY

The FFN expands by 4× (or 2.67× with SwiGLU), applies an activation, and projects back. It holds about 2/3 of the parameters in each transformer block (embeddings excluded).
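The 2/3 figure and the 2.67× multiplier both fall out of simple counting. The sketch below assumes a standard block whose attention uses four hidden × hidden projections (Q, K, V, output) and ignores biases, norms, and embeddings; the function name `block_param_counts` and the hidden size 768 are illustrative choices.

```python
def block_param_counts(hidden, mult, n_mlp_proj):
    # inner width of the MLP after expansion
    inner = round(hidden * mult)
    # each MLP projection is a hidden x inner (or inner x hidden) matrix
    mlp = n_mlp_proj * hidden * inner
    # attention: Q, K, V and output projections, each hidden x hidden (assumed)
    attn = 4 * hidden * hidden
    return mlp, attn

# Standard MLP: 2 projections at 4x expansion
mlp_std, attn = block_param_counts(768, 4.0, 2)

# SwiGLU: 3 projections at 8/3 (about 2.67x) expansion
mlp_swiglu, _ = block_param_counts(768, 8 / 3, 3)

# Share of block parameters held by the MLP
fraction = mlp_std / (mlp_std + attn)
```

Under these assumptions both variants come to 8 × hidden² MLP parameters against 4 × hidden² for attention, so `mlp_std` equals `mlp_swiglu` and `fraction` works out to 2/3.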

▶ WHAT TO TRY

Switch activations: see SwiGLU's distinctive gate.
Slide Hidden mult: a larger multiplier means more capacity and more compute.