▶ Interactive Lab

Multi-Head Split + Concat

One big projection is reshaped into h heads, then concatenated back.

d_model = h · d_k. Splitting into heads is only a reshape, so it adds no parameters or compute.

What you're seeing

A single linear projection W_Q maps each of the N tokens to d_model features. Reshaping the result to (N, h, d_k) treats those features as h heads of d_k features each.
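To make the "one projection, many heads" idea concrete, here is a minimal NumPy sketch. It checks that one big projection followed by a reshape gives the same numbers as h separate per-head projections built from column slices of W_Q. The sizes (N = 4, d_model = 12, h = 3) and the random weights are illustrative assumptions, not values from the lab.

```python
import numpy as np

# Illustrative sizes (assumed, not taken from the lab).
rng = np.random.default_rng(0)
N, d_model, h = 4, 12, 3
d_k = d_model // h  # d_model = h * d_k

x = rng.standard_normal((N, d_model))          # N token vectors
w_q = rng.standard_normal((d_model, d_model))  # one big projection matrix

# One big projection, then view the d_model features as h heads of d_k.
q_big = (x @ w_q).reshape(N, h, d_k)

# Equivalent: h separate projections, each using a (d_model, d_k) column slice.
q_per_head = np.stack(
    [x @ w_q[:, j * d_k:(j + 1) * d_k] for j in range(h)],
    axis=1,
)  # shape (N, h, d_k)

same = np.allclose(q_big, q_per_head)
print(same)

# Parameter count is identical either way: h * (d_model * d_k) == d_model**2.
params_per_head_total = h * d_model * d_k
```

Head j simply owns columns j·d_k through (j+1)·d_k − 1 of the projected features, which is why the split costs nothing beyond the projection itself.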

After attention, the head outputs are concatenated by reshaping (N, h, d_k) back to (N, d_model). The final projection W_O then mixes the heads' outputs.
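The full split, attend, concat, and mix path can be sketched as follows. This is a minimal NumPy illustration assuming unmasked scaled dot-product attention, no batch dimension, and no biases; the helper names (split_heads, concat_heads, multi_head_attention) are hypothetical and chosen only for this example.

```python
import numpy as np

def split_heads(x, h):
    """(N, d_model) -> (h, N, d_k): a reshape plus an axis swap."""
    n, d_model = x.shape
    assert d_model % h == 0, "d_model must be divisible by h"
    return x.reshape(n, h, d_model // h).transpose(1, 0, 2)

def concat_heads(x):
    """(h, N, d_k) -> (N, d_model): the inverse of split_heads."""
    h, n, d_k = x.shape
    return x.transpose(1, 0, 2).reshape(n, h * d_k)

def softmax(z):
    # Subtract the row max for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, h):
    q = split_heads(x @ w_q, h)
    k = split_heads(x @ w_k, h)
    v = split_heads(x @ w_v, h)
    d_k = q.shape[-1]
    # Each head attends independently: scores has shape (h, N, N).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)
    out = softmax(scores) @ v            # (h, N, d_k)
    return concat_heads(out) @ w_o       # W_O mixes the heads' outputs

# Illustrative sizes (assumed).
rng = np.random.default_rng(0)
N, d_model, h = 5, 8, 2
x = rng.standard_normal((N, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))

y = multi_head_attention(x, w_q, w_k, w_v, w_o, h)
print(y.shape)  # (5, 8)
```

Note that concat_heads(split_heads(x, h)) returns x unchanged, which is the sense in which the split and concat are pure layout operations.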

★ KEY TAKEAWAY
The multi-head split = one big linear projection + a reshape into h heads. The split itself costs no extra compute; it is just a different view of the same layout.
▶ WHAT TO TRY
  • Increase h to see d_model split into more heads (each smaller).
  • Each head computes attention independently in its own d_k = d_model / h dimensional space.
  • Grouped-query attention (GQA) shares each K/V head across a group of query heads to save KV-cache memory.
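The GQA idea in the last bullet can be sketched in a few lines. This is a rough NumPy illustration, assuming h query heads, h_kv key/value heads with h divisible by h_kv, and unmasked attention; the function name gqa_attention and all sizes are hypothetical. Only the h_kv key/value heads would need to be cached, and they are expanded to match the query heads at attention time.

```python
import numpy as np

def gqa_attention(q, k, v):
    """q: (h, N, d_k); k, v: (h_kv, N, d_k), with h divisible by h_kv."""
    h, n, d_k = q.shape
    h_kv = k.shape[0]
    assert h % h_kv == 0, "h must be a multiple of h_kv"
    group = h // h_kv
    # Consecutive query heads share one K/V head: [k0, k0, k1, k1, ...].
    k = np.repeat(k, group, axis=0)  # (h, N, d_k)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, N, N)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (h, N, d_k)

# Illustrative sizes (assumed): 4 query heads sharing 2 K/V heads.
rng = np.random.default_rng(0)
h, h_kv, N, d_k = 4, 2, 5, 2
q = rng.standard_normal((h, N, d_k))
k = rng.standard_normal((h_kv, N, d_k))
v = rng.standard_normal((h_kv, N, d_k))

out = gqa_attention(q, k, v)

# Cached values: K and V for h_kv heads, versus h heads in standard attention.
gqa_cache = k.size + v.size          # 2 * h_kv * N * d_k
mha_cache = 2 * h * N * d_k
print(out.shape, gqa_cache, mha_cache)
```

With h_kv = h this reduces to ordinary multi-head attention, and with h_kv = 1 every query head reads the same K/V head.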