Multi-Head Split + Concat — Belgavi.AI Lab

d_model = h · d_k, so each head gets d_k = d_model / h features (d_model must be divisible by h). Splitting is a reshape; it adds no extra compute.

One linear projection W_Q produces d_model features for each of the N rows. Reshaping the result from (N, d_model) to (N, h, d_k) treats those features as h heads of d_k each.
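A minimal NumPy sketch of the split described above. The sizes (N = 4, d_model = 12, h = 3) and the random X and W_Q are illustrative assumptions, not values from the page. It checks that reshaping one big projection gives the same numbers as projecting with each head's own column block of W_Q, and that the reshape is a view rather than a copy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed for this sketch)
N, d_model, h = 4, 12, 3
d_k = d_model // h
assert d_model == h * d_k  # d_model must be divisible by h

X = rng.standard_normal((N, d_model))          # input rows
W_Q = rng.standard_normal((d_model, d_model))  # one big query projection

Q = X @ W_Q                      # shape (N, d_model)
Q_heads = Q.reshape(N, h, d_k)   # shape (N, h, d_k); no arithmetic involved

# Equivalent view: head i uses columns [i*d_k, (i+1)*d_k) of W_Q
Q_per_head = np.stack(
    [X @ W_Q[:, i * d_k:(i + 1) * d_k] for i in range(h)],
    axis=1,
)

same_values = np.allclose(Q_heads, Q_per_head)  # both routes agree numerically
shares_memory = np.shares_memory(Q, Q_heads)    # the reshape is a view of Q
```

Doing one large matrix multiply and reshaping is usually preferred over h small multiplies because it keeps the work in a single call.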

After attention, concatenate the heads by reshaping (N, h, d_k) back to (N, d_model). The final projection W_O then mixes the heads' outputs.
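The full split, attend, concat, and mix flow can be sketched as below. This is a hedged illustration, not the page's own code: it assumes standard scaled dot-product attention (softmax of Q·Kᵀ / √d_k), no masking or biases, and the function names `softmax` and `multi_head_attention` and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Sketch assuming scaled dot-product attention, no mask, no bias."""
    N, d_model = X.shape
    d_k = d_model // h

    def split(M):
        # (N, d_model) -> (N, h, d_k) -> (h, N, d_k) so each head is a batch entry
        return M.reshape(N, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, N, N)
    heads = softmax(scores, axis=-1) @ V              # (h, N, d_k)

    # Concat: (h, N, d_k) -> (N, h, d_k) -> (N, d_model)
    concat = heads.transpose(1, 0, 2).reshape(N, d_model)
    return concat @ W_O                               # W_O mixes the heads

# Usage with illustrative sizes (assumed)
rng = np.random.default_rng(0)
N, d_model, h = 5, 8, 2
X = rng.standard_normal((N, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)  # shape (N, d_model)
```

Moving the head axis to the front lets one batched matrix multiply handle every head at once; the concat step simply undoes that transpose before reshaping.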

★ KEY TAKEAWAY

Multi-head attention = one big linear projection + reshape into h heads. The split itself adds no compute; it is just a different view of the same layout.

▶ WHAT TO TRY