Advertisement
d_model = h · d_k. Splitting is a reshape; no extra compute.
What you're seeing
One linear projection W_Q produces d_model features. Reshape to (N, h, d_k) treats them as h heads of d_k each.
After attention: reshape concat to (N, d_model). Final W_O mixes heads' outputs.
★ KEY TAKEAWAY
Multi-head attention = one big linear projection + reshape into h heads. No extra compute, just a layout view.
▶ WHAT TO TRY
- Increase h to see d_model split into more heads (each smaller).
- Each head will compute attention independently in d_k=d/h space.
- GQA shares K/V across query heads to save KV-cache memory.