Each head sees the same input but learns different attention patterns.
What you're seeing
Multi-head attention runs N attention computations in parallel, each on its own learned projection of the input. Each head can specialize: one may track syntax, another semantic similarity, another positional patterns.
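As a rough illustration of the N parallel computations, here is a minimal NumPy sketch. The function name `multi_head_attention`, the weight names `w_q`, `w_k`, `w_v`, `w_o`, and the random weights are assumptions made for this example, not the code behind the visualization. It omits masking and batching, and it includes the final concatenate-and-project step so the output shape matches the input.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Hypothetical helper: unmasked self-attention over one sequence."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    # Every head sees the same input x, but through its own slice of the projections.
    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # One attention pattern per head: (n_heads, seq_len, seq_len)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    per_head = weights @ v  # (n_heads, seq_len, d_head)
    # Concatenate the heads and project back to d_model.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5)
```

Each `weights[h]` is one head's attention map, comparable to a single panel in the visualization. With random weights the maps already differ from head to head; in a trained model those differences become the specializations described above.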
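The head outputs are concatenated and projected back to the model dimension. Grouped-query attention (GQA), used in Llama 3 and the larger Llama 2 models, lets several query heads share one K/V head: query heads outnumber K/V heads, which shrinks the KV cache while largely preserving multi-head benefits.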
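The K/V sharing in GQA can be sketched as follows. This is an illustrative NumPy example under stated assumptions: `grouped_query_attention` and `kv_cache_elements` are hypothetical helper names, the head counts (8 query heads, 2 K/V heads) are arbitrary rather than taken from any Llama configuration, and the cache count covers one layer and one sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d_head); k, v: (n_kv_heads, seq, d_head)."""
    n_q_heads, _, d_head = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group = n_q_heads // n_kv_heads

    # Each K/V head is shared by `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)  # (n_q_heads, seq, d_head)
    v = np.repeat(v, group, axis=0)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    return softmax(scores) @ v  # (n_q_heads, seq, d_head)

def kv_cache_elements(n_kv_heads, seq_len, d_head):
    # Only K and V are cached, so the count scales with the number of K/V heads.
    return 2 * n_kv_heads * seq_len * d_head

rng = np.random.default_rng(0)
n_q_heads, n_kv_heads, seq_len, d_head = 8, 2, 6, 4
q = rng.normal(size=(n_q_heads, seq_len, d_head))
k = rng.normal(size=(n_kv_heads, seq_len, d_head))
v = rng.normal(size=(n_kv_heads, seq_len, d_head))

out = grouped_query_attention(q, k, v)
mha_cache = kv_cache_elements(n_q_heads, seq_len, d_head)   # one K/V head per query head
gqa_cache = kv_cache_elements(n_kv_heads, seq_len, d_head)  # shared K/V heads
print(out.shape)               # (8, 6, 4)
print(mha_cache // gqa_cache)  # 4
```

The output still has one slice per query head, so each query head keeps its own attention pattern; only the stored keys and values shrink, here by the ratio of query heads to K/V heads.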
★ KEY TAKEAWAY
Different heads learn different patterns: position, identity, syntax, semantics. Multi-head ≠ single big head.
▶ WHAT TO TRY
- Increase Heads to 8 — see distinct patterns per head.
- Click Resample to see new initializations.