Multi-Head Attention — Belgavi.AI Lab

Heads 4

Each head sees the same input but learns different attention patterns.

What you're seeing

Multi-head attention runs N parallel attention computations on projections of the input. Each head can specialize: one tracks syntax, another semantic similarity, another positional patterns.
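As a rough illustration of "N parallel attention computations on projections of the input", here is a minimal NumPy sketch that computes one attention-weight matrix per head. The function name `head_attention_weights`, the random weights, and the toy sizes are assumptions made for this example and are not taken from the demo's own code. No training happens here, so the heads differ only because their random projections differ, not because they have specialized.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_attention_weights(x, w_q, w_k, n_heads):
    """Return attention weights of shape (n_heads, seq_len, seq_len)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # Project the input, then split the feature axis into heads:
    # (seq_len, d_model) -> (n_heads, seq_len, d_head)
    q = (x @ w_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product scores, computed independently for each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    return softmax(scores, axis=-1)

# Hypothetical toy sizes: 6 tokens, model width 16, 4 heads.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_model, d_model))

weights = head_attention_weights(x, w_q, w_k, n_heads)
print(weights.shape)  # (4, 6, 6): one 6x6 attention pattern per head
```

Every head receives the same `x`, but each one works with its own slice of the projected queries and keys, so each produces its own attention pattern.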

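The head outputs are concatenated and projected back to the model dimension. In grouped-query attention (GQA), used in the larger Llama 2 models and in Llama 3, several query heads share each K/V head, so there are fewer K/V heads than query heads. This shrinks the KV cache while preserving most of the multi-head benefits.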
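The concatenate-and-project step and the GQA head sharing can be sketched together in NumPy. This is a simplified illustration under stated assumptions, not the implementation used by Llama or by this demo: the function name `grouped_query_attention`, the random weights, and the toy sizes are hypothetical, and causal masking and positional encoding are left out. Setting `n_kv_heads` equal to `n_q_heads` gives ordinary multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(t, n_heads):
    # (seq_len, n_heads * d_head) -> (n_heads, seq_len, d_head)
    seq_len, width = t.shape
    return t.reshape(seq_len, n_heads, width // n_heads).transpose(1, 0, 2)

def grouped_query_attention(x, w_q, w_k, w_v, w_o, n_q_heads, n_kv_heads):
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    seq_len, d_model = x.shape
    d_head = d_model // n_q_heads
    q = split_heads(x @ w_q, n_q_heads)    # (n_q_heads, seq_len, d_head)
    k = split_heads(x @ w_k, n_kv_heads)   # (n_kv_heads, seq_len, d_head)
    v = split_heads(x @ w_v, n_kv_heads)
    # Each K/V head serves a group of consecutive query heads.
    group = n_q_heads // n_kv_heads
    k_shared = np.repeat(k, group, axis=0)  # (n_q_heads, seq_len, d_head)
    v_shared = np.repeat(v, group, axis=0)
    weights = softmax(q @ k_shared.transpose(0, 2, 1) / np.sqrt(d_head))
    per_head = weights @ v_shared           # (n_q_heads, seq_len, d_head)
    # Concatenate the heads along the feature axis, then project back.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    # Only the un-repeated k and v would need to be stored in a KV cache.
    return concat @ w_o, (k, v)

# Hypothetical toy sizes: 4 query heads sharing 2 K/V heads.
rng = np.random.default_rng(0)
seq_len, d_model, n_q_heads, n_kv_heads = 6, 16, 4, 2
d_head = d_model // n_q_heads
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_model, n_kv_heads * d_head))
w_v = rng.normal(size=(d_model, n_kv_heads * d_head))
w_o = rng.normal(size=(d_model, d_model))

out, (k_cache, v_cache) = grouped_query_attention(
    x, w_q, w_k, w_v, w_o, n_q_heads, n_kv_heads
)
print(out.shape)      # (6, 16): same shape as the input
print(k_cache.shape)  # (2, 6, 4): 2 K/V heads instead of 4
```

In this toy setup the cached K and V tensors hold half as many values as they would with one K/V head per query head, which is the memory saving GQA is after.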

★ KEY TAKEAWAY

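Different heads can learn different patterns: position, identity, syntax, semantics. Multi-head attention is not the same as one big head, because each head computes its own softmax over its own projection of the input.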

▶ WHAT TO TRY

Increase Heads to 8 to see a distinct pattern per head.
Click Resample to see new initializations.