Attention Mechanism — Transformer Foundation

Scaled dot-product

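Q, K, V come from linear projections of the input. Scores = QK^T. Divide by √d_k (d_k = key dimension) so scores don't grow with d_k and saturate the softmax, which would shrink its gradients. Softmax over keys → weights. Output = weights · V.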
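
A minimal sketch of the steps above in plain Python, assuming Q, K, V are already projected and given as lists of row vectors. The function names and toy inputs are illustrative, not from any library.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        # Score this query against every key, then divide by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output row is the weighted average of the value vectors.
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out, weights

# Toy inputs (assumed for illustration): two positions, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, and each query puts more weight on the key it aligns with, so `out[0]` leans toward `V[0]`.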

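Multi-head: several attention 'heads' run in parallel, each with its own Q/K/V projections. Head outputs are concatenated and passed through a final linear projection. Different heads can learn different relations between positions.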
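
A rough sketch of multi-head attention in plain Python. It assumes the standard Transformer arrangement, which these notes do not spell out: every head reads the same input X, and the concatenated head outputs go through an output projection. The names (`multi_head`, `W_o`, `rand_matrix`) and the random weights are illustrative stand-ins for learned parameters.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(c) for c in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def rand_matrix(rows, cols, rng):
    # Random weights stand in for learned projections (assumption).
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head(X, heads, W_o):
    # heads: list of (W_q, W_k, W_v) tuples, one per head.
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the heads' outputs for each position along the feature axis.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)
n, d_model, n_heads = 3, 4, 2
d_head = d_model // n_heads
X = rand_matrix(n, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(n_heads)]
W_o = rand_matrix(n_heads * d_head, d_model, rng)
Y = multi_head(X, heads, W_o)  # shape: n x d_model
```

Splitting d_model across heads (d_head = d_model / n_heads) keeps the total cost close to that of a single full-width head. That split is a common convention, assumed here.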

Self-attention: Q, K, V all come from the same input sequence. Each position attends to every position in the context window (itself included), unless a mask blocks some of them.
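
To make "each position attends to all others" concrete, here is a small sketch that computes only the attention weights with Q = K = X. Using the raw input with no learned projections is an assumption made for brevity, since real layers project X first. The function name is hypothetical.

```python
import math

def self_attention_weights(X):
    """Attention weights when queries and keys are both the input X."""
    d = len(X[0])
    weights = []
    for q in X:
        # Every position scores itself and every other position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy sequence (assumed): three positions, two features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = self_attention_weights(X)  # 3 x 3: one row of weights per position
```

The result is an n × n matrix with strictly positive entries, so without a mask every position draws at least a little from every other one.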