Scaled dot-product
Q, K, V from linear projections. QK^T scores. Scale by √d to keep softmax gradient healthy. Softmax → weights.
Advertisement
Multi-head
Multiple attention 'heads' in parallel, each with own projections. Different heads learn different relations.
Advertisement
Self-attention
Q, K, V all from same input. Each position attends to all others (context window).