Each cell [i,j] is the attention score (similarity) from query i to key j.
What you're seeing
Q and K are linear projections of the token embeddings. Scores = Q·Kᵀ / sqrt(d_k), where d_k is the dimension of each key vector. Softmax over each row → attention weights.
Causal mask: scores above the diagonal are set to -∞, so each token attends only to itself and earlier tokens (for autoregressive LMs).
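The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the demo's actual code: the helper names `softmax` and `attention_weights`, the sizes `n = 4` and `d_k = 8`, and the random Gaussian Q and K are all assumptions made for the example. It uses only the standard library.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over one row; -inf entries get weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K, causal=True, scale=True):
    """Return the attention-weight matrix (one row per query) for Q and K."""
    d_k = len(K[0])
    weights = []
    for i in range(len(Q)):
        row = []
        for j in range(len(K)):
            s = sum(q * k for q, k in zip(Q[i], K[j]))  # dot product Q[i]·K[j]
            if scale:
                s /= math.sqrt(d_k)  # scale by sqrt(d_k)
            if causal and j > i:
                s = -math.inf  # mask positions above the diagonal
            row.append(s)
        weights.append(softmax(row))  # softmax over the row
    return weights

# Hypothetical stand-ins for the projected token embeddings.
random.seed(0)
n, d_k = 4, 8
Q = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n)]

W = attention_weights(Q, K)
for row in W:
    print(["%.2f" % w for w in row])
```

With the causal mask on, every entry above the diagonal prints as 0.00 and the first row puts all of its weight on the first token, since that is the only position it may attend to.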
★ KEY TAKEAWAY
Attention scores = Q·Kᵀ (divided by sqrt(d_k) when scaling is on). Each cell [i,j] is how much query i attends to key j. Softmax over each row gives a probability distribution over the keys.
▶ WHAT TO TRY
- Toggle Causal mask to see how -∞ above the diagonal forces autoregressive behavior.
- Toggle Scale by sqrt(d_k) to see how it keeps the softmax from saturating.
- Click Resample Q, K for new patterns.
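The effect of the scaling toggle can also be reproduced offline. The sketch below is an illustration under assumed values (`d_k = 64`, six keys, random Gaussian vectors), not the demo's own code. It compares the largest softmax weight for one query with and without dividing the scores by sqrt(d_k).

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical query and keys; larger d_k makes raw dot products larger.
random.seed(1)
d_k, n_keys = 64, 6
q = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]

raw = [sum(a * b for a, b in zip(q, k)) for k in keys]  # unscaled scores
scaled = [s / math.sqrt(d_k) for s in raw]              # scaled scores

peak_raw = max(softmax(raw))
peak_scaled = max(softmax(scaled))
print(f"largest weight without scaling: {peak_raw:.3f}")
print(f"largest weight with scaling:    {peak_scaled:.3f}")
```

Dividing the scores by a constant greater than 1 can only flatten the softmax, so the largest weight without scaling is never smaller than the largest weight with it. When the unscaled peak sits close to 1, the softmax is saturated and the remaining keys receive almost no weight.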