Attention Score Matrix — Belgavi.AI Lab

Controls: Causal mask | Scale by sqrt(d_k)

Each cell [i,j] is the attention score (similarity) from query i to key j.

Q and K are linear projections of the token embeddings. Scores = Q·Kᵀ / sqrt(d_k), where d_k is the dimension of each key vector. A softmax over each row turns the scores into attention weights.
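The computation above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the page's own code: the function names `attention_scores` and `softmax_rows`, the sizes, and the random Q and K (standing in for real projections) are all assumptions.

```python
import math
import random

def attention_scores(Q, K, scale=True):
    """Return Q·Kᵀ as a list of rows, optionally divided by sqrt(d_k)."""
    d_k = len(K[0])
    factor = math.sqrt(d_k) if scale else 1.0
    return [
        [sum(q * k for q, k in zip(q_row, k_row)) / factor for k_row in K]
        for q_row in Q
    ]

def softmax_rows(scores):
    """Apply a numerically stable softmax to each row independently."""
    weights = []
    for row in scores:
        m = max(row)  # subtract the row max so exp() cannot overflow
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical sizes; random values stand in for projected embeddings.
rng = random.Random(0)
n_tokens, d_k = 4, 8
Q = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(n_tokens)]
K = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(n_tokens)]

scores = attention_scores(Q, K)   # cell [i][j]: query i against key j
weights = softmax_rows(scores)    # each row is a distribution over keys
```

Each row of `weights` sums to 1, so row i reads as how query i spreads its attention across all keys.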

Causal mask: sets every score above the diagonal to -∞ before the softmax, so each token attends only to itself and earlier tokens (as in autoregressive LMs).
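A small sketch of the masking step, assuming a square score matrix stored as a list of rows. The helper names `apply_causal_mask` and `softmax_rows` are illustrative, and the all-zero scores are chosen only so the effect of the mask is easy to read.

```python
import math

def apply_causal_mask(scores):
    """Replace every entry above the diagonal (j > i) with -inf."""
    return [
        [s if j <= i else -math.inf for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]

def softmax_rows(scores):
    """Row-wise softmax; exp(-inf) is 0.0, so masked cells get zero weight."""
    weights = []
    for row in scores:
        m = max(row)  # finite, because the diagonal is never masked
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Equal scores everywhere, so any non-uniformity comes from the mask alone.
scores = [[0.0] * 3 for _ in range(3)]
weights = softmax_rows(apply_causal_mask(scores))

# weights[0] == [1.0, 0.0, 0.0]   token 0 sees only itself
# weights[1] == [0.5, 0.5, 0.0]   token 1 sees tokens 0 and 1
# weights[2] is 1/3 in every cell (no future tokens to hide)
```

Because the diagonal stays unmasked, every row keeps at least one finite score and the softmax is always well defined.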

★ KEY TAKEAWAY

Attention scores = Q·Kᵀ / sqrt(d_k). Each cell [i,j] is how strongly query i matches key j. A softmax over each row turns those scores into a probability distribution over keys.

▶ WHAT TO TRY

Toggle Causal mask to see how -∞ above the diagonal forces autoregressive behavior.
Toggle Scale by sqrt(d_k) to see how it keeps the softmax from saturating.
Click Resample Q, K for new patterns.
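The scaling toggle can also be explored offline. The sketch below assumes one query and a handful of keys drawn from a standard normal distribution (hypothetical sizes, not the page's settings) and compares the largest softmax weight with and without the division by sqrt(d_k). A peak weight near 1 means the softmax has saturated onto a single key.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax of one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def peak_weight(q, K, scale):
    """Largest attention weight for query q over keys K."""
    d_k = len(q)
    factor = math.sqrt(d_k) if scale else 1.0
    row = [sum(a * b for a, b in zip(q, k)) / factor for k in K]
    return max(softmax(row))

# Assumed sizes: a large d_k makes raw dot products large in magnitude.
rng = random.Random(0)
n_keys, d_k = 8, 256
q = [rng.gauss(0, 1) for _ in range(d_k)]
K = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]

peak_unscaled = peak_weight(q, K, scale=False)
peak_scaled = peak_weight(q, K, scale=True)
print(f"peak weight without scaling: {peak_unscaled:.3f}")
print(f"peak weight with scaling:    {peak_scaled:.3f}")
```

Dividing all scores by a constant greater than 1 can only flatten the softmax, so the scaled peak never exceeds the unscaled one. For unit-variance inputs the raw dot products have a spread that grows like sqrt(d_k), which is why that particular divisor is used.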