Each row = query position; each column = key position. Green = attends; dark = masked.
What you're seeing
The attention mask determines which key positions each query position is allowed to attend to.
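As a rough illustration of how a mask takes effect, the sketch below sets the scores of masked positions to negative infinity before the softmax, so those positions receive zero attention weight. The function name `masked_attention_weights` and the toy inputs are made up for this example, and NumPy is assumed; real implementations differ in detail.

```python
import numpy as np

def masked_attention_weights(scores, mask):
    """Softmax over keys, with masked-out (False) positions forced to weight 0.

    scores: (n_query, n_key) float array of raw attention logits.
    mask:   (n_query, n_key) boolean array; True means the query may attend.
    Assumes every row of `mask` has at least one True entry.
    """
    masked = np.where(mask, scores, -np.inf)               # hide disallowed keys
    masked = masked - masked.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(masked)                               # exp(-inf) == 0
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy example: 3 positions, uniform logits, a hand-written mask.
scores = np.zeros((3, 3))
mask = np.array([
    [True, False, False],
    [True, True,  False],
    [True, True,  True],
])
weights = masked_attention_weights(scores, mask)
print(weights.round(2))
```

Each row of `weights` sums to 1, and every position that is False in `mask` gets exactly zero weight.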
Causal (GPT, Llama): each token attends only to itself and earlier tokens. Lower triangular, diagonal included.
Bidirectional (BERT): each token attends to all positions. Full matrix.
Prefix-LM (studied as a variant in the T5 paper): bidirectional within the prompt, causal over the generated tokens; generated tokens also see the whole prompt.
Sliding window (Phi, Mistral): each token attends only to the most recent W positions. Diagonal band.
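The four patterns above can be sketched as boolean matrices, with rows as query positions, columns as key positions, and True meaning "attends". This is a minimal NumPy sketch, not any model's actual implementation: the function names and the `prefix_len` and `window` parameters are made up for illustration, and the sliding window is assumed to be causal and to count the current token as one of its W positions.

```python
import numpy as np

def causal_mask(n):
    # Lower triangle including the diagonal: a token sees itself and earlier tokens.
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n):
    # Full matrix: every token sees every position.
    return np.ones((n, n), dtype=bool)

def prefix_lm_mask(n, prefix_len):
    # Start from causal, then let every position see the whole prefix (prompt).
    mask = causal_mask(n)
    mask[:, :prefix_len] = True
    return mask

def sliding_window_mask(n, window):
    # Assumption: causal band of width `window`, current token included.
    q = np.arange(n)[:, None]  # query positions (rows)
    k = np.arange(n)[None, :]  # key positions (columns)
    return (k <= q) & (k > q - window)

n = 5
for name, m in [
    ("causal", causal_mask(n)),
    ("bidirectional", bidirectional_mask(n)),
    ("prefix-LM (prefix_len=2)", prefix_lm_mask(n, 2)),
    ("sliding window (W=2)", sliding_window_mask(n, 2)),
]:
    print(name)
    print(m.astype(int))
```

Printing the matrices as 0s and 1s gives the same shapes as the visualization: a lower triangle, a full square, a triangle with a filled prompt block, and a narrow band along the diagonal.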
★ KEY TAKEAWAY
Different mask = different attention pattern. Causal for autoregressive LMs, bidirectional for BERT-style encoders, prefix-LM for prompt-conditioned generation, sliding window for Mistral-style models.
▶ WHAT TO TRY
- Switch between the four mask types.
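- Notice that causal is the lower triangle, sliding window is a band, bidirectional is the full matrix, and prefix-LM is a lower triangle with a fully filled prompt block.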