Attention Mask Visualizer — Belgavi.AI Lab


Mask types: Causal, Bidirectional, Prefix-LM, Sliding window

Each row = query position; each column = key position. Green = attends; dark = masked.

The attention mask determines which key positions each query position is allowed to attend to.
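One common way a mask takes effect is by excluding masked positions from the softmax over a query's scores, so they receive exactly zero weight. The sketch below shows this for a single query row; the function name `masked_softmax` and the boolean convention (True = may attend) are assumptions made for illustration, not something the visualizer specifies.

```python
import math

def masked_softmax(scores, allowed):
    """Softmax over one query's scores, restricted to allowed key positions.

    scores:  raw query-key scores for a single query position
    allowed: booleans, True where the query may attend (assumed convention)
    """
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        raise ValueError("at least one key position must be allowed")
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Three keys with equal scores; the third one is masked out.
weights = masked_softmax([2.0, 2.0, 2.0], [True, True, False])
print(weights)  # [0.5, 0.5, 0.0]
```

Masked keys contribute nothing to the output, and the remaining weights still sum to 1.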

Causal (GPT, Llama): each token attends only to itself and earlier tokens. Lower triangular, diagonal included.

Bidirectional (BERT): each token attends to all positions. Full matrix.

Prefix-LM (e.g. UniLM; studied as a variant in the T5 paper): bidirectional over the prompt (the prefix), causal over the generated tokens.

Sliding window (Phi, Mistral): each token attends only to the most recent W positions, and never to later ones. Diagonal band along the lower triangle.
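The four patterns above can be sketched as small boolean matrices, with rows as query positions and columns as key positions. This is a minimal illustration rather than any model's actual implementation: the function names, the `prefix_len` parameter, and the choice to count the current token as part of the window of size `window` are all assumptions.

```python
def causal_mask(n):
    # Query q may attend to key k when k is at or before q.
    return [[k <= q for k in range(n)] for q in range(n)]

def bidirectional_mask(n):
    # Every query may attend to every key.
    return [[True] * n for _ in range(n)]

def prefix_lm_mask(n, prefix_len):
    # Causal everywhere, plus full visibility among the first prefix_len keys.
    return [[k <= q or k < prefix_len for k in range(n)] for q in range(n)]

def sliding_window_mask(n, window):
    # Causal, but limited to the `window` most recent keys
    # (assumption: the current token counts toward the window).
    return [[0 <= q - k < window for k in range(n)] for q in range(n)]

def render(mask):
    # '#' = attends, '.' = masked
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)

print(render(prefix_lm_mask(4, 2)))
# ##..
# ##..
# ###.
# ####
print(render(sliding_window_mask(4, 2)))
# #...
# ##..
# .##.
# ..##
```

In the prefix-LM output the top-left 2×2 block is fully visible, while the sliding-window output shows the band of width 2 hugging the diagonal.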

★ KEY TAKEAWAY

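The mask defines the attention pattern: causal for autoregressive LMs, bidirectional for BERT-style encoders, prefix-LM for models that read a prompt bidirectionally and then generate causally, and sliding window for Mistral-style models that limit how far back each token looks.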

▶ WHAT TO TRY