Each Q tile × K tile block fits in SRAM. Online softmax carries the normalization across tiles, so the full N×N matrix is never built.
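Some back-of-the-envelope arithmetic makes "fits in SRAM" concrete. The sizes used here (N = 8192, 128 × 128 tiles, head dimension 64, fp16 elements) are illustrative assumptions, not values taken from the visualization. The tile count also ignores the running output, max, and sum that a real kernel keeps on chip.

```python
# Illustrative assumptions: fp16 elements (2 bytes), one attention head.
# Not tied to any specific GPU's SRAM capacity.
BYTES_FP16 = 2

def full_scores_bytes(n: int) -> int:
    """Bytes needed to store the full N x N score matrix for one head."""
    return n * n * BYTES_FP16

def tile_working_set_bytes(b_r: int, b_c: int, d: int) -> int:
    """Bytes for one Q tile (b_r x d), one K tile and one V tile (b_c x d each),
    and the b_r x b_c score block computed from them.
    Ignores the output accumulator and the running max/sum."""
    q = b_r * d
    kv = 2 * b_c * d
    s = b_r * b_c
    return (q + kv + s) * BYTES_FP16

n, b_r, b_c, d = 8192, 128, 128, 64
print(full_scores_bytes(n) // 2**20, "MiB for the full score matrix")       # 128
print(tile_working_set_bytes(b_r, b_c, d) // 2**10, "KiB per tile step")    # 80
```

Under these assumptions, one tile step touches about 80 KiB, while the full score matrix would take 128 MiB per head. That gap is why the score matrix is kept tile-sized instead of materialized.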
What you're seeing
Same math as standard attention, but the full N×N attention matrix is never materialized. 5-10× faster on long context.
★ KEY TAKEAWAY
FlashAttention computes attention block-by-block in SRAM, never materializing the N×N matrix. Same math; 5-10× faster on long context.
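The "same math" claim can be checked with a small pure-Python sketch that compares naive attention (full score matrix) against block-by-block attention with online softmax. This is an illustration of the technique under stated assumptions, not the FlashAttention CUDA kernel. It assumes a single head, no masking, a 1/sqrt(d) scale, and one possible loop order (outer loop over Q tiles). The function names and tile sizes are hypothetical.

```python
import math
import random

def naive_attention(Q, K, V):
    """Reference: build every row of the N x N score matrix, then softmax it."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) / total
                    for c in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_q=2, block_k=3):
    """Block-by-block attention: only a block_q x block_k score tile exists at once.
    Each query row keeps a running max m, a running sum l, and an unnormalized output acc."""
    n_q, n_k, d_v = len(Q), len(K), len(V[0])
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = [[0.0] * d_v for _ in range(n_q)]
    for i0 in range(0, n_q, block_q):                 # loop over Q tiles
        rows = range(i0, min(i0 + block_q, n_q))
        m = {i: -math.inf for i in rows}
        l = {i: 0.0 for i in rows}
        acc = {i: [0.0] * d_v for i in rows}
        for j0 in range(0, n_k, block_k):             # loop over K/V tiles
            cols = range(j0, min(j0 + block_k, n_k))
            for i in rows:
                s = [scale * sum(a * b for a, b in zip(Q[i], K[j])) for j in cols]
                m_new = max(m[i], max(s))
                alpha = math.exp(m[i] - m_new)        # rescales old state; exp(-inf) == 0.0
                p = [math.exp(x - m_new) for x in s]
                l[i] = alpha * l[i] + sum(p)
                acc[i] = [alpha * acc[i][c] + sum(pj * V[j][c] for pj, j in zip(p, cols))
                          for c in range(d_v)]
                m[i] = m_new
        for i in rows:                                # normalize once, after all K tiles
            out[i] = [x / l[i] for x in acc[i]]
    return out

random.seed(0)
N, d = 7, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
ref = naive_attention(Q, K, V)
tiled = tiled_attention(Q, K, V)
print(max(abs(a - b) for r1, r2 in zip(ref, tiled) for a, b in zip(r1, r2)))
```

The printed difference should be at floating-point rounding level. The tile sizes (2 and 3) deliberately do not divide N = 7, which exercises the ragged last tile. The speedup in the real kernel comes from keeping each tile in SRAM and avoiding HBM traffic for the N×N matrix. Pure Python cannot show that, only the exactness.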
▶ WHAT TO TRY
- Click Step to advance through (Q tile, K tile) pairs.
- Online softmax (running max + sum) gives exact result without ever storing the full matrix.
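The running max + sum idea can be isolated for a single row of logits. The following minimal sketch streams over chunks while keeping only two scalars, then checks the result against an ordinary softmax. The chunk boundaries and function names are illustrative assumptions, and each chunk is assumed to be non-empty.

```python
import math

def online_softmax_stats(chunks):
    """Stream over non-empty chunks of logits, keeping only a running max m
    and a running sum l = sum(exp(x - m)) over every logit seen so far."""
    m, l = -math.inf, 0.0
    for chunk in chunks:
        m_new = max(m, max(chunk))
        # The old sum was taken relative to the old max; rescale it to the new max.
        l = l * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in chunk)
        m = m_new
    return m, l

def softmax(xs):
    """Ordinary (numerically stable) softmax over the whole row at once."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

logits = [2.0, -1.0, 0.5, 3.0, 1.5, -0.5]
m, l = online_softmax_stats([logits[0:2], logits[2:4], logits[4:6]])
streamed = [math.exp(x - m) / l for x in logits]
print(m)  # 3.0
print(max(abs(a - b) for a, b in zip(streamed, softmax(logits))))
```

The rescale factor exp(m_old - m_new) is what makes the result exact rather than approximate. Each time a larger logit appears, everything accumulated so far is shifted to the new reference point.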