▶ Interactive Lab

Cache Blocking for Matmul

Tile the matrices so each block fits in cache.

Advertisement
Each block of C is computed from corresponding tiles of A and B.

What you're seeing

C[i:i+bs, j:j+bs] += A[i:i+bs, :] · B[:, j:j+bs]. Tiles fit in L1; reused many times.

★ KEY TAKEAWAY
Cache blocking: tile matmul so each block fits in L1. Reuse data many times before evicting. ~10× speedup over naive.
▶ WHAT TO TRY
  • Step through the blocks to see one tile of C being computed.
  • Each tile of A and B is loaded once but used multiple times.
  • BLAS libraries do this automatically, tuned for your CPU's cache sizes.