Advertisement
Each block of C is computed from corresponding tiles of A and B.
What you're seeing
C[i:i+bs, j:j+bs] += A[i:i+bs, :] · B[:, j:j+bs]. Tiles fit in L1; reused many times.
★ KEY TAKEAWAY
Cache blocking: tile matmul so each block fits in L1. Reuse data many times before evicting. ~10× speedup over naive.
▶ WHAT TO TRY
- Step through the blocks to see one tile of C being computed.
- Each tile of A and B is loaded once but used multiple times.
- BLAS libraries do this automatically, tuned for your CPU's cache sizes.