Almost all transformer compute is matrix multiplication. CPU implementations live in BLAS libraries (OpenBLAS, MKL, BLIS). They use three tricks: cache blocking, SIMD vectorization, and threading. Understanding them helps you debug 'why is this 100× slower than expected'.
Naive matmul
for i in range(M):
for j in range(N):
for k in range(K):
C[i,j] += A[i,k] * B[k,j]Three nested loops. O(MNK) multiplications. Cache-unfriendly: B[k,j] accesses jump by N every k iteration. For K=2048, that's 8 KB strides — almost certain L1 cache miss every access. Naive implementation: ~1 GFLOPS. Modern CPU peak: 100+ GFLOPS.
Cache blocking (tiling)
# Process MxN block of C with KxN slab of B, MxK slab of A:
for i in range(0, M, M_block):
for j in range(0, N, N_block):
for k in range(0, K, K_block):
# multiply A[i:i+M_block, k:k+K_block] *
# B[k:k+K_block, j:j+N_block]
# add to C[i:i+M_block, j:j+N_block]Choose block sizes so the working set fits in L1 cache (~32 KB). Each load is amortized over multiple compute ops. Typical: ~10× speedup vs naive.
SIMD vectorization
# AVX-512: process 16 FP32 or 32 BF16 values in parallel
# AMX (Sapphire Rapids+): tile registers, 1024 BF16 ops per cycleSingle instructions operating on vector registers. AVX-512 = 16 floats per cycle. AMX adds 2D tile registers for matrix tile multiplies — designed exactly for transformer-shaped workloads. ~4-16× speedup over scalar code.
Threading via OpenMP
# Parallelize outer loops:
#pragma omp parallel for collapse(2)
for i in range(0, M, M_block):
for j in range(0, N, N_block):
# ... block matmul ...Each thread handles different output blocks of C. Linear scaling up to ~8 cores; diminishing returns past that due to memory bandwidth. Set OMP_NUM_THREADS to physical core count (not hyperthreads) for best results.
What this means for PyTorch
torch.matmul calls into BLAS for the heavy lifting. On Intel: MKL (best). On AMD: BLIS or OpenBLAS. The library matters: BLIS vs naive OpenBLAS can be 2× different on Genoa. Verify your install: torch.__config__.show().