CPU Matmul Kernels and BLAS

Almost all transformer compute is matrix multiplication. CPU implementations live in BLAS libraries (OpenBLAS, MKL, BLIS). They use three tricks: cache blocking, SIMD vectorization, and threading. Understanding them helps you debug 'why is this 100× slower than expected'.

Advertisement

Naive matmul

for i in range(M):
    for j in range(N):
        for k in range(K):
            C[i,j] += A[i,k] * B[k,j]

Three nested loops. O(MNK) multiplications. Cache-unfriendly: B[k,j] accesses jump by N every k iteration. For K=2048, that's 8 KB strides — almost certain L1 cache miss every access. Naive implementation: ~1 GFLOPS. Modern CPU peak: 100+ GFLOPS.

Cache blocking (tiling)

# Process MxN block of C with KxN slab of B, MxK slab of A:
for i in range(0, M, M_block):
    for j in range(0, N, N_block):
        for k in range(0, K, K_block):
            # multiply A[i:i+M_block, k:k+K_block] *
            #          B[k:k+K_block, j:j+N_block]
            # add to C[i:i+M_block, j:j+N_block]

Choose block sizes so the working set fits in L1 cache (~32 KB). Each load is amortized over multiple compute ops. Typical: ~10× speedup vs naive.

Advertisement

SIMD vectorization

# AVX-512: process 16 FP32 or 32 BF16 values in parallel
# AMX (Sapphire Rapids+): tile registers, 1024 BF16 ops per cycle

Single instructions operating on vector registers. AVX-512 = 16 floats per cycle. AMX adds 2D tile registers for matrix tile multiplies — designed exactly for transformer-shaped workloads. ~4-16× speedup over scalar code.

Threading via OpenMP

# Parallelize outer loops:
#pragma omp parallel for collapse(2)
for i in range(0, M, M_block):
    for j in range(0, N, N_block):
        # ... block matmul ...

Each thread handles different output blocks of C. Linear scaling up to ~8 cores; diminishing returns past that due to memory bandwidth. Set OMP_NUM_THREADS to physical core count (not hyperthreads) for best results.

What this means for PyTorch

torch.matmul calls into BLAS for the heavy lifting. On Intel: MKL (best). On AMD: BLIS or OpenBLAS. The library matters: BLIS vs naive OpenBLAS can be 2× different on Genoa. Verify your install: torch.__config__.show().

Cache blocking + SIMD + threading = 100× speedup over naive. BLAS does all of this. Use MKL on Intel, BLIS/OpenBLAS on AMD.