CPU Cache Hierarchy and Transformer Inference

Modern CPUs place several cache levels in front of main memory: L1 (~32 KB, ~1 ns), L2 (~512 KB, ~3 ns), and L3 (~32 MB, ~10 ns), backed by RAM (~64 GB, ~80 ns). A 350M-parameter model quantized to 4 bits is about 200 MB, far too large to stay resident in cache, so inference is bottlenecked by streaming weights from RAM.

Bandwidth, not FLOPs

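A modern CPU delivers roughly 100-500 GFLOPS of compute but only 50-100 GB/s of memory bandwidth. When each weight is reused many times after it is loaded (training, large batches), FLOPs dominate. When each weight is read once per token (inference at batch=1), bandwidth dominates. Small language model (SLM) inference at batch=1 is therefore bandwidth-bound.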
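The compute-versus-bandwidth distinction can be made concrete with a roofline-style estimate. The sketch below is illustrative only: the 300 GFLOPS peak is an assumed value picked from the range above, 70 GB/s is the bandwidth figure used later in this article, and the function names are hypothetical. It assumes FP16 weights (2 bytes each) and counts 2 FLOPs (one multiply, one add) per weight per sequence in the batch.

```python
def arithmetic_intensity(batch: int, bytes_per_weight: float) -> float:
    """FLOPs performed per byte of weights read from RAM."""
    # Each weight is loaded once and used for one multiply-add per sequence.
    return 2.0 * batch / bytes_per_weight


def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float, intensity: float) -> float:
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * intensity)


def is_bandwidth_bound(peak_gflops: float, bandwidth_gb_s: float, intensity: float) -> bool:
    """True when memory traffic, not arithmetic, sets the ceiling."""
    return bandwidth_gb_s * intensity < peak_gflops


PEAK_GFLOPS = 300.0    # assumed, within the 100-500 GFLOPS range
BANDWIDTH_GB_S = 70.0  # assumed RAM bandwidth

for batch in (1, 8, 64):
    ai = arithmetic_intensity(batch, bytes_per_weight=2.0)  # FP16 weights
    bound = "bandwidth" if is_bandwidth_bound(PEAK_GFLOPS, BANDWIDTH_GB_S, ai) else "compute"
    gflops = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, ai)
    print(f"batch={batch:3d}  intensity={ai:5.1f} FLOP/byte  attainable={gflops:6.1f} GFLOPS  ({bound}-bound)")
```

Under these assumptions, batch=1 leaves most of the arithmetic capacity idle, and only large batches push the workload toward the compute ceiling.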

Inference cost = weight bytes / bandwidth

# Phi-3 (3.8B) at INT4: ~2 GB weights
# RAM bandwidth: 70 GB/s
# Lower bound per token: 2 GB / 70 GB/s = 28 ms
# Real measured: ~30-50 ms per token

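Generating each token requires streaming every weight, plus the KV cache, from RAM. Bandwidth divided by model size therefore gives the maximum tokens per second: 70 GB/s / 2 GB ≈ 35 tokens/sec in the example above. For plain autoregressive decoding at batch=1, no clever algorithm beats this bound.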
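The same arithmetic can be written as a small helper. This is a minimal sketch with hypothetical function names; it ignores KV cache traffic and compute time, so it gives a floor on latency rather than a prediction.

```python
def min_ms_per_token(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency if every weight byte is read once."""
    return weight_gb / bandwidth_gb_s * 1000.0


def max_tokens_per_sec(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode throughput at batch=1."""
    return bandwidth_gb_s / weight_gb


# Figures from the Phi-3 example: ~2 GB of INT4 weights, 70 GB/s RAM bandwidth.
weights_gb = 2.0
bandwidth = 70.0

print(f"lower bound: {min_ms_per_token(weights_gb, bandwidth):.1f} ms/token")
print(f"upper bound: {max_tokens_per_sec(weights_gb, bandwidth):.1f} tokens/sec")
```

Measured latency sits above this floor (the ~30-50 ms quoted above) because real runs also read the KV cache, spend time on arithmetic, and rarely reach peak bandwidth.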

Quantization speeds it up linearly

FP16 weights: 7.6 GB for Phi-3 → ~108 ms/token bound
INT8:        3.8 GB → ~54 ms/token bound
INT4:        1.9 GB → ~27 ms/token bound

INT4 inference is up to 4× faster than FP16 inference simply because there is 4× less data to read. As long as the arithmetic units keep up, the link to RAM is the choke point.
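The figures in the table above follow directly from parameter count, bits per weight, and bandwidth. The sketch below reproduces them under the same assumptions (3.8B parameters, 70 GB/s); the names are hypothetical, and the calculation ignores quantization metadata such as per-block scales, which add a little to real file sizes.

```python
PARAMS = 3.8e9          # Phi-3 parameter count used in this article
BANDWIDTH_GB_S = 70.0   # assumed RAM bandwidth

BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_gb(params: float, bits: int) -> float:
    """Weight storage in GB (10^9 bytes) for a given precision."""
    return params * bits / 8 / 1e9


def bound_ms(params: float, bits: int, bandwidth_gb_s: float) -> float:
    """Bandwidth-limited lower bound on milliseconds per token."""
    return weight_gb(params, bits) / bandwidth_gb_s * 1000.0


for name, bits in BITS_PER_WEIGHT.items():
    size = weight_gb(PARAMS, bits)
    ms = bound_ms(PARAMS, bits, BANDWIDTH_GB_S)
    print(f"{name}: {size:4.1f} GB -> {ms:6.1f} ms/token bound")
```

Because the bound is proportional to bytes read, halving the bits per weight halves the bound, which is the linear speedup the heading refers to.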

Batching helps amortize

If you process 8 prompts in parallel (batch=8), each layer's weights are read once and used 8 times, so the cost per token drops to ~4 ms (30 ms / 8) instead of 30 ms. The trade-off is 8× the KV cache and 8× the activation memory. CPU inference often runs at batch=1 because RAM is tight.
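The trade-off can be sketched numerically: weight traffic is shared across the batch, while KV cache memory grows with it. The function names below are hypothetical, and the model dimensions (32 layers, 32 KV heads, head dimension 96, FP16 cache) are illustrative assumptions rather than figures from this article. The amortized cost also ignores KV cache reads and compute, so 30 ms / 8 = 3.75 ms is a best case.

```python
def amortized_ms_per_token(weight_read_ms: float, batch: int) -> float:
    """Per-token cost when one pass over the weights serves the whole batch."""
    return weight_read_ms / batch


def kv_cache_bytes(
    batch: int,
    seq_len: int,
    n_layers: int = 32,      # assumed, for illustration
    n_kv_heads: int = 32,    # assumed, for illustration
    head_dim: int = 96,      # assumed, for illustration
    bytes_per_elem: int = 2, # FP16 cache entries
) -> int:
    """KV cache size: one K and one V tensor per layer, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


for batch in (1, 8):
    ms = amortized_ms_per_token(30.0, batch)
    kv_gb = kv_cache_bytes(batch, seq_len=2048) / 1e9
    print(f"batch={batch}: {ms:5.2f} ms/token, KV cache {kv_gb:.2f} GB at 2048 tokens")
```

With these assumed dimensions, the KV cache at batch=8 is several times larger than the INT4 weights themselves, which is why memory rather than speed often decides the batch size on CPU.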

Software prefetching

llama.cpp and ONNX Runtime use prefetch instructions to bring the next layer's weights into cache while the current layer is still being processed. This hides ~30% of the bandwidth gap. In C++ the instruction is exposed as __builtin_prefetch (GCC and Clang); Python offers less control.

SLM CPU inference is bandwidth-bound: tokens/sec ≈ RAM bandwidth / model size. Quantizing to INT4 gives up to a 4× speedup over FP16.