Modern CPUs sit on top of a memory hierarchy: L1 cache (~32 KB, ~1 ns), L2 (~512 KB, ~3 ns), L3 (~32 MB, ~10 ns), then RAM (~64 GB, ~80 ns). A 350M Q4 model is about 200 MB, too big for any cache level. Inference is bottlenecked by reading weights from RAM.
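A minimal sketch of the "does it fit in cache?" check, using the approximate capacities and latencies quoted above. The table and the function name `smallest_level_that_fits` are illustrative assumptions, not measurements of any particular CPU.

```python
# Approximate capacities (bytes) and latencies (ns) from the text; illustrative only.
MEMORY_LEVELS = [
    ("L1", 32 * 1024, 1),
    ("L2", 512 * 1024, 3),
    ("L3", 32 * 1024**2, 10),
    ("RAM", 64 * 1024**3, 80),
]


def smallest_level_that_fits(working_set_bytes):
    """Return (name, latency_ns) of the first level big enough, or None."""
    for name, capacity_bytes, latency_ns in MEMORY_LEVELS:
        if working_set_bytes <= capacity_bytes:
            return name, latency_ns
    return None


model_bytes = 200 * 1000**2  # the ~200 MB Q4 model mentioned in the text
print(smallest_level_that_fits(model_bytes))
```

Any working set larger than L3 lands in RAM, which is why the weights of even a small model are streamed from main memory on every token.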
Bandwidth, not FLOPs
A modern CPU delivers roughly 100-500 GFLOPS of compute but only 50-100 GB/s of memory bandwidth. When a matmul reuses each weight many times (training, large batches), FLOPs dominate. When each weight is read once per token (inference at batch=1), bandwidth dominates. SLM inference is bandwidth-bound at batch=1.
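The comparison can be sketched by estimating compute time and memory time separately and seeing which is larger. This is a rough model under stated assumptions: 300 GFLOPS and 70 GB/s are picked from inside the quoted ranges, the model is Phi-3-sized with FP16 weights, each weight costs one multiply and one add per batch row, and activation traffic is ignored. The function names are hypothetical.

```python
def matmul_cost(n_params, bytes_per_weight, batch):
    """FLOPs and weight bytes for one pass over all weights.

    Each weight does one multiply and one add per batch row; the weights
    themselves are read once regardless of batch size.
    """
    flops = 2 * n_params * batch
    weight_bytes = n_params * bytes_per_weight
    return flops, weight_bytes


def limiting_resource(flops, bytes_moved, peak_flops, bandwidth):
    """Return which resource takes longer, plus both time estimates (seconds)."""
    compute_s = flops / peak_flops
    memory_s = bytes_moved / bandwidth
    kind = "compute" if compute_s > memory_s else "bandwidth"
    return kind, compute_s, memory_s


PEAK_FLOPS = 300e9  # assumed, inside the 100-500 GFLOPS range
BANDWIDTH = 70e9    # assumed, inside the 50-100 GB/s range
N_PARAMS = 3.8e9    # Phi-3-sized model, FP16 (2 bytes per weight)

for batch in (1, 64):
    flops, nbytes = matmul_cost(N_PARAMS, 2, batch)
    kind, c, m = limiting_resource(flops, nbytes, PEAK_FLOPS, BANDWIDTH)
    print(f"batch={batch}: compute {c * 1e3:.0f} ms, memory {m * 1e3:.0f} ms -> {kind}-bound")
```

Under these assumptions batch=1 is limited by memory time, while a batch of 64 reuses each weight enough for compute time to take over.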
Inference cost = weight bytes / bandwidth
# Phi-3 (3.8B) at INT4: ~2 GB weights
# RAM bandwidth: 70 GB/s
# Lower bound per token: 2 GB / 70 GB/s = 28 ms
# Real measured: ~30-50 ms per token
Each token generation must read the entire model from RAM (weights + KV cache). Bandwidth divided by model size = max tokens/sec. No clever algorithm beats this for batch=1.
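The bound above is a single division, sketched here so the KV cache term can be added when it matters. The helper names and the 0.5 GB KV cache figure are illustrative assumptions; the 2 GB and 70 GB/s values come from the example above.

```python
GB = 1e9


def token_latency_bound_ms(weight_bytes, bandwidth_bytes_per_s, kv_cache_bytes=0):
    """Lower bound on ms per token: every byte must cross the memory bus once."""
    bytes_per_token = weight_bytes + kv_cache_bytes
    return bytes_per_token / bandwidth_bytes_per_s * 1e3


def max_tokens_per_second(weight_bytes, bandwidth_bytes_per_s, kv_cache_bytes=0):
    """Upper bound on throughput at batch=1."""
    return 1e3 / token_latency_bound_ms(weight_bytes, bandwidth_bytes_per_s, kv_cache_bytes)


# Weights only, as in the worked example above.
latency = token_latency_bound_ms(2 * GB, 70 * GB)
print(f"{latency:.1f} ms/token, at most {max_tokens_per_second(2 * GB, 70 * GB):.0f} tokens/s")

# Hypothetical 0.5 GB KV cache read on top of the weights.
print(f"{token_latency_bound_ms(2 * GB, 70 * GB, 0.5 * GB):.1f} ms/token with KV cache")
```

Measured numbers sit above this bound because real runs rarely sustain peak bandwidth.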
Quantization speeds it up linearly
FP16 weights: 7.6 GB for Phi-3 → ~108 ms/token bound
INT8: 3.8 GB → ~54 ms/token bound
INT4: 1.9 GB → ~27 ms/token bound
INT4 inference has a 4× lower latency bound than FP16 simply because there is 4× less data to read. The arithmetic units are still fast enough; the wire to RAM is the choke point.
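The three figures above follow from one formula applied at different bit widths, as this short sketch shows. It assumes exactly 16, 8, and 4 bits per weight, 3.8B parameters, and 70 GB/s; real INT4 formats also store per-block scales, so their effective size is a little above 4 bits per weight.

```python
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}  # idealized, no scale overhead


def bound_ms(n_params, bits, bandwidth_bytes_per_s):
    """Bandwidth-limited ms per token for a model stored at `bits` per weight."""
    weight_bytes = n_params * bits / 8
    return weight_bytes / bandwidth_bytes_per_s * 1e3


rows = {name: bound_ms(3.8e9, bits, 70e9) for name, bits in BITS_PER_WEIGHT.items()}
for name, ms in rows.items():
    print(f"{name}: {ms:.0f} ms/token bound, {rows['FP16'] / ms:.0f}x vs FP16")
```

Because the bound is proportional to bytes per weight, halving the bit width halves the bound.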
Batching helps amortize
If you process 8 prompts in parallel (batch=8), weights are read once per layer and used 8 times. The weight-read cost per token drops from ~30 ms to ~4 ms (30 / 8). The price: 8× the KV cache and 8× the activation memory. CPU inference often runs at batch=1 because RAM is tight.
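A small sketch of the trade-off: the weight-read time is shared across the batch while KV cache memory grows with it. The 30 ms figure is the batch=1 cost from above; the 0.4 GB per-sequence KV cache is a hypothetical value, and the model ignores KV cache read traffic and the point where compute becomes the limit.

```python
def batched_cost(weight_read_ms, kv_cache_bytes_per_seq, batch):
    """Ideal amortization: one weight read is shared by `batch` sequences."""
    per_token_ms = weight_read_ms / batch
    total_kv_bytes = kv_cache_bytes_per_seq * batch
    return per_token_ms, total_kv_bytes


KV_PER_SEQ = 0.4e9  # hypothetical KV cache size for one sequence

for batch in (1, 8):
    ms, kv = batched_cost(30.0, KV_PER_SEQ, batch)
    print(f"batch={batch}: {ms:.2f} ms/token, KV cache {kv / 1e9:.1f} GB")
```

Under ideal amortization, batch=8 gives 3.75 ms per token while the KV cache grows to eight times its single-sequence size.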
Software prefetching
llama.cpp and ONNX Runtime use prefetch instructions to bring the next layer's weights into cache while the current layer is still being processed. This hides ~30% of the bandwidth gap. In C and C++ the GCC/Clang builtin is __builtin_prefetch; Python offers far less control.