▶ Interactive Lab

CPU Inference Latency Breakdown

Per-token time = bandwidth-bound weight reads + compute.

Advertisement
Per-token latency lower bound = model_bytes / RAM_bandwidth.

What you're seeing

At batch=1, must read all weights per token. RAM bandwidth is the floor on inference speed.

★ KEY TAKEAWAY
CPU SLM inference is memory-bandwidth-bound. Per-token time ≈ model_size / RAM_BW.
▶ WHAT TO TRY
  • Pick a model size — see the BW-bound lower latency bound.
  • Compare DDR5 (70 GB/s) vs Apple M3 Max (200 GB/s) — proportional speedup.
  • This is why INT4 (4× smaller weights) is the standard for CPU inference.