Advertisement
Tiny working set: L1 fast. Mid: L2/L3. Big: RAM — slow.
What you're seeing
L1 ~32 KB, ~1 ns. L2 ~512 KB, ~3 ns. L3 ~32 MB, ~10 ns. RAM ~80 ns. Model weights bigger than L3 → every read pays RAM cost.
★ KEY TAKEAWAY
CPU memory hierarchy: L1 (32KB, 1ns) → L2 (512KB, 3ns) → L3 (32MB, 10ns) → RAM (80ns). SLM weights ≫ L3 → every read pays RAM cost.
▶ WHAT TO TRY
- Slide Working set from 1KB to 1GB.
- Notice how data 'lives' in successively slower levels.
- This is why CPU SLM inference is memory-bandwidth-bound at batch=1.