Advertisement
One instruction does many scalar ops. AMX does whole matmul tiles.
What you're seeing
AVX-512: 16 FP32 multiply-adds per cycle. AMX: tile matmul - 1024 BF16 ops per cycle.
★ KEY TAKEAWAY
SIMD processes many values per instruction. AVX-512 = 16 FP32. AMX = 1024 BF16 (tile matmul). The whole point of modern CPU AI.
▶ WHAT TO TRY
- Switch between AVX2 / AVX-512 / AMX tile.
- AMX is the game-changer: one instruction does a 16×16 matmul (4096 MACs).