Eager-mode PyTorch on CPU is reasonably fast but leaves performance on the table. Specialized inference engines can beat it by roughly 2-5×, depending on model and hardware, through fused kernels, quantization, and pipeline optimization. Pick the right one for your hardware and use case.

llama.cpp — the open default

C++ inference engine for LLaMA-style models. Uses the GGUF model format. Supports AVX/AVX2/AVX-512/AMX on x86, NEON/SVE on ARM, and Metal on Apple hardware. Quantized weights from roughly 2 to 8 bits, with 4-, 5-, and 8-bit variants the common choices. Actively developed. Used by Ollama and LM Studio. Best out-of-the-box CPU inference for popular SLMs.
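To make the quantization idea concrete, here is a toy sketch of symmetric 4-bit block quantization in plain Python. It is an illustration of the general technique only: the block size of 32, the 16-bit scale, and every function name are assumptions for this example, and the layout does not match any actual GGUF quantization type.

```python
# Toy symmetric 4-bit block quantization. Illustrative only: this is NOT
# the GGUF on-disk layout or llama.cpp's kernel code.

BLOCK_SIZE = 32  # assumed block size for this sketch


def quantize_block(values):
    """Map floats to signed 4-bit integers in [-8, 7] plus one float scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from one quantized block."""
    return [q * scale for q in quants]


def quantize(weights, block_size=BLOCK_SIZE):
    """Split a flat weight list into blocks and quantize each independently."""
    blocks = []
    for start in range(0, len(weights), block_size):
        blocks.append(quantize_block(weights[start:start + block_size]))
    return blocks


def dequantize(blocks):
    """Concatenate the dequantized blocks back into one flat list."""
    out = []
    for scale, quants in blocks:
        out.extend(dequantize_block(scale, quants))
    return out


def bits_per_weight(block_size=BLOCK_SIZE, scale_bits=16):
    """Storage cost: 4 bits per value plus one scale shared by the block."""
    return 4 + scale_bits / block_size
```

Because each block carries its own scale, one outlier weight only degrades precision inside its own block, and the rounding error per weight stays within half a scale step.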

ONNX Runtime

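Microsoft's general-purpose ML inference engine. Models are exported from PyTorch to the ONNX format. The default CPU execution provider uses Microsoft's own MLAS kernels; oneDNN is available as a separate execution provider in builds that include it. Strong AMX support. Used in Phi-3's first-party Microsoft inference path. Good for: custom models, integration with .NET/Java/Python pipelines.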
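ONNX Runtime picks its kernels from an ordered list of execution providers passed when a session is created. The sketch below shows one way to choose that list, assuming the `onnxruntime` Python package is installed. The helper names `choose_providers` and `make_session` are hypothetical, and `DnnlExecutionProvider` only shows up in builds compiled with oneDNN support.

```python
def choose_providers(available,
                     preferred=("DnnlExecutionProvider", "CPUExecutionProvider")):
    """Return the preferred providers that this build offers, in priority order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(f"none of {preferred} available; got {available}")
    return chosen


def make_session(model_path):
    """Create a CPU inference session for an exported .onnx file."""
    # Imported lazily so choose_providers() has no third-party dependency.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

A session built this way is then run with `session.run(None, {input_name: array})`, where the input names come from `session.get_inputs()`.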

oneDNN (Intel)

Low-level kernel library. Sits underneath PyTorch's MKLDNN backend, OpenVINO's CPU plugin, and ONNX Runtime's oneDNN execution provider. Hand-tuned matmul, attention, and layer norm kernels for Intel CPUs. Direct use is only for advanced kernel writers; most people reach it through PyTorch or ONNX Runtime.

vLLM CPU backend

vLLM (the GPU inference workhorse) added a CPU backend in 2024. PagedAttention works on CPU too. Better batching than llama.cpp when there are many concurrent requests. The right choice when you serve many users from one CPU box.
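The idea behind PagedAttention is to store each sequence's KV cache in fixed-size blocks taken from a shared pool, so many concurrent requests can share memory without large contiguous reservations. The following is a toy allocator that illustrates that bookkeeping only. The class name, method names, and the block size of 16 are assumptions for this sketch, and it is not vLLM's implementation.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve space for one more token; return (block_id, offset in block)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

A sequence wastes at most one partially filled block, and a finished request frees its blocks immediately for whichever request needs them next, which is what makes dense batching of many users practical.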

Apple MLX

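Apple's array framework for Apple Silicon, with a NumPy-like core API and PyTorch-like neural-network modules. Uses unified memory, Metal, and the AMX coprocessor. Often the fastest option on M-series Macs. Llama 3 8B runs at roughly 30 tokens/sec on an M3 Max; the exact figure depends on quantization and context length.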
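A quick sanity check for any tokens/sec figure is the memory-bandwidth bound: single-stream decoding reads every weight once per generated token, so throughput cannot exceed bandwidth divided by model size. The sketch below works through that arithmetic. The 400 GB/s bandwidth is an assumed value for a higher-end M3 Max configuration, the bit widths are illustrative, and the estimate ignores KV-cache traffic and compute time, so real throughput will be lower.

```python
def model_bytes(num_params, bits_per_weight):
    """Approximate weight storage in bytes."""
    return num_params * bits_per_weight / 8


def decode_tokens_per_sec_upper_bound(num_params, bits_per_weight, bandwidth_gb_s):
    """Bandwidth-bound ceiling: every weight is read once per generated token."""
    return bandwidth_gb_s * 1e9 / model_bytes(num_params, bits_per_weight)


if __name__ == "__main__":
    PARAMS = 8e9            # Llama 3 8B, rounded
    BANDWIDTH_GB_S = 400.0  # assumed memory bandwidth, not a measured value
    for bits in (16, 8, 4.5):
        bound = decode_tokens_per_sec_upper_bound(PARAMS, bits, BANDWIDTH_GB_S)
        print(f"{bits:>4} bits/weight: at most {bound:.1f} tokens/sec")
```

The same function explains why quantization helps decoding speed so much on any CPU: halving the bits per weight roughly doubles the ceiling.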

In short: llama.cpp for desktop, ONNX Runtime for cross-platform deployment, vLLM CPU for multi-user serving, MLX for Apple Silicon.
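The rule of thumb above can be written down as a small decision function. This is only a sketch: the function name, the flag names, and the order in which the conditions are checked are assumptions for illustration, and real choices often depend on model support and existing tooling.

```python
def pick_engine(apple_silicon=False, multi_user=False, cross_platform=False):
    """Map a deployment profile to a starting-point engine (assumed priority order)."""
    if apple_silicon:
        return "MLX"
    if multi_user:
        return "vLLM CPU"
    if cross_platform:
        return "ONNX Runtime"
    return "llama.cpp"  # default for single-user desktop inference
```

For example, `pick_engine(multi_user=True)` returns `"vLLM CPU"`, while calling it with no arguments falls through to `"llama.cpp"`.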