End-to-End CPU SLM Recipe — Belgavi.AI Lab

Putting it all together: train and serve a small (50M-350M) language model entirely on a CPU workstation. Practical and educational, even if you wouldn't deploy this at scale. Here's the full path.

Advertisement

Hardware target

# Reasonable workstation:
# - 32 GB RAM
# - 16-core CPU with AVX-512 (Intel 12th gen+ or AMD 7000+)
# - 1 TB NVMe SSD
# - No GPU (the point)

# What it can train:
# - 50-125M params from scratch (small but works)
# - 350M with grad checkpointing + BF16
#
# What it can run inference on:
# - Up to 7B Q4 quantized (~4 GB RAM)

Modest hardware. CPU training is feasible for SLMs you can practice the full lifecycle on. Inference scales well past training size with quantization.

Training stack

# PyTorch with CPU backend
# - torch.compile for kernel fusion
# - torch.autocast(device='cpu', dtype=torch.bfloat16) for mixed precision
# - AdamW optimizer
# - cosine schedule with warmup
# - gradient checkpointing
#
# Data: streaming from disk via dataloader workers
# Tokenizer: tiktoken or SentencePiece pretrained

Don't roll your own framework. Use PyTorch with proper BF16 and OpenBLAS/MKL. Stream data; don't load all in RAM. Save checkpoints every N steps to disk.

Advertisement

Reference budget — 125M params

# Train tinyllama-style 125M on Wikipedia + Stories
# - batch_micro = 1, accumulation = 32, effective batch 32
# - seq = 1024
# - 3e-4 peak LR, 200-step warmup
# - 100K steps ~ 3 days on 16-core CPU
# - Final perplexity: ~25 on WikiText-103

Won't be GPT-4; will produce coherent paragraph-length text. Educational value high. Inference for prompting: 10-30 tokens/sec at INT4 on the same hardware.

Inference stack

# Convert PyTorch weights to GGUF:
#   python convert-hf-to-gguf.py model_dir/
# Quantize:
#   ./llama-quantize model.gguf model-q4_k_m.gguf q4_k_m
# Serve:
#   ./llama-server -m model-q4_k_m.gguf -ngl 0 -c 4096

llama.cpp is the production-quality CPU inference path. Convert + quantize + serve in three commands. OpenAI-compatible API exposed. Good enough for prototypes, personal assistants, prototypes.

Beyond — where GPUs become necessary

Above ~350M params, CPU training time becomes weeks. Above 7B for inference (even Q4), CPU is too slow for interactive use. The crossover: rent a GPU hour ($1-5) when CPU training stalls. CPU is for: small-model lifecycle, edge inference, learning. GPU for: scale.

125M from scratch + INT4 inference on a 32GB workstation. Full lifecycle on CPU. GPU only when CPU stalls.