CPU training only works if everything fits in RAM. A 350M-parameter model in FP32 with AdamW and activations needs ~6-8 GB at small batch sizes; a 1B model needs ~24+ GB. Knowing the math tells you what hardware you need.
Per-parameter cost
FP32 weights: 4 bytes
FP32 gradients: 4 bytes
AdamW first moment: 4 bytes
AdamW second moment: 4 bytes
=== total per param: 16 bytes
Static state alone takes four times the memory of the weights (16 bytes per parameter instead of 4). Activations (next section) and framework overhead come on top. For 350M params: 350M × 16 bytes = 5.6 GB of static state at FP32.
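The per-parameter accounting above can be turned into a small calculator. This is a minimal sketch: the names `BYTES_PER_PARAM` and `static_state_gb` are made up for illustration, and it assumes decimal gigabytes (1 GB = 1e9 bytes), which is the convention the 5.6 GB figure implies.

```python
# Bytes per parameter for FP32 training with AdamW, as listed above.
BYTES_PER_PARAM = {
    "weights": 4,
    "gradients": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def static_state_gb(n_params: int) -> float:
    """Static training state in decimal GB (assumption: 1 GB = 1e9 bytes)."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


for name, n_params in [("350M", 350_000_000), ("1B", 1_000_000_000)]:
    print(f"{name}: {static_state_gb(n_params):.1f} GB static state")
```

Activations and framework overhead are not included here, so the result is a lower bound on what training actually needs.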
Activation memory
# Per forward pass, per token:
# Roughly: 20 * d_model * L activation values
# For d=1024, L=16, seq=1024:
# 20 * 1024 * 16 * 1024 = ~336M values per batch element
# At 4 bytes each (FP32): ~1.3 GB per batch element
Activation memory scales with batch_size × seq_len × d × L. With batch=8 and seq=1024 on a 350M model: ~10.7 GB of FP32 activations from the forward pass, all stored for the backward pass unless you use gradient checkpointing.
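To make the units explicit, here is a rough estimator built on the rule of thumb above (about 20 × d_model × L activation values per token). The function name and parameters are hypothetical, the factor of 20 is the text's approximation rather than a measured constant, and the sketch assumes 4 bytes per FP32 value and decimal gigabytes.

```python
def activation_gb(d_model, n_layers, seq_len, batch_size,
                  bytes_per_value=4, values_per_layer_factor=20):
    """Estimate stored activation memory in decimal GB.

    values_per_layer_factor is the rough "20" from the rule of thumb;
    bytes_per_value is 4 for FP32 and 2 for BF16.
    """
    values = (values_per_layer_factor * d_model * n_layers
              * seq_len * batch_size)
    return values * bytes_per_value / 1e9


per_element = activation_gb(1024, 16, 1024, batch_size=1)
batch_of_8 = activation_gb(1024, 16, 1024, batch_size=8)
print(f"{per_element:.2f} GB per batch element, {batch_of_8:.2f} GB at batch=8")
```

Passing `bytes_per_value=2` gives the BF16 figure for the same configuration.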
Gradient checkpointing
# Recompute activations during backward instead of storing all:
# memory: O(sqrt(L) * batch * seq * d)
# compute: one extra forward pass (~33% more total compute)
Trade compute for memory. Stored activations drop from L layers' worth to about sqrt(L) layers' worth (e.g., 4 instead of 16), which saves ~70% of activation memory for ~33% extra compute. In PyTorch this is torch.utils.checkpoint. Essential for CPU training of bigger models.
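The ~70% figure can be sanity-checked with a simple model of sqrt(L) checkpointing. This sketch makes several assumptions that are not in the text: layers are split into about sqrt(L) equal segments, each segment boundary keeps one d_model-wide tensor per token, and only one segment's full activations are live at a time during the backward pass. The function names are illustrative.

```python
import math


def full_activation_values(d_model, n_layers, tokens, factor=20):
    """Values stored when every layer keeps its activations."""
    return factor * d_model * n_layers * tokens


def checkpointed_activation_values(d_model, n_layers, tokens, factor=20):
    """Peak values stored with ~sqrt(L) checkpoint segments (assumed model)."""
    segments = max(1, round(math.sqrt(n_layers)))
    layers_per_segment = math.ceil(n_layers / segments)
    # One boundary tensor (d_model values per token) is kept per segment.
    boundaries = segments * d_model * tokens
    # During backward, one segment at a time is recomputed and held in full.
    live_segment = factor * d_model * layers_per_segment * tokens
    return boundaries + live_segment


tokens = 1024  # batch=1, seq=1024
full = full_activation_values(1024, 16, tokens)
ckpt = checkpointed_activation_values(1024, 16, tokens)
saved = 1 - ckpt / full

# Backward costs roughly 2x forward, so one extra forward is ~1/3 more work.
extra_compute = 1 / (1 + 2)
print(f"saved {saved:.0%} of activation memory for {extra_compute:.0%} extra compute")
```

In PyTorch, the same idea corresponds to wrapping each segment's forward call in `torch.utils.checkpoint.checkpoint`, or passing a sequential stack to `torch.utils.checkpoint.checkpoint_sequential` with the desired number of segments.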
BF16 cuts activation memory in half
# With BF16 mixed precision:
weights: 4 bytes (FP32 master) + 2 bytes (BF16 copy)
gradients: 2 bytes (BF16)
optimizer: 8 bytes (FP32 m, v)
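activations: 2 bytes (BF16)
Mixed precision halves activation memory and can roughly double throughput on CPUs with AMX or AVX-512-BF16. Master weights stay in FP32 to preserve optimizer precision. Static state still totals 16 bytes per parameter (6 + 2 + 8), so the ~30-40% overall savings come from the activations.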
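A quick comparison shows where the overall savings come from. This is a sketch under stated assumptions: the two layouts copy the byte counts listed above, the activation count reuses the earlier rule of thumb (20 × d × L values per token, with d=1024, L=16, seq=1024, batch=8), 350M parameters is taken as the model size, and all names are illustrative.

```python
# Bytes per parameter (weights, gradients, optimizer) and per activation value.
FP32 = {"weights": 4, "gradients": 4, "optimizer": 8, "activation": 4}
BF16_MIX = {"weights": 4 + 2, "gradients": 2, "optimizer": 8, "activation": 2}


def static_bytes_per_param(layout):
    return layout["weights"] + layout["gradients"] + layout["optimizer"]


def total_gb(n_params, activation_values, layout):
    """Static state plus stored activations, in decimal GB."""
    static = n_params * static_bytes_per_param(layout)
    activations = activation_values * layout["activation"]
    return (static + activations) / 1e9


n_params = 350_000_000
activation_values = 20 * 1024 * 16 * 1024 * 8  # rule-of-thumb count at batch=8

fp32_gb = total_gb(n_params, activation_values, FP32)
bf16_gb = total_gb(n_params, activation_values, BF16_MIX)
saving = 1 - bf16_gb / fp32_gb
print(f"FP32: {fp32_gb:.1f} GB, BF16 mix: {bf16_gb:.1f} GB, saving {saving:.0%}")
```

Because the static state is 16 bytes per parameter in both layouts, the saving grows with batch size and sequence length and shrinks when activations are a small share of the total.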
Practical CPU training table
Model size | RAM (FP32 + AdamW + acts) | RAM (BF16 mix)
---------- | ------------------------ | --------------
50M        | 2 GB                     | 1.5 GB
125M       | 5 GB                     | 3 GB
350M       | 12 GB                    | 7 GB
1.3B       | 40 GB                    | 24 GB
3B         | 90 GB                    | 55 GB
Practical CPU training ceiling: 1-3B parameters on 64-128 GB workstations. Above this, you need a GPU. Inference (no gradients or optimizer state) tolerates much larger models.
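The table can also be read in reverse: given a machine's RAM, which is the largest listed model that fits? The sketch below just looks up the table's own figures. The function name and the `headroom_gb` parameter (RAM reserved for the OS and other processes) are hypothetical additions.

```python
# (model size, FP32 GB, BF16-mix GB), copied from the table above.
TABLE = [
    ("50M", 2, 1.5),
    ("125M", 5, 3),
    ("350M", 12, 7),
    ("1.3B", 40, 24),
    ("3B", 90, 55),
]


def largest_model_that_fits(ram_gb, bf16=False, headroom_gb=0.0):
    """Return the largest table entry whose estimate fits, or None."""
    budget = ram_gb - headroom_gb
    best = None
    for label, fp32_gb, bf16_gb in TABLE:
        needed = bf16_gb if bf16 else fp32_gb
        if needed <= budget:
            best = label
    return best


for ram in (16, 64, 128):
    print(f"{ram} GB: FP32 -> {largest_model_that_fits(ram)}, "
          f"BF16 mix -> {largest_model_that_fits(ram, bf16=True)}")
```

The table values are rough estimates, so treat the result as a starting point and leave headroom rather than sizing a machine to the exact figure.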