Mixed Precision Training — BF16 on CPU

Float32 is the default, but it costs 4 bytes per parameter and runs all arithmetic in 32-bit. BF16 (Brain Float 16) keeps FP32's 8-bit exponent, so it covers the same range, and cuts the mantissa to 7 bits. That halves memory and can roughly double throughput on hardware with native support. Recent CPUs have it: Intel via AMX, AMD via AVX-512 BF16.


Why BF16 not FP16

FP16 (IEEE half):  5-bit exp, 10-bit mantissa  range ±65504
BF16:              8-bit exp, 7-bit mantissa   range ±3e38 (same as FP32)

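FP16's narrow exponent range makes overflow a real risk: anything above 65504 becomes inf, and some attention computations can get there. BF16 keeps FP32's range, so activations and gradients overflow or underflow only where FP32 would too. The shorter mantissa means less precision (roughly 2-3 significant decimal digits), but training is robust to it for transformer-shaped models.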
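The range-versus-precision trade can be checked without any special hardware. The sketch below emulates both formats in plain Python: `round_to_fp16` uses the standard library's half-precision `struct` format, and `round_to_bf16` keeps the top 16 bits of a float32 with round-to-nearest-even. Both helper names are made up for this illustration, and the BF16 helper assumes finite inputs.

```python
import struct

def round_to_bf16(x: float) -> float:
    """Round a finite float to the nearest BF16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round on the 16 bits being dropped
    bits &= 0xFFFF0000                    # keep sign, 8-bit exponent, 7-bit mantissa
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def round_to_fp16(x: float) -> float:
    """Round a float to the nearest FP16 value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:                 # struct refuses values beyond FP16's range
        return float("inf") if x > 0 else float("-inf")

# Range: 70000 is past FP16's maximum of 65504, but well inside BF16's range.
big_fp16 = round_to_fp16(70000.0)   # inf
big_bf16 = round_to_bf16(70000.0)   # 70144.0 (coarse, but finite)

# Underflow: 1e-8 is below FP16's smallest subnormal, but fine for BF16.
tiny_fp16 = round_to_fp16(1e-8)     # 0.0
tiny_bf16 = round_to_bf16(1e-8)     # small but nonzero

# Precision: near 1.0, FP16 steps by 2**-10 and BF16 by 2**-7.
fine_fp16 = round_to_fp16(1.001)    # 1.0009765625
fine_bf16 = round_to_bf16(1.001)    # 1.0

print(big_fp16, big_bf16, tiny_fp16, tiny_bf16, fine_fp16, fine_bf16)
```

FP16 wins on precision near 1.0 and loses at both ends of the range, which is the trade the table above summarizes.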

Mixed-precision recipe

# 1. Keep master weights in FP32
# 2. Cast to BF16 for forward pass
# 3. Compute loss in FP32 (numerical stability)
# 4. Backward in BF16
# 5. Cast gradients to FP32 to update FP32 master weights
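One way to get this recipe in PyTorch is to leave the parameters in FP32 and let `torch.autocast` pick BF16 for the matmul-heavy ops. The following is a minimal sketch of a single training step, not a full training loop; the tiny `nn.Linear` model, the random data, and the learning rate are placeholders chosen only for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# 1. Master weights: module parameters are created in FP32 and stay there.
model = nn.Linear(16, 4)
optim = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 16)   # placeholder batch
y = torch.randn(8, 4)    # placeholder targets

optim.zero_grad()

# 2. Forward pass: autocast runs the linear layer in BF16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

# 3. Loss in FP32: upcast the BF16 output before reducing.
loss = nn.functional.mse_loss(out.float(), y)

# 4./5. Backward reuses the forward dtypes; gradients arrive on the
# FP32 parameters as FP32, so the optimizer updates the master weights.
loss.backward()
optim.step()

print(out.dtype, loss.dtype, model.weight.dtype, model.weight.grad.dtype)
```

With autocast there is no separate BF16 copy of the weights to manage: the cast happens per op, and the FP32 parameters are the only persistent copy.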

Running forward and backward in BF16 is where the roughly 2× speedup and the halved activation memory come from. Master weights and optimizer state stay in FP32 because each update is tiny relative to the weight it modifies; added directly to a BF16 weight, it would often round away to nothing.
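To see why the master copy matters, here is a small emulation in plain Python. It applies the same update of 1e-3 a hundred times to a weight of 1.0, once with the weight stored in BF16 and once with an FP32 master weight. The helpers `round_to_bf16` and `round_to_fp32` are illustrative names, and the update size is an arbitrary choice for the demonstration.

```python
import struct

def round_to_fp32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_to_bf16(x: float) -> float:
    """Round a finite float to the nearest BF16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

update = 1e-3          # illustrative per-step update
w_bf16_only = 1.0      # weight stored directly in BF16
w_master = 1.0         # FP32 master weight

for _ in range(100):
    # BF16 spacing near 1.0 is 2**-7 (about 0.0078), so +0.001 rounds back to 1.0.
    w_bf16_only = round_to_bf16(w_bf16_only + update)
    # FP32 spacing near 1.0 is 2**-23, so the update accumulates.
    w_master = round_to_fp32(w_master + update)

# The BF16 copy used in the forward pass is derived from the master weight.
w_forward = round_to_bf16(w_master)

print(w_bf16_only)   # 1.0: every update was lost
print(w_master)      # approximately 1.1
print(w_forward)     # 1.1015625: the nearest BF16 value to the master weight
```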


Loss scaling for FP16

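# FP16 only; BF16 doesn't need this.
scale = 1024.0                 # static loss scale
optim.zero_grad()              # clear gradients from the previous step
loss_scaled = loss * scale     # scaling the loss scales every gradient
loss_scaled.backward()
# Unscale gradients before the optimizer sees them
for p in params:
    if p.grad is not None:     # parameters unused in this step have no grad
        p.grad /= scale
optim.step()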

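FP16's narrow range causes gradient underflow: tiny values round to 0. Loss scaling multiplies the loss, and therefore every gradient, by a constant so those values land in the representable range, then divides it back out before the update. BF16 has the same range as FP32, so it needs no loss scaling and the code is simpler.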
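The effect is easy to reproduce with the standard library's half-precision `struct` format. In this sketch a gradient of 1e-8 (an arbitrary illustrative value) is stored in FP16 with and without a scale of 1024; `round_to_fp16` is a helper name invented here.

```python
import struct

def round_to_fp16(x: float) -> float:
    """Round a small-magnitude float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8        # true gradient, below FP16's smallest subnormal (about 6e-8)
scale = 1024.0

# Without scaling, the gradient underflows to zero when stored in FP16.
unscaled_fp16 = round_to_fp16(grad)

# With scaling, the stored value is about 1e-5, which FP16 can represent.
scaled_fp16 = round_to_fp16(grad * scale)

# Unscaling happens in higher precision, recovering the gradient approximately.
recovered = scaled_fp16 / scale

print(unscaled_fp16)   # 0.0
print(recovered)       # close to 1e-8
```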

CPU support landscape

Intel Sapphire Rapids and later (AMX): native BF16 matmul, ~2-4× speedup. AMD Genoa and later: AVX-512 BF16, similar. ARM Neoverse N2/V2: BF16 matrix-multiply extension. In PyTorch, wrapping the forward pass in torch.autocast(device_type='cpu', dtype=torch.bfloat16) turns it on. Quietly transformative for CPU training and inference.
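Before enabling BF16 it can be worth checking what the CPU actually reports. The sketch below assumes a Linux host and parses the text of `/proc/cpuinfo` for the kernel's feature names `amx_bf16` and `avx512_bf16` (x86) and `bf16` (arm64). The function name `bf16_cpu_features` is hypothetical, and a missing flag only means the kernel does not report it; it is not a benchmark.

```python
def bf16_cpu_features(cpuinfo_text: str) -> set[str]:
    """Return the BF16-related feature flags found in /proc/cpuinfo-style text."""
    wanted = {"amx_bf16", "avx512_bf16", "bf16"}
    found: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 lists features under "flags", arm64 under "Features".
        if key.strip().lower() in ("flags", "features"):
            found |= wanted & set(value.split())
    return found

# Synthetic examples standing in for real /proc/cpuinfo contents.
sample_x86 = "processor\t: 0\nflags\t\t: fpu sse2 avx512f avx512_bf16 amx_tile amx_bf16\n"
sample_old = "processor\t: 0\nflags\t\t: fpu sse2 avx2\n"

print(bf16_cpu_features(sample_x86))
print(bf16_cpu_features(sample_old))

# On a real Linux machine (path assumed to exist):
try:
    with open("/proc/cpuinfo") as f:
        print(bf16_cpu_features(f.read()))
except OSError:
    pass  # not Linux, or /proc is unavailable
```

Without native support, BF16 autocast may still run but is unlikely to be faster than FP32.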

Trade-offs

BF16 sometimes ends with a slightly worse final loss than FP32 (~0.5-1% on validation). For SLMs that are being fine-tuned rather than pretrained, the gap is usually invisible. Use BF16 for fine-tuning; for pretraining, consider switching to FP32 for the last 10% of training if you need maximum quality.

BF16 = FP32 range + half the memory. AMX/AVX-512 BF16 roughly doubles CPU throughput, sometimes more. Keep master weights in FP32; run forward/backward in BF16.