Perplexity and Evaluation Metrics

Loss is the training objective. Perplexity is the human-interpretable version. Downstream eval is what actually matters. Knowing the chain helps you set goals for CPU training and know when to stop.


Perplexity

perplexity = exp(cross_entropy_loss)

# For a held-out test set:
# loss = -ln P(token | context), averaged over tokens (natural log, so loss is in nats)
# perplexity = how many tokens the model is effectively 'choosing between' on average

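Perplexity is the geometric mean of 1/p_correct, where p_correct is the probability the model assigned to the token that actually came next. Lower is better. Rough ballparks for English text: GPT-2 ~30, GPT-3 ~20, Llama 3 8B ~6, GPT-4 ~5; exact values depend on the test set and tokenizer. Perplexity measures next-token uncertainty; it does not measure reasoning or downstream task performance.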
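To make the two definitions concrete, here is a minimal sketch showing that exp(mean cross-entropy) and the geometric mean of 1/p_correct give the same number. The function names and the probabilities are made up for illustration; in practice the loss would come from your training framework.

```python
import math

def perplexity_from_probs(p_correct):
    """Perplexity from the probabilities assigned to each correct token."""
    # Cross-entropy loss: mean negative natural-log probability.
    loss = -sum(math.log(p) for p in p_correct) / len(p_correct)
    return math.exp(loss)

def geometric_mean_inverse(p_correct):
    """Geometric mean of 1/p, computed directly as a cross-check."""
    product = 1.0
    for p in p_correct:
        product *= 1.0 / p
    return product ** (1.0 / len(p_correct))

# Hypothetical probabilities for four tokens: 1/p is 2, 4, 8, 4.
probs = [0.5, 0.25, 0.125, 0.25]
ppl = perplexity_from_probs(probs)
gm = geometric_mean_inverse(probs)
print(ppl, gm)

# A model that spreads probability uniformly over 8 tokens
# is 'choosing between' 8 tokens, so its perplexity is 8.
uniform_ppl = perplexity_from_probs([1 / 8] * 10)
print(uniform_ppl)
```

Both values for the four-token example come out to approximately 4 (the fourth root of 2 × 4 × 8 × 4 = 256), up to floating-point rounding.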

Domain perplexity vs general

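Perplexity depends on the text it is computed on. WikiText perplexity, code perplexity, and math perplexity all sit on different scales, so compare only numbers from the same domain (and the same tokenizer, since perplexity is measured per token). Cross-domain comparisons are meaningless.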
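One way to keep domains separate is to accumulate loss and token counts per domain and exponentiate each average on its own. The sketch below assumes you already have a summed token loss and a token count per document; the record layout, the function name, and the numbers are illustrative assumptions, not measurements.

```python
import math
from collections import defaultdict

def perplexity_by_domain(records):
    """records: iterable of (domain, summed_token_loss, token_count)."""
    loss_sum = defaultdict(float)
    tokens = defaultdict(int)
    for domain, summed_loss, n_tokens in records:
        loss_sum[domain] += summed_loss
        tokens[domain] += n_tokens
    # Average over tokens within each domain, then exponentiate.
    return {d: math.exp(loss_sum[d] / tokens[d]) for d in loss_sum}

# Made-up numbers for illustration only.
records = [
    ("wiki", 340.0, 100),
    ("wiki", 660.0, 200),
    ("code", 150.0, 100),
]
result = perplexity_by_domain(records)
for domain, ppl in sorted(result.items()):
    print(f"{domain}: {ppl:.2f}")
```

Averaging per token within a domain (rather than averaging per-document perplexities) keeps long and short documents weighted by how many tokens they contribute.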


Downstream benchmarks

MMLU (knowledge), GSM8K (math), HumanEval (code), and MT-Bench (chat) each test something different. For small language models (SLMs) trained from scratch on a single GPU or CPU, only the easiest benchmarks (LAMBADA, HellaSwag) give a usable signal. Big-model benchmarks such as MMLU demand more capability than these models have.
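A rough way to check whether a multiple-choice benchmark is "giving signal" is to ask whether accuracy is clearly above random guessing. The sketch below uses a normal approximation to the binomial; the function name, the z threshold of 2, and the example counts are assumptions chosen for illustration. HellaSwag has four candidate endings per item, so chance is 25%.

```python
import math

def above_chance(correct, total, num_choices, z_threshold=2.0):
    """Return (accuracy, z_score, has_signal) against random guessing."""
    accuracy = correct / total
    chance = 1.0 / num_choices
    # Standard error of accuracy if the model were guessing at random.
    std_err = math.sqrt(chance * (1.0 - chance) / total)
    z = (accuracy - chance) / std_err
    return accuracy, z, z > z_threshold

# Hypothetical results on 1000 four-choice items.
acc_a, z_a, signal_a = above_chance(correct=290, total=1000, num_choices=4)
acc_b, z_b, signal_b = above_chance(correct=260, total=1000, num_choices=4)
print(f"29% accuracy: z={z_a:.2f}, signal={signal_a}")
print(f"26% accuracy: z={z_b:.2f}, signal={signal_b}")
```

With these made-up counts, 29% clears the threshold and 26% does not, which is the practical meaning of a benchmark being too hard for a small model.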

Custom evaluation

Pick 100 examples representative of your use case. Run the model on them and grade the outputs with an LLM-as-judge or a human. This is the ONLY number that matters for production. Public benchmarks are for vendor comparison, not deployment decisions.
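The loop described above can be sketched as a small harness. Everything here is a placeholder: `generate` stands in for your model call, `grade` stands in for an LLM-as-judge or a human verdict, and the toy functions and examples exist only so the sketch runs end to end.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (prompt, reference answer)

def run_custom_eval(
    examples: List[Example],
    generate: Callable[[str], str],
    grade: Callable[[str, str, str], bool],
) -> Tuple[float, List[Tuple[str, str]]]:
    """Return the pass rate and the (prompt, output) pairs that failed."""
    passed = 0
    failures = []
    for prompt, reference in examples:
        output = generate(prompt)
        if grade(prompt, reference, output):
            passed += 1
        else:
            failures.append((prompt, output))
    return passed / len(examples), failures

# Hypothetical stand-ins; replace with a real model and a real grader.
def toy_generate(prompt: str) -> str:
    return prompt.upper()

def exact_match_grade(prompt: str, reference: str, output: str) -> bool:
    return output.strip() == reference.strip()

examples = [("abc", "ABC"), ("hello", "HELLO"), ("x", "y")]
pass_rate, failures = run_custom_eval(examples, toy_generate, exact_match_grade)
print(f"pass rate: {pass_rate:.2f}")
print("failures:", failures)
```

Returning the failures alongside the pass rate makes it easy to read the misses by hand, which is often more informative than the single number.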

Targets for CPU SLM training

125M parameters from scratch on 30B tokens: aim for WikiText perplexity ~30. 350M on 50B tokens: ~20. 1B fine-tuned: depends entirely on the task. Don't expect GPT-4 quality from CPU training, but useful capabilities are achievable.
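Since training logs usually report loss rather than perplexity, it helps to convert a perplexity target into a loss target: ln(30) ≈ 3.40 and ln(20) ≈ 3.00. The sketch below also shows one possible stopping rule (target reached, or no meaningful improvement over the last few evaluations). The function names, `patience`, and `min_delta` values are illustrative assumptions, not a recommended recipe.

```python
import math

def target_loss(target_perplexity):
    """Validation loss (in nats) that corresponds to a perplexity target."""
    return math.log(target_perplexity)

def should_stop(val_losses, target_perplexity, patience=3, min_delta=0.01):
    """Stop when the target is reached or validation loss has plateaued."""
    if not val_losses:
        return False
    if val_losses[-1] <= target_loss(target_perplexity):
        return True  # target perplexity reached
    if len(val_losses) <= patience:
        return False  # not enough history to judge a plateau
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    # Plateau: the recent evaluations did not beat the earlier best by min_delta.
    return best_recent > best_before - min_delta

print(f"perplexity 30 -> loss {target_loss(30):.3f}")
print(f"perplexity 20 -> loss {target_loss(20):.3f}")

# Hypothetical validation-loss histories.
reached = should_stop([4.0, 3.6, 3.39], target_perplexity=30)
still_improving = should_stop([4.0, 3.8, 3.6], target_perplexity=30)
plateaued = should_stop([4.0, 3.8, 3.8, 3.8, 3.8], target_perplexity=30)
print(reached, still_improving, plateaued)
```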

Perplexity = exp(loss). Compare same-domain. Downstream benchmarks > perplexity for real signal.