Synthetic data (model outputs used as training data for other models) drove the small-model quality boom of 2024-2025. Done well, it's transformative; done badly, it bakes in the teacher's biases and hallucinations. The discipline is filtering.

Filter aggressively

Generate 5x what you need, then keep the top 20% by quality signal. Useful signals include a reward model, a separate critic model, an execution check for code, and an automated factuality check. The filter is the whole game.
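
The generate-then-filter step can be sketched in a few lines. This is a minimal illustration, not a reference pipeline: `filter_candidates`, `score_fn`, and `hard_check` are hypothetical names, `score_fn` stands in for whatever quality signal you trust (a reward model or critic score), and `hard_check` stands in for a pass/fail gate such as a code-execution check.

```python
from typing import Callable, List, Optional


def filter_candidates(
    candidates: List[str],
    score_fn: Callable[[str], float],
    keep_fraction: float = 0.2,
    hard_check: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Keep the best-scoring fraction of an over-generated pool.

    score_fn and hard_check are placeholders (assumptions) for a real
    quality signal and a real pass/fail gate.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Apply the pass/fail gate first so failures never compete on score.
    passed = [c for c in candidates if hard_check is None or hard_check(c)]
    if not passed:
        return []

    # Rank survivors from best to worst by the quality signal.
    ranked = sorted(passed, key=score_fn, reverse=True)

    # The budget is a fraction of the original pool: 5x generation at
    # keep_fraction=0.2 yields roughly the dataset size you wanted.
    keep = max(1, round(len(candidates) * keep_fraction))
    return ranked[:keep]


# Toy usage: string length stands in for a quality score, and the gate
# rejects empty outputs.
pool = ["a", "bbb", "", "dddd", "eeeee"]
kept = filter_candidates(pool, score_fn=len, hard_check=bool)
```

If the gate rejects most of the pool, this returns fewer examples than the budget instead of padding with low-scoring ones, which is usually the behavior you want from a filter.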

Diversity matters

The same prompt repeated 1000x gives 1000 similar examples, which is useless. Vary topic, style, length, and difficulty. Use clustering to ensure coverage. Generate at temperature 0.7-1.0 for diversity.
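
One simple way to vary prompts is to enumerate the axes explicitly and sample across the grid. The sketch below covers only prompt variation and temperature, not the clustering step. `build_prompt_specs` and the axis values are hypothetical, and drawing the temperature uniformly from 0.7-1.0 is one possible reading of the range above.

```python
import itertools
import random
from typing import Dict, List, Sequence, Union

Spec = Dict[str, Union[str, float]]


def build_prompt_specs(
    topics: Sequence[str],
    styles: Sequence[str],
    lengths: Sequence[str],
    difficulties: Sequence[str],
    n: int,
    seed: int = 0,
) -> List[Spec]:
    """Spread n generation requests across every axis combination."""
    rng = random.Random(seed)

    # Every (topic, style, length, difficulty) combination, in shuffled order.
    grid = list(itertools.product(topics, styles, lengths, difficulties))
    rng.shuffle(grid)

    specs: List[Spec] = []
    for i in range(n):
        # Cycle through the grid so no combination repeats before all
        # combinations have been used once.
        topic, style, length, difficulty = grid[i % len(grid)]
        specs.append({
            "topic": topic,
            "style": style,
            "length": length,
            "difficulty": difficulty,
            # Assumption: sample the temperature uniformly in 0.7-1.0.
            "temperature": round(rng.uniform(0.7, 1.0), 2),
        })
    return specs


specs = build_prompt_specs(
    topics=["sql", "regex"],
    styles=["tutorial", "q&a"],
    lengths=["short", "long"],
    difficulties=["easy", "hard"],
    n=16,
)
```

Each spec would then be rendered into a prompt for the teacher. A grid guarantees coverage of the axes you thought of; clustering the generated outputs is still what catches the gaps you did not.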

Distillation specifics

Capture the teacher's chain-of-thought, not just the final answer. If the teacher exposes logits, train on those soft labels too: they carry a much richer signal than hard labels, giving roughly 30% better student outcomes for the same data quantity.
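
To make the soft-label idea concrete, here is a minimal sketch of the common temperature-scaled distillation loss: the KL divergence between teacher and student distributions, multiplied by the squared temperature. The text above does not say which loss it has in mind, so treat this formulation, the function names, and the default temperature of 2.0 as assumptions. It is written in plain Python for readability; a real training loop would use a tensor library.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the usual correction that keeps gradient magnitudes
    comparable across temperatures. The default T is an arbitrary choice.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft labels
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# A student that matches the teacher has zero loss; one that ranks the
# classes in the opposite order is penalized.
matched = soft_label_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
reversed_order = soft_label_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

A hard label would only tell the student that class 0 is correct. The soft labels also say that class 1 is a much closer miss than class 2, which is the extra signal the student learns from.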

Filter > generate. Diversity through clustering. Capture CoT and soft labels. Synthetic data isn't free.