Distilling a small model from a strong teacher is a well-understood path to cheap inference. The recipe is straightforward; the quality bar is set by data preparation. Teams that skip filtering get mediocre results; teams that filter aggressively get small models that beat their teacher on the specific task.
Generate 5x what you need
Plan for 10K-100K training examples, which means generating 50K-500K candidate completions from the teacher. The exact filter ratio depends on teacher quality; as a default, budget for 5x as many generations as you expect to keep.
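The budget arithmetic is simple enough to sketch. The helper below is illustrative only: the name `candidates_needed` and the `drop_percent` parameter are assumptions, not part of any library. It uses integer math so the result is exact.

```python
def candidates_needed(target_examples: int, drop_percent: int) -> int:
    """Teacher generations required so that about `target_examples` survive filtering.

    `drop_percent` is the share of candidates the filter is expected to discard (0-99).
    """
    if not 0 <= drop_percent < 100:
        raise ValueError("drop_percent must be between 0 and 99")
    keep_percent = 100 - drop_percent
    # Ceiling division: never under-generate.
    return -(-target_examples * 100 // keep_percent)


# Dropping 80% of candidates is the 5x budget from the text.
small_run = candidates_needed(10_000, 80)
large_run = candidates_needed(100_000, 80)
```

With an 80% drop rate, 10K and 100K target examples give the 50K and 500K candidate counts quoted above; a gentler 50% drop rate halves the target-to-candidate ratio to 2x.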
Capture chain-of-thought
Don't just capture final answers. Prompt the teacher to think step-by-step and capture the reasoning along with the answer. Train the student on (prompt, reasoning, answer) triples. On reasoning-heavy benchmarks, small models trained with explicit CoT can match much larger models that are prompted zero-shot.
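One way to capture the triples is to ask the teacher for a fixed answer marker and split on it. This is a minimal sketch under assumptions: the `Answer:` marker, the `COT_SUFFIX` wording, and the `input`/`target` record layout are illustrative choices, not a required format.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical instruction appended to each prompt sent to the teacher.
COT_SUFFIX = (
    "\n\nThink step by step, then give the final answer "
    "on a line starting with 'Answer:'."
)


@dataclass
class Triple:
    prompt: str
    reasoning: str
    answer: str


def parse_teacher_output(prompt: str, completion: str) -> Optional[Triple]:
    """Split a teacher completion into reasoning and answer at the last 'Answer:' marker."""
    reasoning, marker, answer = completion.rpartition("Answer:")
    if not marker or not reasoning.strip() or not answer.strip():
        return None  # Missing marker, reasoning, or answer: discard this completion.
    return Triple(prompt, reasoning.strip(), answer.strip())


def to_training_example(triple: Triple) -> Dict[str, str]:
    """The student learns to emit the reasoning first, then the answer."""
    return {
        "input": triple.prompt,
        "target": f"{triple.reasoning}\nAnswer: {triple.answer}",
    }
```

Completions that skip the reasoning or the marker return `None`, so they fall out before filtering even starts.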
Filter ruthlessly
Score every candidate with a reward model or a strong judge model, and prefer a hard check wherever the task allows one. For code: compile + run + assert. For math: check the answer. For extraction: validate the format. For summarization: LLM-as-judge. Drop the bottom 50-80%. What remains is the gold.
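A few of these filters can be sketched with the standard library alone. The function names below are hypothetical, the math check is a plain string comparison, and the scores passed to `drop_bottom` are assumed to come from whatever reward or judge model you use. The compile-and-run check for code is left out because it needs a sandbox.

```python
import json
from typing import Iterable, List, Set, Tuple


def math_answer_ok(candidate_answer: str, reference_answer: str) -> bool:
    """Exact match after trimming whitespace; real pipelines often normalize further."""
    return candidate_answer.strip() == reference_answer.strip()


def extraction_ok(candidate: str, required_keys: Set[str]) -> bool:
    """Format check: the output must be a JSON object containing every required key."""
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def drop_bottom(scored: Iterable[Tuple[str, float]], drop_fraction: float) -> List[str]:
    """Rank (text, score) pairs by score and discard the lowest `drop_fraction`."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    n_drop = int(len(ranked) * drop_fraction)
    return [text for text, _ in ranked[: len(ranked) - n_drop]]
```

Running the hard checks first is the cheaper order: judge-model calls are then spent only on candidates that already pass.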
Diversity through varied prompts
Same prompt 1000 times = mode collapse. Vary topic, style, length, difficulty, format. Cluster the generated examples by embedding and check that every cluster is adequately represented. Don't rely on temperature alone for diversity.
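A rough sketch of both halves: building prompts from a grid of axes, and flagging thin clusters. The axis values and template are made up for illustration. The cluster labels are assumed to come from any clustering of example embeddings, a step not shown here.

```python
import itertools
import random
from collections import Counter
from typing import Dict, List, Sequence

# Hypothetical axes; replace with values that fit your task.
AXES: Dict[str, List[str]] = {
    "topic": ["sorting", "caching", "recursion"],
    "style": ["formal", "conversational"],
    "difficulty": ["beginner", "expert"],
}
TEMPLATE = "Write a {style} explanation of {topic} for a {difficulty} reader."


def build_prompts(template: str, axes: Dict[str, List[str]], n: int, seed: int = 0) -> List[str]:
    """Sample up to `n` distinct prompts from the cross product of all axes."""
    combos = [dict(zip(axes, values)) for values in itertools.product(*axes.values())]
    random.Random(seed).shuffle(combos)
    return [template.format(**combo) for combo in combos[:n]]


def underfilled_clusters(labels: Sequence[int], n_clusters: int, min_share: float) -> List[int]:
    """Return cluster ids holding less than `min_share` of all examples."""
    counts = Counter(labels)
    floor = min_share * len(labels)
    return [c for c in range(n_clusters) if counts[c] < floor]
```

Clusters returned by `underfilled_clusters` show where to write more prompt variants before the next generation round.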
Two-pass distillation
Pass 1: train the student on the filtered teacher data. Pass 2: have the student generate completions on new prompts, have the teacher critique them, and train on the critiques. Pass 2 often gives a 5-10% quality boost.
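The data-building step of pass 2 can be sketched as a loop over new prompts. Everything here is an assumption for illustration: `student_generate` and `teacher_critique` stand in for your own model calls, and the record layout (student draft in the input, critique as the target) is one possible reading of "train on the critiques".

```python
from typing import Callable, Dict, Iterable, List


def build_pass2_data(
    new_prompts: Iterable[str],
    student_generate: Callable[[str], str],
    teacher_critique: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Collect teacher critiques of the pass-1 student's attempts on unseen prompts."""
    rows = []
    for prompt in new_prompts:
        attempt = student_generate(prompt)            # pass-1 student output
        critique = teacher_critique(prompt, attempt)  # teacher feedback on that output
        rows.append({"input": f"{prompt}\n\nDraft:\n{attempt}", "target": critique})
    return rows
```

The pass-2 prompts should be disjoint from the pass-1 training set so the critiques target what the student actually gets wrong on unseen inputs.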