Knowledge distillation, training a small model to mimic a large one, has gone from research curiosity to production technique. Done right, a 7B student can approach the teacher's quality on your specific task at roughly a tenth of the inference cost.
Teacher-student basics
Run prompts through the teacher (e.g., Claude or GPT-4). Capture outputs. Train the student on (prompt, teacher_output) pairs. Optionally, use soft labels (logits) for richer training signal.
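The capture step can be sketched as below. This is a minimal illustration, not a specific vendor integration: `fake_teacher` is a hypothetical stand-in for whatever API client you use, and the `prompt`/`completion` field names and the JSONL layout are assumptions chosen for the example.

```python
import json
from typing import Callable, Dict, Iterable, List

def make_records(prompts: Iterable[str], teacher: Callable[[str], str]) -> List[Dict[str, str]]:
    """Run each prompt through the teacher and keep (prompt, completion) pairs."""
    records = []
    for prompt in prompts:
        completion = teacher(prompt)
        if not completion.strip():
            continue  # skip empty teacher outputs
        records.append({"prompt": prompt, "completion": completion})
    return records

def write_jsonl(records: List[Dict[str, str]], path: str) -> None:
    """Write one JSON object per line, a common input format for fine-tuning tools."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def fake_teacher(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real call to your teacher model.
    return "" if not prompt else f"TEACHER OUTPUT FOR: {prompt}"

if __name__ == "__main__":
    pairs = make_records(["Classify: great product", "", "Classify: broke in a day"], fake_teacher)
    write_jsonl(pairs, "distill.jsonl")
    print(len(pairs), "pairs written")
```

Keeping capture separate from writing makes it easy to insert a filtering pass between the two.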
Soft labels vs hard labels
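Hard labels: just the output text. Soft labels: the teacher's full probability distribution over the vocabulary at each token position. Soft labels carry more signal per example and tend to train better students (~30% better is a rough figure that depends on task and metric), but they need logit or logprob access from the teacher, which hosted APIs often don't expose. Matching distributions directly also assumes the student shares the teacher's tokenizer.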
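To make the difference concrete, here is a small sketch of the two losses for a single token position, in plain Python. The soft-label loss uses the common temperature-scaled KL formulation from the distillation literature; that choice, the function names, and the default temperature of 2.0 are assumptions for illustration, not something the text above prescribes.

```python
import math
from typing import List

def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(target_index: int, student_logits: List[float]) -> float:
    """Cross-entropy against the single token the teacher actually emitted."""
    return -math.log(softmax(student_logits)[target_index])

def soft_label_loss(teacher_logits: List[float], student_logits: List[float],
                    temperature: float = 2.0) -> float:
    """KL(teacher || student) over the whole vocabulary at one position.

    Multiplying by temperature**2 keeps gradient magnitudes comparable
    across temperatures (a convention, assumed here).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

if __name__ == "__main__":
    teacher = [2.0, 1.5, -1.0]   # teacher thinks tokens 0 and 1 are both plausible
    student = [3.0, -2.0, -2.0]  # student is overconfident in token 0
    print("hard:", hard_label_loss(0, student))
    print("soft:", soft_label_loss(teacher, student))
```

In this toy case the hard-label loss is small because the student picks the right token, while the soft-label loss still penalizes it for ignoring the teacher's second choice. That extra information is the "richer signal".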
Data quantity guidance
Domain task: 5K-50K examples for a 7B student. Open-ended chat: 100K+. Quality matters more than quantity — bad teacher outputs anchor the student to bad behavior. Filter aggressively.
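What "filter aggressively" looks like depends on the task, but a first pass often combines length bounds, refusal detection, and prompt deduplication. The sketch below assumes records shaped like `{"prompt": ..., "completion": ...}`. The thresholds and the `REFUSAL_MARKERS` list are illustrative guesses to tune against your own data.

```python
from typing import Dict, List

# Illustrative markers only; extend with whatever your teacher actually emits.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "as an ai")

def looks_usable(example: Dict[str, str], min_chars: int = 20, max_chars: int = 4000) -> bool:
    """Reject completions that are too short, too long, or look like refusals."""
    text = example["completion"].strip()
    if not (min_chars <= len(text) <= max_chars):
        return False
    lowered = text.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)

def filter_examples(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first usable example for each prompt (case/whitespace-insensitive)."""
    seen = set()
    kept = []
    for example in examples:
        key = " ".join(example["prompt"].lower().split())
        if key in seen or not looks_usable(example):
            continue
        seen.add(key)
        kept.append(example)
    return kept

if __name__ == "__main__":
    raw = [
        {"prompt": "Summarize X", "completion": "X is a long enough summary of the thing."},
        {"prompt": "summarize  x", "completion": "Another sufficiently long completion here."},
        {"prompt": "Do Y", "completion": "ok"},
        {"prompt": "Do Z", "completion": "As an AI, I can't help with that request, sorry."},
    ]
    print(len(filter_examples(raw)), "of", len(raw), "kept")
```

Heuristics like these catch only the obvious failures. Checking correctness usually takes a task-specific validator or a spot-check by hand.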
Curriculum and chain-of-thought
Capture the teacher's reasoning chain, not just the final answer, and train the student to reproduce reasoning plus answer. On reasoning-heavy benchmarks, small models trained with explicit CoT can match much larger models prompted zero-shot.
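One way to set this up is to put reasoning and answer into a single training target with a fixed layout, so the final answer can be parsed back out at inference time. The `Reasoning:` / `Answer:` template below is an assumed convention for this sketch; any consistent format would do.

```python
from typing import Optional

ANSWER_MARKER = "Answer:"  # assumed delimiter, not a standard

def format_cot_target(reasoning: str, answer: str) -> str:
    """Build the text the student is trained to generate: reasoning first, answer last."""
    return f"Reasoning:\n{reasoning.strip()}\n\n{ANSWER_MARKER} {answer.strip()}"

def extract_answer(generated: str) -> Optional[str]:
    """Pull the final answer out of a generation that follows the template."""
    index = generated.rfind(ANSWER_MARKER)
    if index == -1:
        return None  # the student did not follow the template
    return generated[index + len(ANSWER_MARKER):].strip()

if __name__ == "__main__":
    target = format_cot_target(
        "The order has 3 items at $4 each, so 3 * 4 = 12.",
        "$12",
    )
    print(target)
    print(extract_answer(target))
```

Putting the answer last means the student conditions on its own reasoning before committing, and `rfind` guards against the marker appearing earlier inside the reasoning text.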
Cost picture
Capturing 50K teacher completions: ~$200-1000 in API calls. Fine-tuning a 7B model: ~$50-200 in GPU time. Hosting the 7B: ~10x cheaper than calling teacher per request. Pays back in days for any serious traffic.
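The payback claim is simple arithmetic, sketched below. The one-time cost uses the midpoints of the ranges above ($600 for data plus $125 for fine-tuning). The per-request teacher price of $0.01 and the traffic of 20,000 requests per day are made-up inputs for illustration. Substitute your own.

```python
def break_even_requests(one_time_cost: float,
                        teacher_cost_per_request: float,
                        student_cost_per_request: float) -> float:
    """Number of requests after which distillation has paid for itself."""
    saving = teacher_cost_per_request - student_cost_per_request
    if saving <= 0:
        raise ValueError("student must be cheaper per request than the teacher")
    return one_time_cost / saving

if __name__ == "__main__":
    one_time = 600.0 + 125.0        # data capture + fine-tuning, midpoints of the ranges
    teacher_price = 0.01            # assumed cost per teacher request, in dollars
    student_price = teacher_price / 10  # "~10x cheaper" hosting
    requests_per_day = 20_000       # assumed traffic

    needed = break_even_requests(one_time, teacher_price, student_price)
    print(f"break-even after {needed:,.0f} requests, "
          f"about {needed / requests_per_day:.1f} days")
```

With these inputs the break-even lands at roughly 80,000 requests, or about four days. At a tenth of that traffic it would take over a month, so the "days" figure holds only once volume is substantial.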