Distillation for Production Inference

Knowledge distillation, which trains a small model to mimic a large one, has gone from research curiosity to production technique. Done right, a 7B student can approach the quality of a teacher such as GPT-4 on your specific task, at an order of magnitude or more lower inference cost.

Teacher-student basics

Run prompts through the teacher (e.g., Claude or GPT-4). Capture outputs. Train the student on (prompt, teacher_output) pairs. Optionally, use soft labels (logits) for richer training signal.
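
The capture step above can be sketched as a small loop that writes one JSON line per (prompt, teacher_output) pair. This is a minimal sketch, not a definitive pipeline: `call_teacher` is a hypothetical callable you would supply as a wrapper around whichever teacher API you use, and the field names and JSONL layout are assumptions chosen for illustration.

```python
import json
from typing import Callable, Iterable


def capture_pairs(
    prompts: Iterable[str],
    call_teacher: Callable[[str], str],  # hypothetical wrapper around the teacher API
    out_path: str,
) -> int:
    """Write one JSON object per line: {"prompt": ..., "teacher_output": ...}.

    Returns the number of pairs written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            output = call_teacher(prompt)
            record = {"prompt": prompt, "teacher_output": output}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

In practice the wrapper is also where retries, rate limiting, and caching would live, so that a failed run does not pay for the same completions twice.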

Soft labels vs hard labels

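Hard labels: just the teacher's output text. Soft labels: the teacher's full probability distribution over tokens at each position. Soft labels can yield noticeably better students (on the order of 30%, depending on task and metric), but they require the teacher API to expose logits, which is often unavailable.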
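
To make the difference concrete, here is a minimal sketch of the two losses for a single token position, in plain Python. It assumes you have raw logits from both models over the same vocabulary. The temperature-softened KL divergence scaled by T^2 is the common formulation from the distillation literature, not something this article prescribes, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]


def soft_label_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 2.0,  # assumed value; tune per task
) -> float:
    """KL(teacher || student) on temperature-softened distributions, times T^2."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


def hard_label_loss(target_index: int, student_logits: Sequence[float]) -> float:
    """Cross-entropy against the single token the teacher actually emitted."""
    return -log_softmax(student_logits)[target_index]
```

The hard-label loss only sees which token the teacher picked. The soft-label loss also penalizes the student for misjudging how plausible the teacher found every other token, which is where the extra training signal comes from.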

Data quantity guidance

Domain-specific task: 5K-50K examples for a 7B student. Open-ended chat: 100K+. Quality matters more than quantity: bad teacher outputs anchor the student to bad behavior, so filter aggressively.
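
What "filter aggressively" means depends on the task, but a first pass often looks like the sketch below. The length bounds and refusal markers are hypothetical heuristics chosen for illustration, and the record layout (`prompt`, `teacher_output`) is an assumption. Real pipelines usually add task-specific checks such as schema validation or answer verification.

```python
from typing import Dict, Iterable, List

# Hypothetical markers of unhelpful teacher outputs; adjust for your teacher and task.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "as an ai")


def filter_pairs(
    pairs: Iterable[Dict[str, str]],
    min_chars: int = 20,    # assumed lower bound on useful output length
    max_chars: int = 8000,  # assumed upper bound to drop runaway generations
) -> List[Dict[str, str]]:
    """Drop outputs that are too short, too long, refusals, or exact duplicates."""
    seen = set()
    kept = []
    for pair in pairs:
        output = pair["teacher_output"].strip()
        if not (min_chars <= len(output) <= max_chars):
            continue
        lowered = output.lower()
        if any(marker in lowered for marker in REFUSAL_MARKERS):
            continue
        key = (pair["prompt"].strip(), output)
        if key in seen:
            continue
        seen.add(key)
        kept.append({"prompt": pair["prompt"], "teacher_output": output})
    return kept
```

Logging how many examples each rule removes is worthwhile: a rule that discards most of the data usually points to a prompt or teacher problem rather than a filtering success.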

Curriculum and chain-of-thought

Capture the teacher's reasoning chain, not just the final answer, and train the student to reproduce reasoning plus answer. Small models trained on explicit CoT can match much larger zero-shot models on some benchmarks.
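
One way to lay out a reasoning-plus-answer training target is sketched below. The "Reasoning:" and "Answer:" markers and the `prompt`/`completion` field names are assumptions made for illustration. Any consistent template works, as long as the final answer can be parsed back out at inference time.

```python
from typing import Dict


def build_cot_example(prompt: str, reasoning: str, answer: str) -> Dict[str, str]:
    """Format a training example whose target is the reasoning followed by the answer."""
    target = f"Reasoning:\n{reasoning.strip()}\n\nAnswer: {answer.strip()}"
    return {"prompt": prompt, "completion": target}


def extract_answer(completion: str) -> str:
    """Recover the final answer from a completion that follows the template above."""
    marker = "Answer:"
    idx = completion.rfind(marker)
    if idx == -1:
        # No marker found: fall back to the whole completion.
        return completion.strip()
    return completion[idx + len(marker):].strip()
```

Keeping the answer on a clearly marked final line means the serving layer can strip the reasoning before returning a response, if the reasoning is not meant to be shown to users.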

Cost picture

Capturing 50K teacher completions: ~$200-1000 in API calls. Fine-tuning a 7B model: ~$50-200 in GPU time. Hosting the 7B: ~10x cheaper than calling teacher per request. Pays back in days for any serious traffic.
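
The payback claim is simple arithmetic, sketched here as a break-even calculation. The upfront cost and the 10x hosting ratio come from the figures above. The per-request teacher price and the daily request volume are hypothetical inputs, so substitute your own numbers.

```python
def break_even_days(
    upfront_cost: float,
    teacher_cost_per_request: float,
    requests_per_day: float,
    hosting_cost_ratio: float = 0.1,  # student assumed ~10x cheaper per request
) -> float:
    """Days until per-request savings cover the one-time distillation cost."""
    savings_per_request = teacher_cost_per_request * (1.0 - hosting_cost_ratio)
    daily_savings = savings_per_request * requests_per_day
    if daily_savings <= 0:
        raise ValueError("No savings: the student is not cheaper than the teacher.")
    return upfront_cost / daily_savings


# Worst case from the ranges above: $1000 capture + $200 fine-tuning.
# Assumed (hypothetical): $0.01 per teacher request, 50,000 requests per day.
days = break_even_days(1200.0, 0.01, 50_000)
print(f"Break-even after about {days:.1f} days")  # about 2.7 days
```

At low traffic the picture changes: with the same assumed price and only 500 requests per day, the same calculation gives roughly 267 days, which is why the payback claim is tied to serious traffic.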

In short: capture teacher outputs (with CoT and, where available, soft labels), filter them, and fine-tune the student. For production traffic, the payback period is usually a matter of days.