Public benchmarks (MMLU, MT-Bench, HumanEval) are useful for comparing vendors and almost useless for picking a model for your task. Once a model is a serious candidate for production, your task-specific eval is the only metric that matters.
Public benchmark hygiene
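MMLU saturates above 85%; gaps between leading models are noise. HumanEval is contaminated: its problems are public and show up in training data, so scores can overstate coding ability. MT-Bench scores correlate better with human chat preference than MMLU or HumanEval scores do. Shortlist models using recent benchmarks plus a well-maintained leaderboard (LMSYS Chatbot Arena, HELM).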
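A rough way to see why small gaps are noise is to estimate the sampling error of the difference between two accuracy scores. The sketch below does that; the function name, the 1,000-question benchmark size in the usage lines, and the 95% threshold are illustrative assumptions, and the two runs are treated as independent samples, which is a simplification. It covers sampling error only; prompt format and label errors add further variance.

```python
import math


def gap_is_within_noise(acc_a: float, acc_b: float, n_questions: int, z: float = 1.96) -> bool:
    """Return True if the gap between two accuracies is within sampling noise.

    Hypothetical helper: uses a normal approximation for the difference of
    two proportions, treating the two runs as independent samples.
    """
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    # Standard error of the difference between two binomial proportions.
    se = math.sqrt(
        acc_a * (1 - acc_a) / n_questions + acc_b * (1 - acc_b) / n_questions
    )
    return abs(acc_a - acc_b) <= z * se


# A one-point gap on an assumed 1,000-question benchmark is not distinguishable.
small_gap = gap_is_within_noise(0.88, 0.87, n_questions=1000)
# A ten-point gap on the same benchmark is.
large_gap = gap_is_within_noise(0.90, 0.80, n_questions=1000)
```

Larger benchmarks shrink the noise band, so the same check with the real question count of a given benchmark gives a tighter answer.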
Task-specific eval design
100-500 examples representative of production traffic. Mix easy, hard, and adversarial cases. Grade with an LLM-as-judge calibrated against human labels on the same examples. The rubric should be clear enough that two independent graders agree.
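One way to put a number on "calibrated" and "two graders agree" is Cohen's kappa, which measures agreement beyond what chance would produce. The notes above do not name an agreement metric, so kappa is an assumption here, as are the function name and the toy labels. A minimal sketch:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two graders labeling the same examples.

    Hypothetical helper: 1.0 means perfect agreement, 0.0 means agreement
    no better than chance.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both graders must label the same non-empty set")
    n = len(labels_a)
    # Fraction of examples where the two graders gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each grader's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both graders used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels (made up for illustration): human grader vs. LLM judge.
human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(human, judge)  # 0.5 for these labels
```

The same function works for human-vs-human agreement when checking the rubric. What counts as an acceptable kappa is a judgment call for the team.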
Continuous eval in production
Sample 1% of prod traffic. Grade asynchronously (human, model, or both). Check for regressions weekly against a fixed baseline. Re-run the eval whenever the model or prompt changes.
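The sampling and regression steps can be sketched as follows. Hashing the request ID keeps the 1% sample deterministic, so a given request is always in or out of the sample regardless of which server handles it. The function names, the use of a request ID, and the two-point tolerance are assumptions for illustration, not part of the notes above.

```python
import hashlib


def in_eval_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of requests for grading.

    Hypothetical helper: maps the request ID to a bucket in [0, 1) using
    the first 4 bytes of its SHA-256 digest.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


def regressed(baseline_pass_rate: float, current_pass_rate: float, tolerance: float = 0.02) -> bool:
    """Flag a regression when the pass rate drops more than `tolerance` below baseline."""
    return baseline_pass_rate - current_pass_rate > tolerance


# Usage: count how many of 100,000 synthetic request IDs fall in the sample.
sampled = sum(in_eval_sample(f"req-{i}") for i in range(100_000))
# Weekly check against a baseline pass rate (numbers made up for illustration).
alert = regressed(baseline_pass_rate=0.90, current_pass_rate=0.85)
```

Sampled requests would then go to an asynchronous grading queue; the queue itself is outside this sketch.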