Model leaderboards are loud signal but bad decision-criterion. The actual choice depends on latency, cost at your traffic, terms of service, language support, and feature support (tool use, vision, streaming). Here's the practical framework.
Frontier vs cheap
If your task is genuinely hard (multi-step reasoning, long context, code generation that runs first-try): frontier (Claude, GPT-4-class). If it's narrow and high-volume (classification, extraction, summarization): cheaper (Haiku, Mini-class, or self-hosted SLM).
Self-host vs API
API: fastest to ship, predictable pricing per call. Self-host: cheaper at very high volume (>1M calls/day), full data control, latency control. Hybrid: API for hard tasks, self-host for high-volume narrow tasks.
Things teams forget
Output token cost is usually 3-10x input. Streaming reduces apparent latency but not total cost. Provider's data policy (are your inputs used for training?). Rate limits at the SKU you can actually buy.