On-device LLM inference moved from demo to default for many features in 2025-2026. Privacy alone could justify the shift; lower cost and lower latency settle it. Knowing what runs where, on which framework, and with what trade-offs is now baseline knowledge for mobile and desktop developers.
Phone tier (3-8GB RAM)
Platform-provided: Apple's Foundation Models (on-device, about 3B parameters), Google's Gemini Nano on Pixel. Open-weight alternatives: Qwen 2.5 1.5B, Phi-3-mini (3.8B), typically quantized to 4-bit (Q4). Expect roughly 5-15 tokens/sec on modern phones. Right for: short summaries, autocomplete, tone shifting.
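To see what 5-15 tokens/sec means for a feature, a back-of-the-envelope estimate helps. The sketch below is only arithmetic: `generation_seconds` is a hypothetical helper, and it ignores prompt processing time, which is an assumption that holds only for short prompts.

```python
def generation_seconds(output_tokens: int, tokens_per_sec: float,
                       load_seconds: float = 0.0) -> float:
    """Rough wall-clock time: optional model load plus decode time.

    Assumption: prompt processing (prefill) is ignored.
    """
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return load_seconds + output_tokens / tokens_per_sec


# A 60-token summary at the low and high end of the phone range.
slow = generation_seconds(60, 5)
fast = generation_seconds(60, 15)
print(f"{fast:.0f}-{slow:.0f} s")  # prints "4-12 s"
```

A few seconds is acceptable for a summary button but not for per-keystroke autocomplete, which is why phone-tier features tend to keep outputs short.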
Laptop tier (16-32GB RAM)
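Llama 3 8B, Mistral 7B, Qwen 2.5 7B at Q4-Q5. Expect 20-50 tokens/sec on Apple Silicon (M3 and later) and similar on an RTX 3060 or better. Right for: serious coding assistance, summarization, RAG with up to 32K context where the model supports it (the original Llama 3 8B tops out at 8K; Llama 3.1 8B extends to 128K).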
Workstation tier (48GB+)
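Llama 70B at Q4, Qwen 72B at Q4, Mixtral 8x22B. Roughly 5-20 tokens/sec depending on GPU. Note that Mixtral 8x22B has about 141B total parameters, so its Q4 weights alone exceed 48GB; plan for roughly 80GB or more. Right for: research, batch processing of documents, locally-hosted assistants.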
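The tier boundaries follow from simple arithmetic: weight size is roughly parameters times bits per weight, divided by 8. The sketch below assumes about 4.5 effective bits per weight for Q4 (real quantization formats vary) and assumes only 85% of memory is usable for weights; both numbers and both function names are illustrative, not measured.

```python
Q4_BITS = 4.5  # assumption: effective bits per weight for a Q4 format


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, ram_gb: float,
         usable_fraction: float = 0.85) -> bool:
    """True if the weights fit in the usable share of memory.

    Assumption: the rest is reserved for the KV cache, OS, and other apps.
    """
    return weight_gb(params_billion, bits_per_weight) <= ram_gb * usable_fraction


for label, params, ram in [
    ("1.5B on an 8GB phone", 1.5, 8),
    ("8B on a 16GB laptop", 8, 16),
    ("70B on a 48GB workstation", 70, 48),
    ("141B (Mixtral 8x22B total) on 48GB", 141, 48),
]:
    size = weight_gb(params, Q4_BITS)
    verdict = "fits" if fits(params, Q4_BITS, ram) else "does not fit"
    print(f"{label}: {size:.1f} GB, {verdict}")
```

Under these assumptions a 70B model at Q4 is a tight fit in 48GB, and a mixture-of-experts model still has to hold all of its experts in memory even though only a few are active per token.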
Frameworks
llama.cpp/Ollama: cross-platform, CPU+GPU, easiest to set up. MLX (Apple): usually the fastest on Apple Silicon. TensorRT-LLM (NVIDIA): usually the fastest on NVIDIA GPUs, including RTX cards. Core ML / TFLite (now LiteRT): phones. Pick by platform.
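"Pick by platform" can be written down as a small decision function. This is a deliberately simplified sketch of the guidance above, not a benchmark-backed rule: `pick_runtime` and its parameters are hypothetical, and it returns framework names only, without calling any of them.

```python
def pick_runtime(os_name: str, apple_silicon: bool = False,
                 nvidia_gpu: bool = False) -> str:
    """Map a target platform to the framework suggested in the text.

    Simplification: one answer per platform; real projects may mix runtimes.
    """
    name = os_name.lower()
    if name == "ios":
        return "Core ML"
    if name == "android":
        return "TFLite"
    if name == "macos" and apple_silicon:
        return "MLX"
    if nvidia_gpu:
        return "TensorRT-LLM"
    # Portable default: runs on CPU or GPU across operating systems.
    return "llama.cpp/Ollama"


print(pick_runtime("macOS", apple_silicon=True))  # prints "MLX"
print(pick_runtime("Windows"))                    # prints "llama.cpp/Ollama"
```

Falling back to llama.cpp/Ollama reflects the trade-off in the text: it is rarely the fastest option on a given platform, but it is the one that works everywhere.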
What to expect
Cold start matters (model load: 1-10 sec). Context length costs RAM linearly through the KV cache (8K is fine; 32K+ is heavy). Small models won't match GPT-4 on hard tasks, but for many users they are good enough on common ones.
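The linear RAM cost of context comes from the KV cache, which stores one key and one value vector per layer, per KV head, per token. A minimal sketch of that arithmetic, using the published Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128) and assuming a 16-bit cache; `kv_cache_bytes` is a hypothetical helper, and runtimes that quantize the cache will use less.

```python
def kv_cache_bytes(context_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 (key and value) x layers x KV heads x head dim
    x bytes per element x tokens. Assumes an unquantized 16-bit cache
    by default."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len


# Llama 3 8B shape: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
for ctx in (8192, 32768):
    gib = kv_cache_bytes(ctx, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"{ctx} tokens: {gib:.1f} GiB")
# prints:
# 8192 tokens: 1.0 GiB
# 32768 tokens: 4.0 GiB
```

On a 16GB laptop, 4 GiB of cache on top of roughly 4.5GB of Q4 weights is workable but leaves little room for anything else, which is what "heavy" means in practice.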