On-Device LLM Inference — Belgavi.AI Lab

On-device LLM inference moved from demo to default for many features in 2025-2026. The privacy story alone justifies it, but cost and latency seal it. Knowing what runs where, with which framework, and with what trade-offs is now baseline mobile/desktop knowledge.

Phone tier (3-8GB RAM)

Apple's Foundation Models framework exposes an on-device model of roughly 3B parameters; Google ships Gemini Nano on Pixel. Generic options: Qwen 2.5 1.5B and Phi-3-mini (3.8B), quantized to Q4 (about 4 bits per weight). Expect 5-15 tokens/sec on modern phones. Right for: short summaries, autocomplete, tone shifting.
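
To see why this tier suits short outputs, it helps to turn tokens/sec into wall-clock time. The sketch below is plain arithmetic on the 5-15 tokens/sec range quoted above; the function names and the 60-token summary length are illustrative assumptions, and prompt processing and model load time are ignored.

```python
def decode_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate n_tokens at a steady decode rate (prompt processing ignored)."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return n_tokens / tokens_per_sec


def latency_range(n_tokens: int, slow_tps: float = 5.0, fast_tps: float = 15.0):
    """Return (best_case, worst_case) seconds for the phone-tier rate range."""
    return decode_seconds(n_tokens, fast_tps), decode_seconds(n_tokens, slow_tps)


if __name__ == "__main__":
    # Assumed example: a 60-token summary.
    best, worst = latency_range(60)
    print(f"60-token summary: {best:.0f}-{worst:.0f} s")
```

A 60-token summary lands in a few seconds to low double digits, which is tolerable; a 1,000-token answer at the same rates would take minutes, which is why long-form generation belongs on the larger tiers.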

Laptop tier (16-32GB RAM)

Llama 3 8B, Mistral 7B, Qwen 2.5 7B at Q4-Q5. Expect 20-50 tokens/sec on Apple Silicon M3+, similar on RTX 3060+. Right for: serious coding assistance, summarization, RAG with up to 32K context where the model supports it (the original Llama 3 8B tops out at 8K).

Workstation tier (48GB+)

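Llama 70B Q4, Qwen 72B Q4, and, with considerably more memory than 48GB, Mixtral 8x22B (about 141B total parameters, so roughly 70-80GB of weights at Q4). Expect 5-20 tokens/sec depending on GPU. Right for: research, batch processing of documents, locally-hosted assistants.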
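
The tier boundaries follow from simple arithmetic: weight memory is roughly parameter count times bits per weight. The sketch below is a rough estimator, not a measurement. The 4.5 bits/weight default (Q4 formats carry some scale metadata beyond 4 bits) and the 1.2x headroom factor for KV cache and runtime buffers are assumptions, and the function names are hypothetical.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    # params_billions * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, ram_gb: float,
         bits_per_weight: float = 4.5, headroom: float = 1.2) -> bool:
    """True if weights plus an assumed 20% headroom fit in ram_gb."""
    return weight_memory_gb(params_billions, bits_per_weight) * headroom <= ram_gb


if __name__ == "__main__":
    # (label, parameters in billions, RAM budget in GB) for each tier
    for name, params, ram in [
        ("1.5B on a phone", 1.5, 3),
        ("8B on a laptop", 8, 16),
        ("70B on a workstation", 70, 48),
        ("141B (Mixtral 8x22B) on a workstation", 141, 48),
    ]:
        size = weight_memory_gb(params)
        print(f"{name}: ~{size:.1f} GB weights, fits in {ram} GB: {fits(params, ram)}")
```

Under these assumptions a 70B model at Q4 needs about 39 GB for weights alone, which is why it only just fits at 48GB, while a 141B mixture-of-experts model does not.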

Frameworks

llama.cpp/Ollama: cross-platform, CPU+GPU, easy to set up. MLX (Apple): fastest on Apple Silicon. TensorRT-LLM (NVIDIA): fastest on NVIDIA GPUs, including RTX cards. Core ML / TFLite (now LiteRT): phones. Pick by platform.
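
"Pick by platform" can be written down as a small decision function. This is a minimal sketch of the rule of thumb above, not an API of any of these frameworks: `pick_framework` is a hypothetical helper, the caller is assumed to know whether an NVIDIA GPU is present, and the mapping of iOS to Core ML and Android to TFLite is one reasonable reading of the phone entry.

```python
import platform


def pick_framework(system: str, machine: str, has_nvidia_gpu: bool = False) -> str:
    """Map a platform description to the framework suggested in the text.

    system:  OS name, e.g. "Darwin", "Linux", "Windows", "iOS", "Android"
    machine: CPU architecture, e.g. "arm64", "x86_64"
    """
    system = system.lower()
    machine = machine.lower()

    if system == "ios":
        return "Core ML"
    if system == "android":
        return "TFLite"
    if system == "darwin" and machine in ("arm64", "aarch64"):
        return "MLX"  # Apple Silicon Mac
    if has_nvidia_gpu:
        return "TensorRT-LLM"
    return "llama.cpp/Ollama"  # cross-platform fallback


if __name__ == "__main__":
    # GPU detection is out of scope here, so the default (no NVIDIA GPU) is used.
    print(pick_framework(platform.system(), platform.machine()))
```

The fallback ordering is deliberate: llama.cpp/Ollama runs everywhere, so the platform-specific options are only chosen when their hardware is known to be present.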

What to expect

Cold start matters: loading the model takes 1-10 sec. The KV cache grows linearly with context length, so RAM use does too (8K is fine; 32K+ is heavy). Small models won't match GPT-4 on hard tasks, but for many users they are good enough on common ones.
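
The linear RAM cost of context comes from the KV cache: every token stores one key and one value vector per layer. The sketch below estimates that cost using the published Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128) as defaults and assuming an unquantized fp16 cache (2 bytes per value); runtimes that quantize the cache will use less.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes for n_tokens of context.

    The factor 2 covers keys and values. Defaults assume a Llama 3 8B-shaped
    model with an fp16 cache.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


if __name__ == "__main__":
    gib = 1024 ** 3
    for ctx in (8 * 1024, 32 * 1024):
        print(f"{ctx} tokens: {kv_cache_bytes(ctx) / gib:.1f} GiB")
```

With these defaults the cache costs 128 KiB per token, so 8K of context adds 1 GiB on top of the weights and 32K adds 4 GiB, a large share of a 16GB laptop once a 7-8B model is already loaded.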

Phones: 1.5B-3B Q4. Laptops: 7-8B Q4. Workstations: 70B Q4. Pick framework by platform. Expect cold-start cost.