Phi-4 (14B) hits a sweet spot for fine-tuning: small enough to QLoRA on one GPU, strong enough to compete with much larger models on benchmarks. The fine-tuning recipe is straightforward but easy to get wrong; here's the path that works.

Advertisement

Data preparation

500-5000 examples for narrow tasks. Format as instruction-response (Phi's preferred), or chat (also supported). Include chain-of-thought reasoning for tasks needing it — Phi was trained on CoT-heavy data and responds well to this format.

QLoRA setup

4-bit base (NF4), LoRA adapters on attention + MLP projections, rank=16-32, alpha=32-64. Fits in ~24GB GPU memory. Use bitsandbytes + peft. Standard recipe; minimal tuning needed.

Advertisement

Training hyperparameters

LR=2e-4 (LoRA can use higher LR than full fine-tune). Cosine schedule with 100 warmup steps. Batch size = whatever fits + gradient accumulation. 3-5 epochs for narrow tasks. Watch eval loss for over-fit (Phi is small enough to over-fit fast).

Evaluation along the way

Hold out 10% for eval. Compute metrics every epoch. Best checkpoint by eval metric, not final. Phi often peaks at epoch 2-3 and degrades after.

Inference deployment

Merge LoRA weights into base for production (eliminates the adapter forward pass overhead). Quantize merged model to INT4 if memory-constrained. vLLM or llama.cpp for serving. Expect 50+ tokens/sec on RTX 4090, 200+ on H100.

Phi-4 + QLoRA + 1-5K examples + 3-5 epochs + eval each step. The recipe is straightforward; data quality is the gate.