Knowing the architecture choices of the top small language models (SLMs) helps you pick one for CPU inference and understand what makes them tick. Phi-3, Qwen 2.5, and Gemma 2 dominate the 1-9B size class. Their hyperparameters differ in revealing ways.
Phi-3-mini (3.8B)
d_model: 3072
n_layers: 32
n_heads: 32 (no GQA)
d_ff: 8192 (SwiGLU)
vocab: 32064
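context: 4K → 128K (LongRoPE)
norm: RMSNorm
Microsoft's bet on high-quality synthetic data. No GQA, which is surprising for a 3.8B model: every query head keeps its own key/value head, so the KV cache grows quickly with context length. Excellent reasoning and code for its size. Strongest in short-context regimes.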
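As a sanity check on the numbers above, the parameter count can be roughly reconstructed from the hyperparameters alone. The sketch below assumes full multi-head attention (four d_model × d_model projections), a gated feed-forward block with three matrices, no bias terms, and untied input/output embeddings; the function name `estimate_params` is made up for illustration.

```python
def estimate_params(d_model, n_layers, d_ff, vocab, tied_embeddings=False):
    """Rough parameter count for a decoder-only transformer.

    Assumes full multi-head attention (no GQA), a gated FFN with
    three weight matrices, RMSNorm weights only, and no biases.
    """
    attn = 4 * d_model * d_model   # Q, K, V and output projections
    ffn = 3 * d_model * d_ff       # gate, up and down projections
    norms = 2 * d_model            # two RMSNorm weight vectors per layer
    per_layer = attn + ffn + norms

    embed = vocab * d_model                           # input embedding table
    head = 0 if tied_embeddings else vocab * d_model  # output projection
    final_norm = d_model
    return n_layers * per_layer + embed + head + final_norm


# Hyperparameters taken from the Phi-3-mini card above.
phi3_mini = estimate_params(d_model=3072, n_layers=32, d_ff=8192, vocab=32064)
print(f"{phi3_mini / 1e9:.2f}B parameters")  # prints "3.82B parameters"
```

Under these assumptions the estimate lands on the advertised 3.8B, which suggests the listed hyperparameters account for essentially all of the weights.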
Qwen 2.5-3B
d_model: 2048
n_layers: 36 (deeper!)
n_heads_q: 16
n_heads_kv: 2 (8x GQA)
d_ff: 11008
vocab: 151936 (multilingual)
context: 32K native
Alibaba's strong multilingual model. Deep and thin. Aggressive GQA (8× fewer KV heads than query heads). Larger vocab for Chinese and multilingual coverage. Strong tool use.
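To make the effect of GQA concrete, here is a back-of-the-envelope KV cache calculation for Qwen 2.5-3B against Phi-3-mini. It is a sketch under stated assumptions: an fp16 cache (2 bytes per element), a head dimension of d_model divided by the query head count (96 for Phi-3-mini, 128 for Qwen 2.5-3B), and a 4096-token sequence. The helper name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the key/value cache for one sequence.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an fp16 (or bf16) cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


MIB = 1024 ** 2
SEQ_LEN = 4096

# Phi-3-mini: no GQA, so all 32 heads keep their own keys and values.
phi3 = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=3072 // 32, seq_len=SEQ_LEN)

# Qwen 2.5-3B: 16 query heads share 2 KV heads.
qwen = kv_cache_bytes(n_layers=36, n_kv_heads=2, head_dim=2048 // 16, seq_len=SEQ_LEN)

# Hypothetical Qwen variant without GQA, for comparison only.
qwen_no_gqa = kv_cache_bytes(n_layers=36, n_kv_heads=16, head_dim=2048 // 16, seq_len=SEQ_LEN)

print(f"Phi-3-mini:          {phi3 / MIB:.0f} MiB")         # 1536 MiB
print(f"Qwen 2.5-3B (GQA):   {qwen / MIB:.0f} MiB")         # 144 MiB
print(f"Qwen 2.5-3B (no GQA): {qwen_no_gqa / MIB:.0f} MiB")  # 1152 MiB
```

On a CPU, where memory bandwidth usually limits decoding speed, a cache roughly ten times smaller at the same context length is a meaningful advantage, even though Qwen has more layers.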
Gemma 2-2B
d_model: 2304
n_layers: 26
n_heads_q: 8
n_heads_kv: 4 (2x GQA)
d_ff: 9216
vocab: 256000 (very large)
context: 8K
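norm: RMSNorm
Google's open-weights model, built on Gemini research. Huge vocab (256K): better multilingual coverage and fewer tokens per input, at the cost of a large embedding table. The instruction-tuned variants are tied to Gemma's own chat tuning approach.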
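The cost of a 256K vocabulary shows up directly in the embedding table. The sketch below compares Gemma 2-2B with Phi-3-mini using only the vocab and d_model values from the cards above; it assumes fp16 storage (2 bytes per weight) and counts a single embedding table per model. The helper name `embedding_params` is made up for illustration.

```python
def embedding_params(vocab, d_model):
    """Number of weights in one token embedding table."""
    return vocab * d_model


MIB = 1024 ** 2
BYTES_FP16 = 2  # assumption: weights stored in fp16

gemma = embedding_params(vocab=256000, d_model=2304)
phi3 = embedding_params(vocab=32064, d_model=3072)

print(f"Gemma 2-2B: {gemma:,} weights, {gemma * BYTES_FP16 / MIB:.0f} MiB")
print(f"Phi-3-mini: {phi3:,} weights, {phi3 * BYTES_FP16 / MIB:.0f} MiB")
print(f"Ratio: {gemma / phi3:.1f}x")
```

Gemma's table comes to about 590M weights against roughly 99M for Phi-3-mini, so a sizeable share of Gemma 2-2B's memory footprint sits in the vocabulary rather than in the transformer layers.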
Common patterns
All three use RMSNorm, RoPE, and a gated feed-forward block (SwiGLU in Phi-3 and Qwen 2.5, GeGLU in Gemma 2). All are decoder-only and trained on next-token prediction. They differ in depth/width ratio, KV head count, vocab size, and training data. The architecture has largely stabilized; data and post-training are now the main differentiators.
Choosing for CPU inference
# Speed (smaller, simpler): Gemma 2 2B > Qwen 2.5 3B > Phi-3 mini
# Quality (per param): Phi-3 mini > Qwen 2.5 3B > Gemma 2 2B
# Multilingual: Qwen 2.5 > Gemma 2 > Phi-3
# Code: Phi-3 > Qwen 2.5 > Gemma 2
# Long context: Phi-3 LongRoPE > Qwen 2.5 > Gemma 2
Let the workload pick the model. Don't default to whatever is popular; run a small eval (50-100 examples) on your domain before committing.
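A small domain eval does not need much machinery. The following is a minimal sketch, not a full harness: it assumes each candidate model is wrapped in a plain `generate(prompt) -> str` callable (how you get that callable depends on your runtime and is not shown) and scores by case-insensitive exact match, which only suits tasks with short, unambiguous answers. The names `exact_match_accuracy` and `rank_models` are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]        # (prompt, expected answer)
Generate = Callable[[str], str]  # hypothetical wrapper around a model


def exact_match_accuracy(generate: Generate, examples: List[Example]) -> float:
    """Fraction of examples whose output matches the expected answer."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        generate(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in examples
    )
    return hits / len(examples)


def rank_models(models: Dict[str, Generate], examples: List[Example]) -> List[Tuple[str, float]]:
    """Score every model on the same examples, best first."""
    scores = [(name, exact_match_accuracy(gen, examples)) for name, gen in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Stand-in "models" so the sketch runs without any inference runtime.
examples = [("2+2=", "4"), ("Capital of France?", "Paris")]
stubs = {
    "echo": lambda prompt: prompt,
    "always-4": lambda prompt: "4",
}
ranking = rank_models(stubs, examples)
print(ranking)  # [('always-4', 0.5), ('echo', 0.0)]
```

In practice you would replace the stubs with wrappers around each candidate model and swap exact match for whatever metric fits your task.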