Knowing the architecture choices behind the leading small language models (SLMs) helps you pick one for CPU inference and understand why they behave differently. Phi-3, Qwen 2.5, and Gemma 2 dominate the 1-9B parameter class, and their hyperparameters differ in revealing ways.

Phi-3-mini (3.8B)

d_model:      3072
n_layers:     32
n_heads:      32 (no GQA)
d_ff:         8192 (SwiGLU)
vocab:        32064
context:      4K → 128K (LongRoPE)
norm:         RMSNorm

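Microsoft's bet on high-quality synthetic data. It uses no GQA: all 32 heads keep their own keys and values, which is surprising for a 3.8B model because the KV cache then grows quickly with context length. Reasoning and code are excellent for its size. It is strongest at short contexts.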
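As a sanity check on the hyperparameters above, the parameter count can be estimated directly from them. This is a rough sketch, assuming bias-free linear layers, a gated MLP with three weight matrices, and separate (untied) input and output embeddings; the function name `decoder_param_count` is illustrative, not from any library.

```python
def decoder_param_count(d_model, n_layers, n_heads_q, n_heads_kv, head_dim,
                        d_ff, vocab, tied_embeddings):
    """Estimate parameters of a bias-free decoder with a gated MLP."""
    q_dim = n_heads_q * head_dim
    kv_dim = n_heads_kv * head_dim
    # Attention: Q, K, V projections plus the output projection.
    attn = d_model * q_dim + 2 * d_model * kv_dim + q_dim * d_model
    # Gated MLP: gate and up projections (d_model -> d_ff), then down projection.
    mlp = 3 * d_model * d_ff
    # Two RMSNorm weight vectors per layer.
    norms = 2 * d_model
    per_layer = attn + mlp + norms
    embed = vocab * d_model
    lm_head = 0 if tied_embeddings else vocab * d_model
    final_norm = d_model
    return n_layers * per_layer + embed + lm_head + final_norm

phi3_mini = decoder_param_count(
    d_model=3072, n_layers=32, n_heads_q=32, n_heads_kv=32,
    head_dim=3072 // 32, d_ff=8192, vocab=32064,
    tied_embeddings=False,  # assumption: separate input and output embeddings
)
print(f"{phi3_mini / 1e9:.2f}B parameters")
```

Under these assumptions the estimate lands close to the advertised 3.8B, and almost all of it sits in the 32 transformer layers rather than in the embeddings.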

Qwen 2.5-3B

d_model:      2048
n_layers:     36 (deeper!)
n_heads_q:    16
n_heads_kv:   2 (8x GQA)
d_ff:         11008
vocab:        151936 (multilingual)
context:      32K native

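Alibaba's strong multilingual model. It is deep and thin: 36 layers at a width of 2048. GQA is aggressive, with 16 query heads sharing 2 KV heads, so the KV cache is 8× smaller than with full multi-head attention. The larger vocab improves Chinese and multilingual coverage. Tool use is strong.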
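To see what the 8× GQA ratio buys on a memory-constrained CPU box, here is a back-of-the-envelope KV-cache calculation. It is a sketch that assumes a 16-bit cache (2 bytes per value) and a head dimension of d_model / n_heads_q; the 16-KV-head variant is a hypothetical full multi-head baseline, not a real model.

```python
def kv_cache_bytes(n_layers, n_heads_kv, head_dim, n_tokens, bytes_per_value=2):
    """Bytes for cached keys and values (the factor 2) across all layers."""
    return 2 * n_layers * n_heads_kv * head_dim * n_tokens * bytes_per_value

HEAD_DIM = 2048 // 16   # d_model / n_heads_q = 128
CONTEXT = 32 * 1024     # the 32K native context

gqa = kv_cache_bytes(n_layers=36, n_heads_kv=2, head_dim=HEAD_DIM, n_tokens=CONTEXT)
# Hypothetical baseline: every query head keeps its own K and V.
mha = kv_cache_bytes(n_layers=36, n_heads_kv=16, head_dim=HEAD_DIM, n_tokens=CONTEXT)

print(f"GQA (2 KV heads):  {gqa / 2**30:.3f} GiB")
print(f"MHA (16 KV heads): {mha / 2**30:.3f} GiB")
print(f"ratio: {mha // gqa}x")
```

The cache scales linearly with the number of KV heads, so the head ratio carries straight through to memory use and to the bandwidth needed per generated token.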

Gemma 2-2B

d_model:      2304
n_layers:     26
n_heads_q:    8
n_heads_kv:   4 (2x GQA)
d_ff:         9216
vocab:        256000 (very large)
context:      8K
norm:         RMSNorm

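Google's open-weights model built on Gemini research. The huge 256K vocab gives better multilingual coverage and less tokenization waste, at the cost of a very large embedding table. The instruction-tuned variants are tied to Gemma's own chat-tuning approach.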
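The cost side of a 256K vocab is easy to quantify from the numbers above. The following sketch computes the size of the embedding table and the work needed to produce output logits for one token; the storage widths are assumptions, and real quantization formats add some overhead for scales.

```python
VOCAB = 256_000
D_MODEL = 2304

embed_params = VOCAB * D_MODEL  # one row of d_model weights per token

# Assumed storage widths; real quantized formats add per-block scale factors.
for name, bytes_per_weight in [("fp16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    gib = embed_params * bytes_per_weight / 2**30
    print(f"{name:>5}: {gib:.2f} GiB")

# Projecting one hidden state onto the vocabulary is a d_model x vocab
# matrix-vector product: one multiply and one add per weight.
logit_flops_per_token = 2 * D_MODEL * VOCAB
print(f"embedding parameters: {embed_params / 1e6:.0f}M")
print(f"output-logit FLOPs per token: {logit_flops_per_token / 1e9:.2f} GFLOPs")
```

For a model in the 2B class, an embedding table of this size is a noticeable share of the total weights, which is worth keeping in mind when comparing "2B" against "3B" on CPU.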

Common patterns

All three are decoder-only transformers trained on next-token prediction, with RMSNorm, RoPE, and a gated MLP (SwiGLU in Phi-3 and Qwen 2.5, the closely related GeGLU in Gemma 2). They differ in depth/width ratio, KV head count, vocab size, and training data. The architecture has stabilized; data and post-training are now the differentiators.
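Two of the shared building blocks, RMSNorm and the gated MLP, fit in a few lines. This is a toy pure-Python sketch with made-up weights and tiny dimensions, meant only to show the shape of the computation; real implementations use tensor libraries, and details such as epsilon or weight parameterization vary by model.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x to unit root-mean-square, then apply a learned per-channel gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def gated_mlp(x, w_gate, w_up, w_down, act=silu):
    """down(act(gate(x)) * up(x)); act=silu gives the SwiGLU form."""
    gate = [act(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

# Toy sizes: d_model = 2, d_ff = 3. Weights are arbitrary illustrative values.
x = [3.0, 4.0]
normed = rms_norm(x, weight=[1.0, 1.0])
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w_down = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
out = gated_mlp(normed, w_gate, w_up, w_down)
print(normed, out)
```

Swapping the `act` argument for a GELU function would give the GeGLU variant; the rest of the block is unchanged, which is one reason the choice of gate activation matters less than data and tuning.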

Choosing for CPU inference

# Speed (smaller, simpler):     Gemma 2 2B > Qwen 2.5 3B > Phi-3 mini
# Quality (per param):           Phi-3 mini > Qwen 2.5 3B > Gemma 2 2B
# Multilingual:                  Qwen 2.5 > Gemma 2 > Phi-3
# Code:                          Phi-3 > Qwen 2.5 > Gemma 2
# Long context:                  Phi-3 LongRoPE > Qwen 2.5 > Gemma 2

Let the workload pick the model. Don't default to whatever is popular: run a small eval (50-100 examples) on your own domain before committing.
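A small domain eval does not need a framework. The sketch below scores any text-in, text-out callable by exact match on (prompt, reference) pairs; `toy_model` is a hypothetical stand-in for a real model call, and exact match is only one possible metric (it suits short factual or classification answers, not free-form generation).

```python
def exact_match_accuracy(generate, examples):
    """Fraction of examples whose normalized output equals the reference."""
    def normalize(text):
        # Lowercase and collapse whitespace so trivial differences don't count.
        return " ".join(text.lower().split())

    hits = sum(
        normalize(generate(prompt)) == normalize(reference)
        for prompt, reference in examples
    )
    return hits / len(examples)

# Hypothetical stand-in for a real model call (e.g. a local inference binding).
def toy_model(prompt):
    canned = {"2 + 2 =": "4", "Capital of France?": "paris"}
    return canned.get(prompt, "")

examples = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
]
score = exact_match_accuracy(toy_model, examples)
print(f"exact match: {score:.2f}")
```

Running the same example set through each candidate model, with identical prompts and decoding settings, gives a comparison that reflects your workload rather than a public leaderboard.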

In short: Phi-3 is full multi-head attention and quality per parameter. Qwen 2.5 is aggressive GQA and multilingual strength. Gemma 2 is a huge vocab. The architecture is stable; data and tuning differentiate.