Mixture of Experts (MoE), multiple specialist sub-networks gated by a router, went from research to standard. Examples: Mixtral 8x22B, DeepSeek V2, and reportedly the GPT-4 family (OpenAI has not confirmed the architecture). The win is compute efficiency: a model with 200B total params that activates only 30B per token.

Why it works

Most tokens don't need all parameters. For each token, the router picks a small number of experts from a pool (often 1-2; Mixtral picks 2 of 8), and only those experts run. Result: better quality per FLOP, or the same quality at lower inference cost. Per-token compute scales with active params, not total.
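To make the routing step concrete, here is a minimal pure-Python sketch of top-k gating. It is an illustration under stated assumptions, not how any particular model implements it: the names `route_top_k` and `moe_forward` are hypothetical, the experts are toy functions, and renormalizing the selected weights so they sum to 1 is one common choice among several.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # assumption: renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts; the rest never run."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): each just scales the input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

logits = [0.1, 2.0, -1.0, 1.0]   # router scores for one token
chosen = route_top_k(logits, k=2)
output = moe_forward([1.0, 2.0], logits, experts, k=2)
print(chosen, output)
```

With 4 experts and k=2, only half of the expert parameters are touched for this token, which is where the compute saving comes from.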

Memory cost vs compute cost

All experts must stay resident in memory even though only a couple activate per token, because any token can be routed to any expert. So MoE is fast but memory-heavy: it trades GPU memory for GPU compute. A good fit for inference servers with VRAM headroom.
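A back-of-the-envelope calculation shows the trade, using the 200B-total / 30B-active example from the intro. The assumptions are rough and labeled in the code: 2 bytes per parameter (fp16/bf16 weights), about 2 FLOPs per active parameter per token for a forward pass, and no accounting for KV cache or activations. The helper name `moe_footprint` is hypothetical.

```python
def moe_footprint(total_params, active_params, bytes_per_param=2):
    """Rough weight memory (GB) and forward compute per token (GFLOPs).

    Assumptions: weights stored at bytes_per_param each (2 = fp16/bf16),
    ~2 FLOPs per active parameter per token, KV cache and activations ignored.
    """
    weight_gb = total_params * bytes_per_param / 1e9
    gflops_per_token = 2 * active_params / 1e9
    return weight_gb, gflops_per_token

moe_gb, moe_gflops = moe_footprint(total_params=200e9, active_params=30e9)
dense_gb, dense_gflops = moe_footprint(total_params=30e9, active_params=30e9)

print(f"MoE   200B/30B: {moe_gb:.0f} GB weights, {moe_gflops:.0f} GFLOPs/token")
print(f"Dense 30B:      {dense_gb:.0f} GB weights, {dense_gflops:.0f} GFLOPs/token")
```

Under these assumptions the MoE model costs about the same compute per token as a 30B dense model but needs roughly 6-7x the weight memory.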

Routing fragility

Bad router decisions hurt quality. Training is harder to stabilize than for dense models. Inference batching is awkward: tokens in the same batch hit different experts, so per-expert batch sizes are uneven and change from step to step. Deployment complexity is higher.
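The batching problem is easy to see in a toy simulation. The sketch below routes a batch of tokens with random router scores and counts how many land on each expert. Everything here is an assumption for illustration: the scores are Gaussian noise rather than a trained router's output, and `expert_load` is a hypothetical helper.

```python
import random
from collections import Counter

def expert_load(batch_logits, k=2):
    """Count how many tokens in the batch are routed to each expert (top-k)."""
    load = Counter()
    for logits in batch_logits:
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
        load.update(top)
    return load

random.seed(0)  # fixed seed so the run is repeatable
num_experts, batch_size, k = 8, 32, 2

# Stand-in for router outputs: one score per expert for every token.
batch_logits = [[random.gauss(0, 1) for _ in range(num_experts)]
                for _ in range(batch_size)]

load = expert_load(batch_logits, k)
counts = [load[e] for e in range(num_experts)]
ideal = batch_size * k / num_experts  # perfectly even split

print("tokens per expert:", counts)
print("ideal per expert: ", ideal)
```

Even with unbiased random scores the per-expert counts rarely match the ideal split, so some experts process larger mini-batches than others in the same step. A trained router with real preferences can skew this further.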

MoE = active-params-cheap, total-params-RAM-heavy. Good for memory-rich inference. Routing is the fragile bit.