Mixture of Experts (MoE), a set of specialist sub-networks gated by a router, went from research to standard. Mixtral 8x22B and DeepSeek V2 are open examples; GPT-4 is widely reported, though not confirmed by OpenAI, to use it too. The win is compute efficiency: a model with 200B params that activates only 30B per token.
Why it works
Most tokens don't need all parameters. For each token, the router picks a small number of experts from a pool (often 1 or 2, though some designs pick more). Result: better quality per FLOP, or the same quality at lower inference cost. Per-token compute scales with active params, not total.
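A minimal sketch of top-k routing for a single token, in plain Python. The names (`route_top_k`, `moe_forward`) and the toy scalar "experts" are hypothetical, chosen only for illustration. Renormalizing the gate weights over the selected experts is one common choice, not the only one.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts (assumption)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)
    return out

# Toy experts: each one just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # stand-in for router scores on one token

selected = route_top_k(logits, k=2)
y = moe_forward(1.0, logits, experts, k=2)
```

Here experts 1 and 3 tie for the highest score, so each gets weight 0.5 and only those two functions are called. In a real model the experts are feed-forward blocks and the logits come from a learned linear layer.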
Memory cost vs compute cost
All experts must sit in memory even though only a few activate per token. So MoE is fast but memory-heavy: it trades GPU memory for GPU compute. Right for inference servers with VRAM headroom.
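A back-of-the-envelope sketch of that trade, using the 200B total / 30B active example from the intro. It assumes 2 bytes per parameter (fp16/bf16 weights) and the usual rough estimate of 2 FLOPs per active parameter per token. It ignores KV cache, activations, and router overhead. `moe_footprint` is a hypothetical helper.

```python
def moe_footprint(total_params, active_params, bytes_per_param=2):
    """Rough weight memory (GB) and forward compute per token (GFLOPs).

    Assumptions: weights only, 2 FLOPs per active parameter per token.
    """
    weight_gb = total_params * bytes_per_param / 1e9
    gflops_per_token = 2 * active_params / 1e9
    return weight_gb, gflops_per_token

# MoE from the intro: 200B total, 30B active per token.
moe_gb, moe_gflops = moe_footprint(200e9, 30e9)

# A dense 30B model for comparison: every parameter is active.
dense_gb, dense_gflops = moe_footprint(30e9, 30e9)
```

Under these assumptions the MoE needs about 400 GB for weights against 60 GB for the dense model, while both spend about 60 GFLOPs per token: dense-30B compute at 200B-class memory.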
Routing fragility
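Bad router decisions hurt quality. Training is harder to stabilize than for dense models, since the router can collapse onto a few experts unless load is balanced. Inference batching is awkward: tokens in the same batch hit different experts, so per-expert batches come out uneven. Deployment complexity is higher.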
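To make the batching problem concrete, here is a small sketch that counts how many tokens in a batch land on each expert. The routing assignments are made up for illustration, and `expert_load` and `imbalance` are hypothetical helpers, not part of any library.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Count tokens routed to each expert; assignments[i] is token i's expert tuple."""
    counts = Counter(e for token_experts in assignments for e in token_experts)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance(load):
    """Ratio of the busiest expert's load to the mean load (1.0 means perfectly even)."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean else 0.0

# Made-up top-2 routing for a batch of 6 tokens over 4 experts.
assignments = [(0, 1), (0, 2), (0, 1), (3, 0), (1, 0), (0, 2)]

load = expert_load(assignments, num_experts=4)
ratio = imbalance(load)
```

In this toy batch expert 0 receives 6 tokens and expert 3 receives 1, so the busiest expert carries twice the mean load. Skew like this is what makes per-expert batching uneven at inference time, and it is what load-balancing terms try to reduce during training.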