Mixture of Experts — Math and CPU Implications

MoE models contain many experts but activate only a few per token. Total parameter count can be several times (in some designs 10× or more) that of a dense model with similar per-token compute. For CPU inference this is a mixed blessing: the lower compute helps, but all experts must still be loaded in RAM.

Router math

# For each token x:
router_logits = x · W_router   # shape: [num_experts]
probs = softmax(router_logits)
top_k = argtop_k(probs, k=2)

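# Only the top-k experts run; their outputs are mixed by router weight
# (Mixtral renormalizes the k selected weights so they sum to 1):
output = sum over i in top_k of probs[i] * Expert_i(x)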

Sparse activation: only k of N experts compute per token. k=2 is common (Mixtral). Router is a tiny linear layer; experts are full FFN blocks.
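
The routing steps above can be sketched in plain Python for a single token. This is a minimal illustration, not any particular model's implementation: the toy dimensions, the ReLU feed-forward experts, the helper names (`matvec`, `make_expert`, `moe_forward`), and the `renormalize` flag are all assumptions made for the example.

```python
import math
import random

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(x, w):
    """Multiply vector x [d_in] by matrix w [d_in][d_out]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def make_expert(rng, d_model, d_hidden):
    """Hypothetical expert: a two-layer feed-forward block with ReLU."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    def expert(x):
        hidden = [max(h, 0.0) for h in matvec(x, w_in)]
        return matvec(hidden, w_out)
    return expert

def moe_forward(x, w_router, experts, k=2, renormalize=True):
    probs = softmax(matvec(x, w_router))             # one probability per expert
    top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    gates = [probs[i] for i in top_k]
    if renormalize:                                  # assumption: gates sum to 1
        s = sum(gates)
        gates = [g / s for g in gates]
    output = [0.0] * len(x)
    for gate, i in zip(gates, top_k):                # only k experts are evaluated
        y = experts[i](x)
        output = [o + gate * v for o, v in zip(output, y)]
    return output, top_k, gates

rng = random.Random(0)
d_model, d_hidden, num_experts = 8, 16, 8
w_router = [[rng.gauss(0, 1) for _ in range(num_experts)] for _ in range(d_model)]
experts = [make_expert(rng, d_model, d_hidden) for _ in range(num_experts)]
x = [rng.gauss(0, 1) for _ in range(d_model)]
output, top_k, gates = moe_forward(x, w_router, experts, k=2)
print("experts used:", top_k, "gate weights:", [round(g, 3) for g in gates])
```

The other six experts are never called for this token, which is where the compute saving comes from. Their weights still exist in the `experts` list, which is the memory cost.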

Active vs total parameters

# Mixtral 8x7B:
# Total: ~47B parameters (8 experts of ~5.6B each + ~1.6B shared attention/embeddings)
# Active per token: ~13B (2 experts × ~5.6B + ~1.6B shared)
# Memory: must load all ~47B
# Compute: roughly the same as a dense 13B

Quality closer to that of a much larger dense model, at the compute of a smaller one: MoE trades memory for compute. For CPU inference, all 47B parameters still have to fit in RAM, but each token is processed faster than with a dense 47B model.
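
The total and active figures can be reproduced with a little arithmetic. The sketch below assumes the hyperparameters of the published Mixtral 8x7B configuration (hidden size 4096, FFN size 14336, 32 layers, grouped-query attention with 8 KV heads, vocabulary of 32000) and ignores the small normalization weights, so treat it as an estimate rather than an exact count.

```python
# Assumed hyperparameters (published Mixtral 8x7B configuration).
d_model = 4096       # hidden size
d_ffn = 14336        # expert feed-forward size
n_layers = 32
n_experts = 8
top_k = 2            # experts active per token
n_heads = 32
n_kv_heads = 8       # grouped-query attention
vocab = 32000

head_dim = d_model // n_heads
kv_dim = n_kv_heads * head_dim

expert_params = 3 * d_model * d_ffn                          # gate, up, down projections
attn_params = 2 * d_model * d_model + 2 * d_model * kv_dim   # Q and O, then K and V
router_params = d_model * n_experts                          # the tiny linear router
embed_params = 2 * vocab * d_model                           # input embedding + output head

# Everything except the experts is used by every token.
shared = n_layers * (attn_params + router_params) + embed_params
total = shared + n_layers * n_experts * expert_params
active = shared + n_layers * top_k * expert_params

print(f"one expert (all layers): {n_layers * expert_params / 1e9:.1f}B")
print(f"shared:                  {shared / 1e9:.1f}B")
print(f"total:                   {total / 1e9:.1f}B")
print(f"active per token:        {active / 1e9:.1f}B")
```

Under these assumptions the script prints about 5.6B per expert, 1.6B shared, 46.7B total, and 12.9B active. That is why "8x7B" does not mean 56B: attention and embeddings are shared rather than duplicated per expert.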

Load balancing

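Without intervention, the router tends to pick the same few experts over and over, leaving some experts effectively dead (rarely trained) and others overloaded. An auxiliary 'load balance' loss penalizes imbalanced routing during training. It is standard in most open MoE models; some newer designs, such as DeepSeek-V3, rely mainly on a bias-based balancing scheme instead.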
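
One common form of this penalty, used here purely as an illustration, is the Switch Transformer style loss N · Σ f_i · P_i, where f_i is the fraction of routing assignments sent to expert i and P_i is the mean router probability for expert i. The text above does not specify a formula, so this choice and the function name `load_balance_loss` are assumptions.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i.

    router_probs: one list of expert probabilities per token.
    assignments:  one list of chosen expert indices per token.
    """
    num_tokens = len(router_probs)
    total_assignments = sum(len(chosen) for chosen in assignments)

    # f_i: fraction of all routing assignments that went to expert i.
    f = [0.0] * num_experts
    for chosen in assignments:
        for i in chosen:
            f[i] += 1.0 / total_assignments

    # P_i: router probability for expert i, averaged over tokens.
    p = [sum(tp[i] for tp in router_probs) / num_tokens for i in range(num_experts)]

    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens, four experts, top-1 routing (toy data).
balanced_probs = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]
balanced_assign = [[0], [1], [2], [3]]

collapsed_probs = [[0.97, 0.01, 0.01, 0.01] for _ in range(4)]
collapsed_assign = [[0]] * 4          # every token goes to expert 0

balanced = load_balance_loss(balanced_probs, balanced_assign, 4)
collapsed = load_balance_loss(collapsed_probs, collapsed_assign, 4)
print(balanced, collapsed)
```

Perfectly uniform routing gives a loss of 1, while routing that collapses onto one expert approaches N (here close to 4), so minimizing the term pushes the router toward even usage. In training it would be added to the main loss with a small weight.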

CPU inference considerations

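Memory pressure is the hard limit: all experts must be loaded. INT4 quantization helps, but Mixtral 8x7B is still roughly 24 GB at 4 bits per weight (47B × 0.5 bytes), and slightly more in practice once quantization scales are stored. The compute saving helps less than on GPUs, because CPU inference is usually limited by memory access rather than arithmetic. MoE on CPU works, but the benefit is workload-dependent.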
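
The memory figure is simple arithmetic on parameter count and bits per weight. The sketch below is a rough estimate of weight storage only: it ignores the KV cache and runtime buffers, and the 4.5 bits-per-weight row is an illustrative assumption for the overhead of block-wise quantization scales, not a measured value for any specific format.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

total_params = 47e9   # Mixtral 8x7B, all experts

formats = [
    ("FP16", 16),
    ("INT8", 8),
    ("INT4 (ideal)", 4),
    ("4-bit with scale overhead (assumed 4.5 bits)", 4.5),
]

for label, bits in formats:
    print(f"{label:<46} {weight_memory_gb(total_params, bits):6.1f} GB")
```

The ideal 4-bit row gives 23.5 GB, matching the roughly 24 GB quoted above. The footprint depends on total parameters, not active ones, so sparse activation does nothing to reduce it.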

Smaller MoE for small language models (SLMs)?

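Smaller MoE models such as DeepSeek-V2-Lite (about 2.4B active of 16B total) and Qwen's MoE releases (for example Qwen1.5-MoE-A2.7B) go down to roughly 3B active params. DeepSeek-V3 itself is far larger, at about 37B active of 671B total. The memory-compute trade still applies: RAM must hold the total, not the active, parameter count. For laptops with 16-32 GB RAM, dense models are usually more practical than MoE. MoE shines on workstation-class hardware and above.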

MoE: many experts, few active per token. Memory cost = all experts; compute = active only. Limited CPU benefit.