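Active expert params per token = K × params/expert. Total expert params = N × params/expert. Memory scales with the total, not the active count: every expert must be resident even though each token uses only K of them.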
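A quick worked version of the arithmetic above, as a minimal sketch. The function name `moe_param_counts` and the example numbers (8 experts, top-2, 100M parameters per expert) are hypothetical and chosen only for illustration; shared non-expert layers such as attention are left out.

```python
def moe_param_counts(num_experts, top_k, params_per_expert):
    """Return (active, total) expert parameter counts for one MoE layer.

    active = K * params/expert  (what one token actually computes with)
    total  = N * params/expert  (what must sit in memory)
    """
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    active = top_k * params_per_expert
    total = num_experts * params_per_expert
    return active, total


# Hypothetical configuration, for illustration only.
active, total = moe_param_counts(num_experts=8, top_k=2, params_per_expert=100_000_000)
print(f"active per token: {active:,}")
print(f"total in memory:  {total:,}")
print(f"active fraction:  {active / total:.2f}")
```

With these numbers a token touches a quarter of the expert parameters, while memory still has to hold all of them.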
What you're seeing
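The router scores every expert for each token and sends the token to its K highest-scoring experts. Without a balancing loss, routing can collapse onto a few experts and leave the rest dead (never selected).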
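Per-token top-K routing can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the demo's actual code: the router logits are random stand-ins for a learned router, and the names `route_tokens` and `expert_load` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route_tokens(router_logits, top_k):
    """For each token, pick the top_k experts by router probability.

    Returns one list per token of (expert_index, gate_weight) pairs,
    with gate weights renormalized to sum to 1 over the chosen experts.
    """
    assignments = []
    for logits in router_logits:
        probs = softmax(logits)
        chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
        norm = sum(probs[i] for i in chosen)
        assignments.append([(i, probs[i] / norm) for i in chosen])
    return assignments


def expert_load(assignments, num_experts):
    """Count how many token slots each expert receives."""
    load = [0] * num_experts
    for token in assignments:
        for idx, _ in token:
            load[idx] += 1
    return load


# Assumption: random logits stand in for a trained router's output.
rng = random.Random(0)
num_tokens, num_experts, top_k = 16, 8, 2
logits = [[rng.gauss(0, 1) for _ in range(num_experts)] for _ in range(num_tokens)]
assignments = route_tokens(logits, top_k)
load = expert_load(assignments, num_experts)
print("tokens per expert:", load)
```

The load always sums to tokens × K, but nothing in the routing step itself forces it to spread evenly across experts.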
★ KEY TAKEAWAY
MoE routes each token to top-K experts. Total params high, active compute low. Memory cost = all experts must be loaded.
▶ WHAT TO TRY
- Click "Route 16 tokens" repeatedly and watch the load balance vary.
- Without an auxiliary balance loss, some experts go unused and others overload.
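One common way to write an auxiliary balance loss is N × Σ f_i × P_i, where f_i is the fraction of routed token slots that go to expert i and P_i is the mean router probability for expert i. The page does not say which formulation the demo uses, so treat the sketch below as an assumption; `load_balance_loss` is a hypothetical name. It only computes the value; in real training the same quantity would be built from framework tensors so gradients can flow through the router probabilities.

```python
def load_balance_loss(router_probs, top_k):
    """Auxiliary balance loss: N * sum_i(f_i * P_i).

    router_probs: one probability list per token (each sums to 1).
    f_i: fraction of routed slots dispatched to expert i.
    P_i: mean router probability assigned to expert i.
    The value is about 1.0 for even routing and grows as routing collapses.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    dispatch = [0] * num_experts
    mean_prob = [0.0] * num_experts
    for probs in router_probs:
        chosen = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:top_k]
        for i in chosen:
            dispatch[i] += 1
        for i, p in enumerate(probs):
            mean_prob[i] += p / num_tokens
    frac = [d / (num_tokens * top_k) for d in dispatch]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


# Balanced: each of 4 tokens prefers a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Collapsed: every token prefers expert 0, so experts 1-3 are dead.
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

print("balanced loss: ", load_balance_loss(balanced, top_k=1))
print("collapsed loss:", load_balance_loss(collapsed, top_k=1))
```

The balanced case comes out at roughly 1.0 and the collapsed case at roughly 2.8, so minimizing this term pushes the router away from overloading a single expert.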