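Standard language models are trained to predict only the next token. Multi-Token Prediction (MTP), used in DeepSeek V3, trains the model to predict several future tokens at each position (typically the next 2-4), with one extra head per additional offset. All heads are trained jointly. At inference the extra heads act as a built-in draft for speculative decoding, giving roughly 1.5-2× throughput at no quality cost, since the main head still decides every emitted token. Note that DeepSeek V3's own variant chains a small sequential module to predict one extra token rather than using independent heads; the parallel-head form described here is the simplest version of the idea.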
The architecture
# Shared body produces hidden state h
# Head 1: predicts next token (standard)
# Head 2: predicts token+2
# Head 3: predicts token+3
# ...
# Each head: linear layer (d × V) on the shared hidden state
This adds N-1 projection heads, a small share of the parameters in a large model. Each head is trained on the same input but a different target offset. At inference, head 1 gives the actual next token; heads 2..N propose speculative continuations.
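The layout above can be sketched in a few lines of plain Python. This is a toy illustration only: the sizes, the random weights, and the helper names (`make_head`, `apply_head`, `mtp_logits`) are assumptions made for the example, and the hidden vector `h` stands in for the output of a real transformer body.

```python
import random

def make_head(d, vocab, rng):
    """One projection head: a (vocab x d) weight matrix as nested lists."""
    return [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(vocab)]

def apply_head(head, h):
    """Logits = W @ h for a single hidden vector h of length d."""
    return [sum(w * x for w, x in zip(row, h)) for row in head]

def mtp_logits(heads, h):
    """Run every head on the same shared hidden state.
    Entry i of the result holds the logits for the token at offset i + 1."""
    return [apply_head(head, h) for head in heads]

rng = random.Random(0)
d, vocab, n_heads = 8, 16, 3                  # toy sizes, for illustration only
heads = [make_head(d, vocab, rng) for _ in range(n_heads)]
h = [rng.gauss(0.0, 1.0) for _ in range(d)]   # stand-in for the body's output
logits = mtp_logits(heads, h)

# Parameters added by the N-1 extra heads (head 1 exists in any LM).
extra_params = (n_heads - 1) * d * vocab
```

The body runs once per position; only the final projections are repeated, which is why the added cost scales with d × V per extra head rather than with model depth.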
Training loss
# Sum the losses from all heads:
loss = sum over i in 1..N of CE(head_i(h), target_at_pos+i)
This is multi-task learning over forward-looking predictions. It does not hurt the main head's quality: the auxiliary losses act as a regularizer, and they train the extra heads to agree with the main head, which is what speculative decoding relies on at inference.
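As a minimal sketch of that summed loss, the following computes cross-entropy per head against the token at the matching offset. It assumes equal weights for all heads and averages each head's loss over the positions that have a valid target; both choices are assumptions for the example, and real recipes may weight the auxiliary heads differently. The names `cross_entropy` and `mtp_loss` are illustrative.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mtp_loss(per_position_logits, tokens):
    """per_position_logits[t][i] = logits from head i (0-based) at position t.
    Head i at position t is scored against tokens[t + i + 1].
    Positions whose target would fall past the sequence end are skipped."""
    total = 0.0
    n_heads = len(per_position_logits[0])
    for i in range(n_heads):
        losses = [
            cross_entropy(per_position_logits[t][i], tokens[t + i + 1])
            for t in range(len(per_position_logits))
            if t + i + 1 < len(tokens)
        ]
        if losses:
            total += sum(losses) / len(losses)   # mean CE for this head
    return total

# Toy check: uniform logits over a 4-token vocabulary, two heads.
tokens = [0, 1, 2, 3]
uniform = [[[0.0] * 4 for _ in range(2)] for _ in range(len(tokens))]
loss = mtp_loss(uniform, tokens)   # each head contributes ln(4)
```

With uniform logits every head is maximally uncertain, so the total is simply the number of heads times ln(V), which makes a convenient sanity check.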
Inference speedup
At inference the MTP heads propose K speculative tokens, which the main head then verifies. This works like speculative decoding without a separate draft model. Throughput improves by 1.5-2×; the DeepSeek V3 report cites about 1.8× decoding speed when its MTP module is used this way.
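The verify-and-accept loop can be sketched with a toy model. Everything here is hypothetical: `predict` stands in for a model call that returns one token per head, `toy_predict` is a counting sequence whose third head is deliberately wrong on odd inputs, and verification is greedy (a draft is accepted only if it equals the main head's own choice). A real implementation would check all drafts in one batched forward pass instead of calling the model once per draft.

```python
def speculative_step(predict, prefix):
    """One decode step. `predict(seq)` returns a list of N tokens:
    entry 0 is the main head's next token, entries 1.. are draft tokens."""
    proposal = predict(prefix)
    accepted = [proposal[0]]            # the main head's token is always kept
    for draft in proposal[1:]:
        # Accept a draft only if the main head would have produced it too.
        if predict(prefix + accepted)[0] != draft:
            break
        accepted.append(draft)
    return accepted

def generate(predict, prefix, max_new):
    """Decode until at least max_new tokens exist; return tokens and step count."""
    out, steps = list(prefix), 0
    while len(out) - len(prefix) < max_new:
        out += speculative_step(predict, out)
        steps += 1
    return out[len(prefix):len(prefix) + max_new], steps

def toy_predict(seq):
    """Toy 3-head model over digits: the true sequence counts up modulo 10."""
    last = seq[-1]
    head1 = (last + 1) % 10
    head2 = (last + 2) % 10
    # Head 3 is right after even tokens and deliberately wrong after odd ones.
    head3 = (last + 3) % 10 if last % 2 == 0 else (last + 4) % 10
    return [head1, head2, head3]

new_tokens, steps = generate(toy_predict, [0], max_new=6)
```

Because rejected drafts are simply discarded, the output matches plain greedy decoding with the main head; only the number of decode steps changes.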
CPU implications
Extra heads mean extra matmuls, one d × V projection per head, so the compute overhead is small next to the shared body. The inference speedup helps CPU serving significantly. MTP is worth adopting if you can retrain. For existing pretrained models without MTP, external speculative decoding with a small draft model works similarly.
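To put a rough number on that overhead, the following back-of-the-envelope sketch compares the per-token FLOPs of the extra heads with those of the body. It assumes the common rule of thumb of about 2 FLOPs per parameter per token, and the model dimensions are illustrative values rather than figures from any specific model.

```python
def head_overhead(d_model, vocab, body_params, n_extra_heads):
    """Approximate per-token FLOPs of the extra heads relative to the body,
    assuming ~2 FLOPs per parameter per token for dense matmuls."""
    head_flops = 2 * d_model * vocab * n_extra_heads
    body_flops = 2 * body_params
    return head_flops / body_flops

# Illustrative sizes: a 7B-parameter body, d = 4096, V = 32000, 3 extra heads.
ratio = head_overhead(d_model=4096, vocab=32000,
                      body_params=7_000_000_000, n_extra_heads=3)
```

Under these assumptions the extra heads add a few percent of compute per step. The ratio grows with vocabulary size and shrinks as the body gets larger, so it is worth recomputing for your own model.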
Limitations
The quality of distant predictions degrades: head 4 is much weaker than head 1, so K is typically capped at 2-4. The recipe is most effective for fluency-heavy content (text generation, summarization) and gives less benefit on highly stochastic tasks, where draft tokens are rejected more often.
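The diminishing return from adding heads can be made concrete with a small calculation. A draft token only counts if every earlier draft was also accepted, so the expected number of tokens per decode step is one plus a sum of running products of acceptance rates. The rates below are made-up illustrative values, not measurements, and `expected_tokens_per_step` is a hypothetical helper.

```python
def expected_tokens_per_step(accept_rates):
    """accept_rates[k] = probability that draft head k + 2 matches the main
    head, given that all earlier drafts were accepted.
    The main head always contributes exactly one token."""
    expected, survive = 1.0, 1.0
    for rate in accept_rates:
        survive *= rate          # chance that the chain is still unbroken
        expected += survive
    return expected

one_draft = expected_tokens_per_step([0.85])               # 1 + 0.85
three_drafts = expected_tokens_per_step([0.85, 0.6, 0.4])  # 1 + 0.85 + 0.51 + 0.204
```

With these example rates, the first draft head adds 0.85 tokens per step, while the third adds only about 0.2, which illustrates why going beyond a few heads rarely pays for the extra compute.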