Multi-Token Prediction (MTP)

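Standard language models are trained to predict only the next token. Multi-Token Prediction (MTP), used in DeepSeek V3, also predicts tokens further ahead (typically the next 2-4), each offset with its own head, all trained jointly. At inference the extra heads act as a built-in draft for speculative decoding, so no separate draft model is needed. The speedup depends on how often the drafts are accepted: cited figures range from roughly 30% to 1.5-2× throughput. Output quality is not affected, because every drafted token is verified by the main head.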

The architecture

# Shared body produces hidden state h
# Head 1: predicts next token (standard)
# Head 2: predicts token+2
# Head 3: predicts token+3
# ...

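# Each head: a linear layer (d × V) on the shared hidden state

This adds N-1 projection heads. Their compute is small next to the shared body, but each head still carries d·V parameters, which is only negligible when the vocabulary is small or the output projection is shared. Each head is trained on the same input with a different target offset. At inference, head 1 gives the actual next token and heads 2..N propose speculative continuations.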
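
As a rough illustration of the head layout described above, here is a minimal NumPy sketch. The sizes, the random weights, and the names `heads` and `mtp_logits` are assumptions for the example; a real model would produce `h` from a trained transformer body.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, N = 16, 50, 3  # toy hidden size, vocab size, number of heads (assumed values)

# One (d x V) projection per head; heads[0] is the standard next-token head.
heads = [rng.normal(scale=0.1, size=(d, V)) for _ in range(N)]

def mtp_logits(h):
    """Map one shared hidden state h (shape [d]) to logits for offsets +1..+N."""
    return np.stack([h @ W for W in heads])  # shape [N, V]

h = rng.normal(size=d)               # stand-in for the shared body's output
logits = mtp_logits(h)
proposals = logits.argmax(axis=-1)   # proposals[0]: next token; the rest are drafts
```

All heads read the same `h`, so the body runs once per position regardless of N.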

Training loss

# Sum the losses from all heads:
loss = sum over i in 1..N of CE(head_i(h), target_at_pos+i)

This is multi-task learning over forward-looking targets. It is reported not to hurt the main head's quality: the auxiliary losses act as a regularizer, and they train the extra heads to agree with the main head, which is what makes their drafts usable for speculative decoding at inference.
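
The summed loss can be sketched in NumPy as follows. This assumes an unweighted sum over heads and a mean over the positions that still have a target at each offset; both are choices made for the example, and the names `cross_entropy` and `mtp_loss` are hypothetical.

```python
import numpy as np

def cross_entropy(logits, target):
    """CE for one position: -log softmax(logits)[target], computed stably."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

def mtp_loss(hidden, heads, tokens):
    """Sum of per-head mean CE; head i (0-based) at position t targets tokens[t + i + 1].

    hidden: [T, d] shared hidden states, heads: list of [d, V], tokens: [T] ints.
    """
    T = len(tokens)
    total = 0.0
    for i, W in enumerate(heads):
        offset = i + 1
        valid = T - offset  # positions that still have a target this far ahead
        losses = [cross_entropy(hidden[t] @ W, tokens[t + offset]) for t in range(valid)]
        total += float(np.mean(losses))
    return total

# Toy usage with random data (assumed sizes).
rng = np.random.default_rng(0)
T, d, V, N = 8, 4, 10, 3
hidden = rng.normal(size=(T, d))
tokens = rng.integers(0, V, size=T)
heads = [rng.normal(scale=0.1, size=(d, V)) for _ in range(N)]
loss = mtp_loss(hidden, heads, tokens)
```

Heads with larger offsets see fewer valid positions per sequence, which is why the sketch averages per head before summing.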

Inference speedup

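The MTP heads provide K speculative tokens, which the main head then verifies in a single forward pass. This works like speculative decoding without a separate draft model. The gain depends on how many drafts are accepted; 1.5-2× throughput is the commonly cited range, and DeepSeek V3 reports about 1.8× with one extra predicted token. DeepSeek V3 is trained with MTP, and its MTP module can be used this way at inference.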
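
A minimal sketch of greedy verification, assuming exact-match acceptance. The helper `verify_drafts` and the toy `toy_main` model are hypothetical. A real implementation would score all drafted positions in one batched forward pass instead of calling the model once per token.

```python
from typing import Callable, List

def verify_drafts(
    context: List[int],
    drafts: List[int],
    main_next: Callable[[List[int]], int],
) -> List[int]:
    """Keep drafted tokens while the main head agrees, then add one main-head token.

    On the first disagreement the main head's token replaces the rejected
    draft. If every draft is accepted, the main head contributes one more
    token. Either way the call emits (accepted drafts + 1) tokens.
    """
    emitted: List[int] = []
    for draft in drafts:
        expected = main_next(context + emitted)
        if draft != expected:
            emitted.append(expected)  # correction for the rejected draft
            return emitted
        emitted.append(draft)
    emitted.append(main_next(context + emitted))  # bonus token after full acceptance
    return emitted

def toy_main(ctx: List[int]) -> int:
    """Toy stand-in for the main head: next token is (last + 1) mod 10."""
    return (ctx[-1] + 1) % 10

result = verify_drafts([1, 2, 3], [4, 5, 9], toy_main)  # [4, 5, 6]
```

The output is identical to what the main head would produce on its own; the drafts only change how many tokens each step yields.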

CPU implications

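Extra heads mean extra matmuls, one d × V projection each, which is a small compute overhead relative to the body. Because each verification pass can emit more than one token, the speedup helps CPU serving, where decoding is typically limited by how fast weights can be read from memory. Worth adopting if you can retrain. For existing pretrained models without MTP, external speculative decoding with a small draft model works similarly.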
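
To put a number on the overhead, here is a back-of-the-envelope calculation. The model size, hidden width, vocabulary, and head count are hypothetical, and the function name `head_overhead` is made up for the example. Matmul cost scales with the number of weights, so the same ratio roughly applies to per-token compute.

```python
def head_overhead(d_model: int, vocab: int, body_params: float, extra_heads: int) -> dict:
    """Parameters added by extra (d x V) heads, absolute and relative to the body."""
    extra_params = extra_heads * d_model * vocab
    return {
        "extra_params": extra_params,
        "ratio_to_body": extra_params / body_params,
    }

# Hypothetical 7B-parameter body, d=4096, V=32000, 3 extra heads.
stats = head_overhead(4096, 32_000, 7e9, 3)
print(stats["extra_params"], round(stats["ratio_to_body"], 4))
```

For these assumed sizes the extra heads add about 393M weights, or roughly 5.6% of the body. The overhead grows with vocabulary size unless the heads share an output projection.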

Limitations

The quality of distant predictions degrades: head 4 is much weaker than head 1, so fewer of its drafts are accepted. K is therefore typically capped at 2-4. The recipe is most effective for fluency-heavy, predictable content (text generation, summarization) and gives less benefit on highly stochastic tasks, where drafts are rejected more often.
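
The diminishing return from distant heads can be made concrete with a small expected-value calculation. This assumes a draft only counts if every earlier draft was accepted. The acceptance rates below are illustrative, not measured, and `expected_tokens_per_step` is a hypothetical helper.

```python
from typing import Sequence

def expected_tokens_per_step(accept_probs: Sequence[float]) -> float:
    """Expected tokens emitted per verification step.

    accept_probs[k] is the chance that draft k is accepted given that all
    earlier drafts were accepted. The main head always contributes one token.
    """
    expected = 1.0
    survive = 1.0
    for p in accept_probs:
        survive *= p          # probability that drafts 0..k all survive
        expected += survive
    return expected

one_draft = expected_tokens_per_step([0.85])               # 1.85
three_drafts = expected_tokens_per_step([0.85, 0.6, 0.4])  # about 2.564
```

With these assumed rates, the first draft adds 0.85 tokens per step while the third adds only about 0.2. That is the arithmetic behind keeping K small.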

MTP: extra heads predict the next 2-4 tokens, giving speculative decoding without a separate draft model. Inference speedup ranges from about 30% to 1.5-2×, depending on draft acceptance. Used in DeepSeek V3.