Multi-Token Prediction (MTP)

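Multi-Token Prediction (MTP) is the technique that DeepSeek V3 brought to frontier scale, and arguably the one that mattered most. Instead of training the model to predict only the next token, train it to predict the next K tokens from each position. The extra predictions double as draft tokens at inference, so the speedup is real, and the cost in quality is small to none. It is becoming standard in architectures from 2025 onward.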

Single-token prediction limits

Classical LM: predict token N+1 given tokens 1..N. Each forward pass produces one token, so inference is sequential: generating T tokens takes roughly T × the per-token latency. Decoding on GPUs is typically memory-bandwidth-bound, because every step has to read the model's weights from memory just to produce a single token.
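To make the cost concrete, here is a minimal sketch of one-token-at-a-time decoding and a rough bandwidth-bound latency floor. The names `greedy_decode`, `bandwidth_bound_latency`, and `toy_next_token` are hypothetical, the toy "model" is a stand-in for a real forward pass, and the latency estimate assumes the weights are read once per token and ignores KV-cache traffic, batching, and compute time.

```python
from typing import Callable, List, Tuple


def greedy_decode(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    num_new: int,
) -> Tuple[List[int], int]:
    """Generate num_new tokens one at a time; return (tokens, forward_passes)."""
    tokens = list(prompt)
    passes = 0
    for _ in range(num_new):
        # Each new token costs one full forward pass over the context.
        tokens.append(next_token(tokens))
        passes += 1
    return tokens, passes


def bandwidth_bound_latency(
    num_new: int, weight_bytes: float, bytes_per_second: float
) -> float:
    """Rough lower bound in seconds: the weights are read once per generated token."""
    per_token = weight_bytes / bytes_per_second
    return num_new * per_token


def toy_next_token(tokens: List[int]) -> int:
    """Stand-in for a real model: always predicts (last token + 1) mod 10."""
    return (tokens[-1] + 1) % 10


if __name__ == "__main__":
    tokens, passes = greedy_decode(toy_next_token, [3], 4)
    print(tokens, passes)
    # Illustrative numbers only: 140 GB of weights, 3.5 TB/s of memory bandwidth.
    print(bandwidth_bound_latency(100, 140e9, 3.5e12))
```

The number of forward passes equals the number of generated tokens, which is the bottleneck MTP-based drafting tries to relax.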

MTP at training

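Train the model to also predict tokens N+2, N+3, N+4 from position N, using small additional heads on top of the main model. The extra parameters are cheap relative to the main model. The main objective is still "predict the next token"; the auxiliary objectives provide denser supervision and push the model to learn lookahead. (DeepSeek V3 itself uses a small sequential module rather than independent parallel heads, and predicts one extra token.)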
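The training objective described above can be sketched as a weighted sum of per-head cross-entropy losses. This is an illustration under stated assumptions, not the DeepSeek V3 implementation: `mtp_targets` and `mtp_loss` are hypothetical names, head 0 is taken to be the ordinary next-token head, logits are plain Python lists, and `aux_weight` is an illustrative value.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mtp_targets(tokens: List[int], num_heads: int) -> List[List[int]]:
    """targets[k][n] is the token head k must predict from position n.

    Head k looks k + 1 tokens ahead, so its targets are the sequence
    shifted left by k + 1.
    """
    return [tokens[k + 1:] for k in range(num_heads)]


def mtp_loss(
    head_logits: List[List[Sequence[float]]],
    tokens: List[int],
    aux_weight: float = 0.3,  # illustrative weight for the auxiliary heads
) -> float:
    """Main next-token loss plus the weighted mean of the lookahead losses.

    head_logits[k][n] is the logit vector produced by head k at position n.
    """
    targets = mtp_targets(tokens, len(head_logits))
    per_head = []
    for logits_k, targets_k in zip(head_logits, targets):
        # zip drops trailing positions that have no target this far ahead.
        losses = [cross_entropy(l, t) for l, t in zip(logits_k, targets_k)]
        per_head.append(sum(losses) / len(losses))
    main, aux = per_head[0], per_head[1:]
    aux_term = aux_weight * sum(aux) / len(aux) if aux else 0.0
    return main + aux_term


if __name__ == "__main__":
    uniform = [0.0, 0.0, 0.0, 0.0]  # vocabulary of 4, no preference
    tokens = [0, 1, 2, 3]
    print(mtp_loss([[uniform] * 3, [uniform] * 2], tokens))
```

Keeping the main loss unweighted and scaling only the auxiliary term reflects the point that next-token prediction remains the primary objective.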

MTP at inference — speculative decoding

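Use the model's lookahead predictions as draft tokens. The main model verifies them in a single forward pass: each draft token that matches what the main model would have produced is accepted, and the first mismatch is replaced by the main model's own token. When drafts are accepted, one forward pass emits several tokens. A 1.5-2x speedup is typical, with up to 3x on highly predictable text.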
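The accept-or-correct rule can be sketched with greedy verification. This is a simplified illustration: `verify_draft`, `expected_tokens_per_pass`, and `toy_next_token` are hypothetical names, and the toy calls the main model once per draft position for readability, whereas a real server scores all draft positions in one batched forward pass. The expected-throughput helper assumes each draft token is accepted independently with the same probability.

```python
from typing import Callable, List


def verify_draft(
    main_next_token: Callable[[List[int]], int],
    context: List[int],
    draft: List[int],
) -> List[int]:
    """Return the tokens emitted by one verification step (greedy acceptance)."""
    emitted: List[int] = []
    for proposed in draft:
        expected = main_next_token(context + emitted)
        if proposed != expected:
            # First mismatch: keep the main model's token and stop.
            emitted.append(expected)
            return emitted
        emitted.append(proposed)
    # Every draft token matched, so the main model adds one more token.
    emitted.append(main_next_token(context + emitted))
    return emitted


def expected_tokens_per_pass(accept_rate: float, num_draft: int) -> float:
    """Expected tokens emitted per pass: 1 + p + p^2 + ... + p^num_draft."""
    return sum(accept_rate ** i for i in range(num_draft + 1))


def toy_next_token(tokens: List[int]) -> int:
    """Stand-in for the main model: always predicts (last token + 1) mod 10."""
    return (tokens[-1] + 1) % 10


if __name__ == "__main__":
    print(verify_draft(toy_next_token, [1, 2, 3], [4, 5, 9]))
    print(expected_tokens_per_pass(0.5, 2))
```

Because every emitted token is either verified or produced by the main model, the output is identical to plain greedy decoding; only the number of forward passes changes.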

Quality trade-off

The extra heads don't hurt the main next-token quality; the ablations in the DeepSeek V3 paper support this. Some quality gain is possible, since the auxiliary objectives act as a regularizer. Net: no quality cost, real speed win.

Adoption picture

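DeepSeek V3 brought MTP to a frontier-scale model; the idea itself was proposed earlier, by Gloeckle et al. at Meta in 2024. Llama 4 includes a variant. MTP is becoming standard in 2025-2026 frontier model design, and inference servers such as vLLM and SGLang are adding native MTP support.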

In short: train extra heads to predict future tokens, then use their predictions as draft tokens at inference. Expect a 1.5-2x speedup at no quality cost. Standard in 2026.