Multi-Token Prediction (MTP) was the DeepSeek V3 innovation that mattered. Instead of training the model to predict only the next token, train it to predict the next K tokens from each position. The speedup at inference is real; the quality trade-off is small. Increasingly standard in architectures from 2025 onward.
Single-token prediction limits
Classical LM: predict token N+1 given tokens 1..N. Each forward pass produces one token, so inference is sequential: generating T tokens costs roughly T × per-token latency. Decoding is memory-bandwidth-bound on GPUs because every step has to read the model's weights from memory just to produce a single token.
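A back-of-the-envelope sketch of that latency claim. It assumes weight reads dominate each decode step and ignores KV-cache traffic, compute, and batching; the function name and all the numbers are hypothetical, chosen only to make the arithmetic concrete.

```python
def decode_latency_seconds(num_tokens, weight_bytes, bandwidth_bytes_per_s):
    """Rough lower bound on sequential decode latency.

    Assumes every generated token requires one full read of the weights
    from GPU memory, so per-token latency is weight_bytes / bandwidth.
    """
    per_token = weight_bytes / bandwidth_bytes_per_s
    return num_tokens * per_token


# Hypothetical setup: 14 GB of weights, 2 TB/s memory bandwidth, 500 tokens.
latency = decode_latency_seconds(500, 14e9, 2e12)
print(f"{latency:.2f} s")  # 500 tokens * 7 ms per token
```

The point of the estimate: latency scales with the number of forward passes, not with the amount of compute per pass, which is why emitting several tokens per pass helps.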
MTP at training
Train the model to also predict tokens N+2, N+3, N+4 from position N, using small additional heads on top of the main model. The extra parameters are cheap. The main objective is still 'predict next'; the auxiliary objectives provide denser supervision and teach the model to look ahead.
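The combined objective can be sketched in plain Python. This is a toy illustration, not DeepSeek V3's exact formulation: the `aux_weight` value, the function names, and the choice to add each auxiliary head's mean loss with the same weight are all assumptions made here.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mtp_loss(head_logits, tokens, aux_weight=0.3):
    """Main next-token loss plus down-weighted lookahead losses.

    head_logits[k][i] holds the logits from head k at position i, and
    head k is trained to predict tokens[i + k + 1]. Head 0 is the
    ordinary next-token head; heads 1.. are the auxiliary MTP heads.
    aux_weight is a hypothetical hyperparameter.
    """
    total = 0.0
    for k, per_position in enumerate(head_logits):
        offset = k + 1
        losses = [
            cross_entropy(logits, tokens[i + offset])
            for i, logits in enumerate(per_position)
            if i + offset < len(tokens)  # skip positions with no target
        ]
        if not losses:
            continue
        mean_loss = sum(losses) / len(losses)
        total += mean_loss if k == 0 else aux_weight * mean_loss
    return total


# Toy case: vocabulary of 3, uniform logits, so each loss term is ln(3).
tokens = [0, 1, 2, 1]
uniform = [0.0, 0.0, 0.0]
head_logits = [[uniform] * 4, [uniform] * 4]  # head 0 (N+1) and head 1 (N+2)
loss = mtp_loss(head_logits, tokens)
```

With uniform logits the result is ln(3) from the main head plus 0.3 × ln(3) from the single auxiliary head, which makes it easy to check that the weighting behaves as intended.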
MTP at inference — speculative decoding
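Use the model's lookahead predictions as draft tokens. The main model verifies them in one pass. Each accepted draft is an extra token emitted from that forward pass; at the first rejected draft, the main model's own token is used instead and the remaining drafts are discarded. 1.5-2x speedup typical; up to 3x for predictable text.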
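A minimal sketch of the verification step, assuming greedy decoding (a draft is accepted only if it equals the main model's own choice). The function names are hypothetical, and the expected-tokens estimate additionally assumes each draft is accepted independently with the same probability, which real text does not strictly satisfy.

```python
def verify_draft(draft, target_next):
    """Return the tokens emitted by one verification pass.

    draft: k tokens proposed by the lookahead heads.
    target_next: k + 1 tokens chosen by the main model in a single pass,
        where target_next[i] is its pick given the context plus draft[:i].
    """
    emitted = []
    for i, tok in enumerate(draft):
        if tok != target_next[i]:
            # First mismatch: keep the main model's token, drop the rest.
            emitted.append(target_next[i])
            return emitted
        emitted.append(tok)
    # Every draft matched, so the pass also yields one extra token.
    emitted.append(target_next[len(draft)])
    return emitted


def expected_tokens_per_pass(accept_prob, num_draft):
    """Expected tokens emitted per pass: 1 + p + p^2 + ... + p^k."""
    return sum(accept_prob ** i for i in range(num_draft + 1))


partial = verify_draft([5, 7, 9], [5, 7, 2, 4])  # third draft rejected
full = verify_draft([5, 7], [5, 7, 3])           # all drafts accepted
rate = expected_tokens_per_pass(0.85, 1)         # one draft, 85% acceptance
```

Every pass emits at least one token, so the output is identical to ordinary greedy decoding; only the number of passes changes. The estimate ignores the cost of producing the drafts and of verifying the extra positions, so the actual speedup is somewhat lower than the tokens-per-pass figure.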
Quality trade-off
The extra heads don't hurt main next-token quality (the DeepSeek V3 paper reports ablations on this). Some quality gain is possible from the regularization effect. Net: no quality cost, real speed win.
Adoption picture
DeepSeek V3 brought MTP into a frontier-scale model. Llama 4 includes a variant. Standard in 2025-2026 frontier model design. Inference servers (vLLM, SGLang) are adding native MTP support.