▶ Interactive Lab

Speculative Decoding

Draft model proposes K tokens; main model verifies in parallel.

Small draft model proposes K=4 tokens; big model verifies all K in one forward pass.

What you're seeing

Token-by-token decoding is memory-bandwidth-bound: every forward pass has to read all of the model's weights from memory, which is tens of GB for a large model. Verifying K speculative tokens in one forward pass costs roughly the same memory traffic as generating 1. If the acceptance rate is high, you get close to K tokens per pass instead of 1, which is a major speedup.

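Typical acceptance rate: 60–80% of drafted tokens. The draft model is usually from the same family and ~10× smaller. Net speedup: 1.5–2.5× at no quality cost, because every emitted token is checked against the big model and the output matches what it would have generated on its own.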
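To make the link between acceptance rate and speedup concrete, here is a minimal Python sketch. It rests on assumptions that are not part of the lab: each drafted token is accepted independently with probability `alpha`, the big model always contributes one token of its own per pass (so the maximum is K+1), and one draft step costs a fixed fraction `draft_cost` of a big-model pass. The function names and the 0.1 default are hypothetical, chosen for illustration only.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per big-model pass.

    Assumes each drafted token is accepted independently with probability
    alpha, and that the big model always adds one token of its own: the
    correction at the first rejection, or a bonus token if all k pass.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float = 0.1) -> float:
    """Tokens per unit of time relative to plain one-token-per-pass decoding.

    draft_cost is the assumed time of one draft step as a fraction of one
    big-model pass; 0.1 is a placeholder, not a measured value.
    """
    return expected_tokens_per_pass(alpha, k) / (1.0 + k * draft_cost)


if __name__ == "__main__":
    for alpha in (0.10, 0.60, 0.80, 0.95):
        tokens = expected_tokens_per_pass(alpha, 4)
        speedup = estimated_speedup(alpha, 4)
        print(f"alpha={alpha:.2f}  tokens/pass={tokens:.2f}  speedup={speedup:.2f}x")
```

Under these assumptions, K=4 with 60–80% acceptance works out to roughly 1.6–2.4×, which is in line with the range quoted above. Real speedups depend on hardware, batch size, and how correlated the acceptances are.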

★ KEY TAKEAWAY
Speculative decoding keeps the longest prefix of drafted tokens that the big model agrees with and discards the rest. When the draft is decent, the speedup is essentially free.
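The prefix rule can be sketched in a few lines of Python. This is an illustration under the assumption of greedy decoding, where "agrees" means the big model's top pick equals the drafted token; sampling-based variants use a probabilistic accept/reject test that is not shown here. The function name `verify_greedy` and its arguments are hypothetical.

```python
from typing import List, Sequence


def verify_greedy(drafted: Sequence[int], target_choice: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one verification pass.

    drafted: the K token ids proposed by the draft model.
    target_choice: K + 1 token ids, where target_choice[i] is the big
        model's greedy pick at position i given the context plus drafted[:i].
    """
    if len(target_choice) != len(drafted) + 1:
        raise ValueError("target_choice must have one more entry than drafted")
    emitted: List[int] = []
    for i, token in enumerate(drafted):
        if token != target_choice[i]:
            # First disagreement: emit the big model's token and stop.
            emitted.append(target_choice[i])
            return emitted
        emitted.append(token)
    # Every draft matched, so the big model's next pick is kept as well.
    emitted.append(target_choice[len(drafted)])
    return emitted


if __name__ == "__main__":
    print(verify_greedy([5, 9, 2, 7], [5, 9, 4, 1, 3]))  # [5, 9, 4]
    print(verify_greedy([5, 9, 2, 7], [5, 9, 2, 7, 3]))  # [5, 9, 2, 7, 3]
```

Because the pass always emits at least the big model's own token, a bad draft costs time but never changes the output.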
▶ WHAT TO TRY
  • Slide Acceptance rate from 10% to 95%.
  • Click Generate next batch to see typical accept patterns.
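For readers without the widget, the accept patterns can be approximated offline. The sketch below is not the lab's own code: it assumes each drafted token is accepted independently with probability `alpha` and that everything after the first rejection is discarded. `simulate_batch` and `render` are hypothetical names.

```python
import random
from typing import List


def simulate_batch(alpha: float, k: int, rng: random.Random) -> List[bool]:
    """One drafted batch: True while tokens are accepted, False from the
    first rejection onward (later drafts are thrown away)."""
    pattern: List[bool] = []
    rejected = False
    for _ in range(k):
        if not rejected and rng.random() >= alpha:
            rejected = True
        pattern.append(not rejected)
    return pattern


def render(pattern: List[bool]) -> str:
    """Show a batch as a string, 'A' for accepted and 'x' for discarded."""
    return "".join("A" if ok else "x" for ok in pattern)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for alpha in (0.10, 0.50, 0.95):
        batches = [simulate_batch(alpha, 4, rng) for _ in range(5)]
        print(f"alpha={alpha:.2f}: " + " ".join(render(b) for b in batches))
```

At low acceptance rates most batches are rejected at the first or second token, while near 95% most batches are accepted in full.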