Small draft model proposes K=4 tokens; big model verifies all K in one forward pass.
What you're seeing
Inference is memory-bandwidth-bound: each forward pass has to read all of the model's weights from memory, which is tens of GB for a large model. One forward pass can verify K speculative tokens for roughly the same memory cost as generating 1. If the acceptance rate is high, you get up to K tokens per pass instead of 1, which is a major speedup.
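A rough back-of-the-envelope estimate shows why the memory cost dominates. The sketch below assumes the time for one pass is set entirely by reading the weights once. The model size (70B parameters at 2 bytes each) and the 2 TB/s memory bandwidth are hypothetical numbers chosen for illustration, not figures from this demo.

```python
def pass_time_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time for one forward pass if reading the weights is the only cost."""
    return weight_bytes / bandwidth_bytes_per_s


# Hypothetical hardware and model, for illustration only.
WEIGHT_BYTES = 70e9 * 2   # 70B parameters, 2 bytes per parameter
BANDWIDTH = 2e12          # 2 TB/s memory bandwidth

t_pass = pass_time_seconds(WEIGHT_BYTES, BANDWIDTH)

# Plain decoding emits one token per pass.
plain_tokens_per_s = 1 / t_pass

print(f"time per pass: {t_pass:.3f} s")
print(f"plain decoding ceiling: {plain_tokens_per_s:.1f} tokens/s")
```

Under these assumptions a pass takes about 0.07 s, so plain decoding tops out near 14 tokens per second. Each extra token accepted per pass raises that ceiling without adding weight reads.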
Typical acceptance: 60–80%. Draft model is same-family, ~10× smaller. Net speedup: 1.5–2.5× at no quality cost (verified tokens are exactly what the big model would have generated).
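The speedup range can be sanity-checked with a simple model. This is a sketch under stated assumptions, not the demo's own calculation: each drafted token is accepted independently with probability alpha, acceptance stops at the first rejection, the verification pass always contributes one token from the big model itself, and one draft pass costs a fixed fraction of a big-model pass (0.1 here, matching "~10× smaller"). The function names are hypothetical.

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected number of drafted tokens accepted per verification pass.

    Token i is accepted only if tokens 1..i were all accepted,
    which happens with probability alpha**i (independence assumed).
    """
    return sum(alpha ** i for i in range(1, k + 1))


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per unit cost, relative to plain decoding (1 token per big pass).

    draft_cost is the cost of one draft pass in units of one big-model pass.
    """
    # Accepted drafts plus the one token the big model supplies itself.
    tokens = expected_accepted(alpha, k) + 1
    # K draft passes plus one big-model verification pass.
    cost = k * draft_cost + 1
    return tokens / cost


for alpha in (0.6, 0.7, 0.8):
    print(alpha, round(estimated_speedup(alpha, k=4, draft_cost=0.1), 2))
```

With K=4 and acceptance between 60% and 80%, this model gives estimates of roughly 1.6× to 2.4×, in line with the range quoted above.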
★ KEY TAKEAWAY
Speculative decoding keeps the longest prefix of drafted tokens that the big model agrees with and discards the rest. When the draft is decent, the speedup comes at no quality cost.
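The prefix rule can be sketched in a few lines. This minimal version assumes greedy decoding, where "agrees" means the big model's top choice equals the drafted token; sampling-based variants use a probabilistic acceptance test instead. The function name and the convention that the big model returns K+1 choices from its single pass are assumptions for illustration.

```python
def accept_prefix(drafted: list[str], target: list[str]) -> list[str]:
    """Return the tokens emitted after one verification pass.

    drafted: the K tokens proposed by the draft model.
    target:  the big model's greedy choice at each of the K drafted
             positions, plus one more for the position after them (K+1 total).
    """
    n = 0
    for d, t in zip(drafted, target):
        if d != t:
            break  # first disagreement: everything after it is discarded
        n += 1
    # Keep the agreed prefix, then append the big model's own token at
    # position n. Later entries of target were conditioned on rejected
    # draft tokens, so they are not used.
    return drafted[:n] + [target[n]]


print(accept_prefix(["the", "cat", "sat", "on"],
                    ["the", "cat", "lay", "down", "and"]))
```

Here the first two drafted tokens match, the third does not, so the pass emits the two agreed tokens followed by the big model's own third token.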
▶ WHAT TO TRY
- Slide Acceptance rate from 10% to 95%.
- Click Generate next batch to see typical accept patterns.