Advertisement
Draft proposes K tokens. Big model verifies in parallel. Keep accepted prefix.
What you're seeing
Acceptance rate 60-80% typical. K=4 with 70% accept → ~3 tokens per big-model step → ~3× speedup.
★ KEY TAKEAWAY
Speculative decoding: draft proposes K tokens, big model verifies in parallel. 60-80% acceptance → 1.5-3× speedup at zero quality cost.
▶ WHAT TO TRY
- Slide Accept rate from 20% to 95%.
- Click Run cycle repeatedly to see typical acceptance patterns.
- The math guarantees the output distribution matches the big model exactly.