Speculative Decoding — Belgavi.AI Lab

Acceptance rate 65%

Small draft model proposes K=4 tokens; big model verifies all K in one forward pass.

What you're seeing

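Inference on a large model is memory-bandwidth-bound: every forward pass has to stream the model's weights (tens of GB) from memory, and that transfer dominates the time per pass. A single forward pass can verify K speculative tokens for roughly the same memory traffic as generating 1. If the acceptance rate is high, you get up to K tokens per pass instead of 1, which is a major speedup.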

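Typical acceptance: 60–80%. The draft model is from the same family and ~10× smaller. Net speedup: 1.5–2.5× at no quality cost (under greedy decoding the accepted tokens are exactly what the big model would have generated on its own; under sampling, the standard accept/reject rule preserves the big model's output distribution).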
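
To see how an acceptance rate turns into a speedup, here is a rough back-of-the-envelope sketch. It rests on assumptions that are not part of the demo: each drafted token is accepted independently with the same probability `alpha`, every verification pass also yields one token from the big model itself (the correction at the first mismatch, or a bonus token if all K are accepted), and one draft-model step costs a fixed fraction `draft_cost_ratio` of a big-model pass. The function names are illustrative only.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per big-model pass.

    Assumes each drafted token is accepted independently with
    probability alpha (a simplification, not a measured property).
    Counts the accepted prefix plus the one token the big model
    always contributes: 1 + alpha + alpha^2 + ... + alpha^k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # geometric-series formula divides by zero here
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Tokens per pass divided by the relative cost of one round.

    One round = k draft steps (each costing draft_cost_ratio of a
    big-model pass, an assumed constant) plus 1 big-model pass.
    """
    round_cost = k * draft_cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, k) / round_cost


if __name__ == "__main__":
    # Demo defaults: 65% acceptance, K=4; 0.1 models a ~10x smaller draft.
    print(round(expected_tokens_per_pass(0.65, 4), 2))  # about 2.53
    print(round(estimated_speedup(0.65, 4, 0.1), 2))    # about 1.80
```

Under these assumptions the defaults land inside the 1.5–2.5× range quoted above. Real acceptance is not independent across positions, so treat the numbers as a rough guide.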

★ KEY TAKEAWAY

Speculative decoding keeps the longest prefix of drafted tokens that the big model agrees with and discards the rest. The speedup is nearly free when the draft model is decent.
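
The prefix rule can be sketched in a few lines. This is a minimal illustration assuming greedy decoding, where "agrees" means the big model's top choice at a position equals the drafted token. Sampling-based variants use a probabilistic accept/reject rule that is not shown here. `accept_prefix` and its arguments are hypothetical names; token ids are plain integers.

```python
from typing import List, Sequence


def accept_prefix(drafted: Sequence[int], target_choice: Sequence[int]) -> List[int]:
    """Return the tokens emitted after one verification pass (greedy case).

    drafted:       the K tokens proposed by the draft model.
    target_choice: K+1 tokens; target_choice[i] is the big model's greedy
                   pick given the context plus drafted[:i]. All K+1 picks
                   come out of the same single forward pass.
    """
    if len(target_choice) != len(drafted) + 1:
        raise ValueError("target_choice must have one more entry than drafted")

    emitted: List[int] = []
    for i, token in enumerate(drafted):
        if token == target_choice[i]:
            emitted.append(token)  # big model agrees: keep the drafted token
        else:
            # First disagreement: emit the big model's token instead and
            # discard every drafted token after this position.
            emitted.append(target_choice[i])
            return emitted
    # Every drafted token was accepted: the pass also yields one extra token.
    emitted.append(target_choice[len(drafted)])
    return emitted


if __name__ == "__main__":
    # Draft matches at positions 0 and 1, diverges at position 2.
    print(accept_prefix([5, 7, 9, 2], [5, 7, 1, 8, 3]))  # [5, 7, 1]
```

Whatever the draft proposes, the emitted sequence is what the big model would have produced greedily, which is why the quality is unchanged.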

▶ WHAT TO TRY

Slide Acceptance rate from 10% to 95%.
Click Generate next batch to see typical accept patterns.
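
For experimenting away from the page, the sketch below imitates the two controls. It is not the demo's actual code: it assumes each drafted token is accepted independently with probability `alpha`, and the pattern notation (`A` accepted, `R` first rejection, `-` discarded) is made up for this example.

```python
import random


def simulate_batch(alpha: float, k: int, rng: random.Random) -> str:
    """Return one accept pattern of length k.

    'A' = drafted token accepted, 'R' = first rejected token,
    '-' = drafted tokens discarded because they follow a rejection.
    Assumes independent acceptance with probability alpha.
    """
    pattern = []
    for i in range(k):
        if rng.random() < alpha:
            pattern.append("A")
        else:
            pattern.append("R")
            pattern.extend("-" * (k - i - 1))  # everything after is thrown away
            break
    return "".join(pattern)


def mean_accepted(alpha: float, k: int, batches: int, seed: int = 0) -> float:
    """Average number of accepted drafted tokens per batch."""
    rng = random.Random(seed)
    total = sum(simulate_batch(alpha, k, rng).count("A") for _ in range(batches))
    return total / batches


if __name__ == "__main__":
    rng = random.Random(0)
    # "Generate next batch" a few times at the default 65%.
    for _ in range(5):
        print(simulate_batch(0.65, 4, rng))
    # Sweep the slider range from 10% to 95%.
    for alpha in (0.10, 0.35, 0.65, 0.80, 0.95):
        print(f"{alpha:.0%}: {mean_accepted(alpha, 4, 10_000):.2f} accepted per batch")
```

At low acceptance rates most batches stop at the first token, so the draft work is largely wasted. Near 95% most batches accept all 4.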