VAD tells you whether the current audio frame contains speech. Used everywhere: skip transmitting silence (bandwidth), trigger STT (wake word), detect end-of-utterance (turn-taking in voice agents). The 2026 baseline is neural — much better than energy threshold.

Advertisement

Energy-threshold VAD

Compare frame energy to background noise floor. Cheap, but fooled by stationary noise (fans, traffic). Acceptable for bandwidth savings; not for triggering ASR.

Neural VAD (Silero, py-webrtcvad)

Small CNN/RNN classifies frames. ~5-10ms per frame, near-zero false positives on stationary noise. Default in voice agents (LiveKit, Whisper streaming).

Advertisement

End-of-utterance detection

Silence duration after detected speech. Tune by language and intent: English commands ~500ms, conversational ~800-1200ms. Too short = cuts user off; too long = laggy agent. Often the highest-leverage UX tuning in a voice agent.

Silero VAD for triggering; energy VAD for bandwidth. End-of-utterance silence threshold is the UX lever.