Voice agents are the highest-stakes UX for ADKs: every millisecond of latency matters, and every misrecognition is embarrassing. The pipeline is conceptually simple (STT → LLM → TTS), but production-ready voice agents have 8-12 components and a tight latency budget.

Latency budget

Target: under 500ms from the moment the user stops speaking to the moment the agent starts speaking. Budget: VAD (50ms) + STT finalize (100ms) + LLM TTFT (150ms) + TTS first audio (100ms) + transport (50ms), which totals 450ms and leaves 50ms of headroom. Anything over 500ms feels laggy.
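
A budget like this is easier to hold when it is checked against measurements. The sketch below is a minimal illustration, not a prescribed tool: the stage names, `overruns`, and `within_target` are hypothetical, and the numbers are the per-stage figures from the budget above.

```python
from typing import Dict

# Per-stage budget in milliseconds, copied from the figures above.
# The stage keys are illustrative names, not part of any library.
BUDGET_MS: Dict[str, float] = {
    "vad": 50,
    "stt_finalize": 100,
    "llm_ttft": 150,
    "tts_first_audio": 100,
    "transport": 50,
}
TARGET_MS = 500.0


def overruns(measured_ms: Dict[str, float]) -> Dict[str, float]:
    """Return, per stage, how many ms it exceeded its budget by (overrunning stages only)."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in BUDGET_MS.items()
        if measured_ms.get(stage, 0.0) > limit
    }


def within_target(measured_ms: Dict[str, float]) -> bool:
    """True when the end-to-end total stays under the overall target."""
    return sum(measured_ms.values()) < TARGET_MS
```

Checking both per-stage overruns and the total is a deliberate choice: a turn can stay under 500ms overall while one stage (often LLM TTFT) is quietly eating the headroom.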

Streaming everywhere

Don't wait for the full STT result: start the LLM on partial transcripts (with a debounce). Don't wait for the full LLM response: stream text to TTS as it arrives. Handle interruptions the same way: when VAD detects the user speaking, cancel any in-flight LLM and TTS work.
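
As a rough sketch of streaming LLM output into TTS and cancelling it on interruption, the example below uses Python's `asyncio`. Everything named here is an assumption for illustration: `fake_llm_stream` stands in for a streaming LLM client, appending to a list stands in for pushing chunks to a streaming TTS engine, and the `barge_in_after` delay simulates VAD detecting user speech.

```python
import asyncio
from typing import AsyncIterator, List, Optional


async def fake_llm_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM client: yields tokens one at a time."""
    for token in ["Sure,", " your", " appointment", " is", " booked."]:
        await asyncio.sleep(0.02)  # simulated per-token latency
        yield token


async def speak(prompt: str, spoken: List[str]) -> None:
    """Forward each token onward as soon as it arrives instead of waiting for the full reply."""
    async for token in fake_llm_stream(prompt):
        spoken.append(token)  # a real agent would hand this chunk to a streaming TTS


async def run_turn(prompt: str, barge_in_after: Optional[float]) -> List[str]:
    """Run one agent turn; if the user barges in, cancel the in-flight LLM/TTS task."""
    spoken: List[str] = []
    task = asyncio.create_task(speak(prompt, spoken))
    if barge_in_after is not None:
        await asyncio.sleep(barge_in_after)  # simulated VAD trigger on the user side
        task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected when the turn was interrupted
    return spoken
```

Calling `asyncio.run(run_turn("book it", None))` lets the full reply through, while passing a small `barge_in_after` value cuts it short. Keeping the whole turn inside one task means a single `cancel()` stops both token generation and speech output.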

End-of-utterance detection

Detect the end of an utterance with a silence threshold after detected speech; 500-800ms is typical. Too short and the agent cuts the user off; too long and responses feel laggy. This is the single highest-leverage UX tuning knob, and it should be adapted per language and conversation type.
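
The silence-threshold rule can be sketched as a small state machine over VAD frame decisions. This is a minimal illustration under stated assumptions: the class name, the 20ms frame size, and the 600ms default are hypothetical choices, and a real VAD would supply the per-frame `is_speech` flag.

```python
class EndOfUtteranceDetector:
    """Signals end of utterance after enough consecutive silent frames following speech."""

    def __init__(self, silence_ms: int = 600, frame_ms: int = 20) -> None:
        # Number of consecutive silent frames that make up the silence threshold.
        self.frames_needed = silence_ms // frame_ms
        self.heard_speech = False
        self.silent_frames = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD frame decision; return True on the frame where the utterance ends."""
        if is_speech:
            self.heard_speech = True
            self.silent_frames = 0  # any speech resets the silence counter
            return False
        if not self.heard_speech:
            return False  # leading silence: there is no utterance to end yet
        self.silent_frames += 1
        if self.silent_frames >= self.frames_needed:
            # Reset so the detector is ready for the next utterance.
            self.heard_speech = False
            self.silent_frames = 0
            return True
        return False
```

Because the threshold is a constructor argument, adapting it per language or conversation type is a matter of creating the detector with a different `silence_ms`.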

State and tool use

Voice agents call tools (book an appointment, check a status). Tool descriptions matter doubly because the model has to pick the right tool quickly, within the latency budget. Pre-warm the state of common tools to avoid a cold start in the loop.
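
One possible shape for pre-warming is to load each common tool's state once at session start, so the first call inside the conversation loop hits memory rather than a slow backend. The sketch below assumes a hypothetical `WarmToolState` helper and caller-supplied loader functions; it does not reflect any particular ADK's API.

```python
from typing import Any, Callable, Dict


class WarmToolState:
    """Caches per-tool state so calls inside the voice loop avoid cold starts."""

    def __init__(self, loaders: Dict[str, Callable[[], Any]]) -> None:
        self._loaders = loaders  # tool name -> function that builds its state
        self._state: Dict[str, Any] = {}

    def prewarm(self) -> None:
        """Run every loader up front, e.g. while the greeting is being spoken."""
        for name, load in self._loaders.items():
            if name not in self._state:
                self._state[name] = load()

    def get(self, name: str) -> Any:
        """Return cached state; fall back to loading on demand (the slow, cold path)."""
        if name not in self._state:
            self._state[name] = self._loaders[name]()
        return self._state[name]
```

A usage example: build it with something like `WarmToolState({"calendar": load_calendar})`, where `load_calendar` is whatever expensive setup the tool needs, call `prewarm()` when the session opens, and use `get("calendar")` from the tool handler.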

Failure modes

Common failures: STT misrecognition (especially of names and addresses), TTS mispronunciation, LLM-hallucinated tool arguments, and network latency spikes. Each needs a recovery path: clarify, spell out, verify, retry. Real voice agents have these built in; without them they feel brittle.
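
Two of these recoveries, clarify and verify, can be sketched as a gate in front of every tool call. The function name, the returned dictionary shape, and the 0.8 confidence threshold are all assumptions made for illustration; real STT engines report confidence in their own ways.

```python
from typing import Any, Dict, List

CONFIRM_BELOW = 0.8  # assumed STT confidence threshold; tune per engine and use case


def plan_tool_call(
    args: Dict[str, Any], required: List[str], stt_confidence: float
) -> Dict[str, Any]:
    """Decide whether to call the tool, ask for missing details, or read the args back."""
    # Guard against missing or empty arguments produced by the LLM.
    missing = [key for key in required if not args.get(key)]
    if missing:
        return {"action": "clarify", "ask_for": missing}
    # Low-confidence transcripts (names, addresses) get read back before acting.
    if stt_confidence < CONFIRM_BELOW:
        return {"action": "verify", "read_back": args}
    return {"action": "call", "args": args}
```

Spelling out and retrying would slot in the same way: the "verify" branch can trigger a letter-by-letter read-back for names, and the "call" branch can wrap the tool invocation in a bounded retry for network spikes.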

In short: a streaming pipeline, a <500ms budget, smart end-of-utterance detection, tool design for speed, and recovery for misrecognition. Voice agents are unforgiving.