A WebRTC audio frame travels through 8-12 processing stages between mic and speaker. Each stage affects latency, quality, or robustness. Knowing them lets you debug 'why does my voice sound robotic' without guessing.

Capture and pre-processing

Mic → APM (Audio Processing Module): echo cancellation, noise suppression, automatic gain control (AGC), and voice activity detection. Adds ~10 ms. Tuning differs by browser and library, which makes this stage a perennial source of 'why does my voice sound weird' bugs.
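One way to check whether the APM is behind a 'weird voice' bug is to switch its stages off one at a time and listen. The sketch below builds a getUserMedia constraints object using the standard echoCancellation, noiseSuppression, and autoGainControl constraints. The helper name buildAudioConstraints and the ApmOptions type are made up for illustration, and browsers treat these constraints as requests that they may not fully honor. Voice activity detection is not exposed this way, so it is left out.

```typescript
// Hypothetical helper (not part of any library): builds a constraints object
// that toggles the browser-side APM stages individually.
interface ApmOptions {
  echoCancellation: boolean;
  noiseSuppression: boolean;
  autoGainControl: boolean;
}

function buildAudioConstraints(
  overrides: Partial<ApmOptions> = {}
): { audio: ApmOptions; video: false } {
  // Assumption: start from "everything on" and override per experiment.
  const defaults: ApmOptions = {
    echoCancellation: true,
    noiseSuppression: true,
    autoGainControl: true,
  };
  return { audio: { ...defaults, ...overrides }, video: false };
}

// Browser-only usage, shown as comments so the block also runs outside a browser:
// const stream = await navigator.mediaDevices.getUserMedia(
//   buildAudioConstraints({ noiseSuppression: false })
// );
// // getSettings() reports what the browser actually applied.
// console.log(stream.getAudioTracks()[0].getSettings());
```

Comparing getSettings() against what was requested shows whether the browser actually applied the change before you draw conclusions from how the audio sounds.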

Encoding and transport

APM output → Opus encoder (10-20 ms frames) → RTP packetization → SRTP encryption → network path established by ICE, using STUN and TURN for NAT traversal. Together these add a baseline of 20-50 ms.
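The frame duration sets both the minimum encoder delay and the packet rate, so the arithmetic is worth making concrete. The sketch below assumes the 48 kHz RTP clock that Opus uses and an illustrative 32 kbit/s target bitrate. The function names are hypothetical, and the size estimate leaves out SRTP authentication tags and UDP/IP headers.

```typescript
// RTP's fixed header is 12 bytes (no CSRCs, no header extensions).
const RTP_FIXED_HEADER_BYTES = 12;

// Number of audio samples covered by one frame.
function samplesPerFrame(sampleRateHz: number, frameMs: number): number {
  return (sampleRateHz * frameMs) / 1000;
}

// Approximate encoded payload size for one frame at a constant bitrate.
function payloadBytesPerFrame(bitrateBps: number, frameMs: number): number {
  return Math.round((bitrateBps / 8) * (frameMs / 1000));
}

// One packet per frame, so shorter frames mean more packets per second.
function packetsPerSecond(frameMs: number): number {
  return 1000 / frameMs;
}

// Assumed example values: 48 kHz clock, 20 ms frames, 32 kbit/s.
const frameMs = 20;
const samples = samplesPerFrame(48000, frameMs);       // 960
const payload = payloadBytesPerFrame(32000, frameMs);  // 80
const rtpPacketBytes = payload + RTP_FIXED_HEADER_BYTES;
console.log({ samples, payload, rtpPacketBytes, pps: packetsPerSecond(frameMs) });
```

Halving the frame to 10 ms halves the per-frame delay but doubles the packet rate, so fixed header overhead takes a larger share of the bandwidth.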

Decoding and playback

RTP packets → jitter buffer (10-200ms) → Opus decoder → mixer (if multi-party) → output device. Total ear-to-ear latency on a healthy connection: 100-300ms.
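The jitter buffer is where latency is traded for smoothness: packets are held for a target delay so that late or reordered arrivals can still be played in sequence. The sketch below is a deliberately minimal illustration, not how any particular WebRTC implementation works. The class name, the fixed target delay, and the string payloads are all assumptions. It ignores sequence-number wraparound and loss concealment, so a missing packet stalls playout here, where a real buffer would conceal the gap and move on.

```typescript
interface Packet {
  seq: number;       // RTP-style sequence number (wraparound ignored here)
  arrivalMs: number; // local arrival time
  payload: string;   // stand-in for encoded audio
}

class SimpleJitterBuffer {
  private held = new Map<number, Packet>();
  private nextSeq: number | null = null;
  private targetDelayMs: number;

  constructor(targetDelayMs: number) {
    this.targetDelayMs = targetDelayMs;
  }

  // Store a packet; anything older than the playout point is dropped as too late.
  push(p: Packet): void {
    if (this.nextSeq === null) this.nextSeq = p.seq;
    if (p.seq >= this.nextSeq) this.held.set(p.seq, p);
  }

  // Return the next in-order packet once it has been held for the target delay.
  pop(nowMs: number): Packet | null {
    const seq = this.nextSeq;
    if (seq === null) return null;
    const p = this.held.get(seq);
    if (p === undefined || nowMs - p.arrivalMs < this.targetDelayMs) return null;
    this.held.delete(seq);
    this.nextSeq = seq + 1;
    return p;
  }
}

// Packets 1, 3, 2 arrive out of order; playout is still 1, 2, 3.
const buf = new SimpleJitterBuffer(40);
buf.push({ seq: 1, arrivalMs: 0, payload: "a" });
buf.push({ seq: 3, arrivalMs: 10, payload: "c" });
buf.push({ seq: 2, arrivalMs: 25, payload: "b" });
```

A larger target delay absorbs more reordering and jitter but adds directly to ear-to-ear latency, which is why adaptive buffers grow and shrink with network conditions.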

In short: APM → Opus encode → SRTP → jitter buffer → Opus decode → output. Each stage consumes part of the latency budget; the total is typically around 150 ms.
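A latency budget like this can be written down and summed, which makes it easy to see which stage to attack first. The per-stage numbers below are illustrative assumptions chosen to fall inside the ranges quoted above and to add up to roughly 150 ms. They are not measurements, and the decode and playout figure in particular is a guess.

```typescript
interface Stage {
  name: string;
  latencyMs: number;
}

// Assumed example values, not measurements.
const exampleBudget: Stage[] = [
  { name: "capture + APM", latencyMs: 10 },
  { name: "Opus encode + transport", latencyMs: 35 },
  { name: "jitter buffer", latencyMs: 80 },
  { name: "decode + playout", latencyMs: 25 },
];

// Sum of all stage latencies.
function totalLatencyMs(stages: Stage[]): number {
  return stages.reduce((sum, s) => sum + s.latencyMs, 0);
}

// Stage contributing the most latency (expects a non-empty list).
function largestStage(stages: Stage[]): Stage {
  return stages.reduce((a, b) => (b.latencyMs > a.latencyMs ? b : a));
}

console.log(totalLatencyMs(exampleBudget), largestStage(exampleBudget).name);
```

With these assumed numbers the jitter buffer dominates, which matches its wide 10-200 ms range: it is usually the first place to look when latency climbs.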