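Acoustic Echo Cancellation (AEC) removes the far-end talker's voice, which your speaker plays and your microphone picks up again, from the mic signal before it is sent back to the far end. Without it, the far end hears a delayed copy of their own voice on every speakerphone call, and if neither side cancels echo the loop can build into howling feedback. Modern AEC uses adaptive filters that learn the acoustic path from speaker to mic and subtract the estimated echo in real time.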
The signal model
The far-end signal x(n) plays through the speaker, passes through the room acoustics, and reaches the mic with delay and reverberation. The mic signal is d(n) = h(n)*x(n) + s(n) + noise, where * denotes convolution, h(n) is the unknown impulse response of the speaker-room-mic path, and s(n) is the near-end talker's speech. AEC forms an estimate ĥ(n) of h(n) and subtracts the predicted echo ĥ(n)*x(n) from d(n).
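The signal model can be tried out with a small simulation. This is only a sketch: the helper names (convolve, simulate_mic) and the toy impulse response are made up for illustration, and a real room response would have thousands of taps.

```python
import random

def convolve(h, x):
    """Causal FIR convolution: y[n] = sum_k h[k] * x[n - k], same length as x."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(min(len(h), n + 1)):
            acc += h[k] * x[n - k]
        y.append(acc)
    return y

def simulate_mic(x, h, s, noise_std=0.0, seed=0):
    """Return d(n) = (h * x)(n) + s(n) + noise for equal-length x and s."""
    rng = random.Random(seed)
    echo = convolve(h, x)
    return [e + sn + rng.gauss(0.0, noise_std) for e, sn in zip(echo, s)]

# Toy room (assumed values): 2 samples of bulk delay, a direct path,
# then one weaker reflection.
h_room = [0.0, 0.0, 0.5, 0.0, 0.25]
x_far = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # unit impulse from the far end
s_near = [0.0] * 6                        # near-end talker silent
d_mic = simulate_mic(x_far, h_room, s_near)
```

With an impulse as the far-end signal and a silent near end, the mic signal is just the room response itself, which is exactly what the canceller has to learn.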
NLMS adaptive filter
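Normalized Least Mean Squares (NLMS) updates the filter coefficients ĥ(n) in proportion to the error times the input vector, divided by the input energy: ĥ(n+1) = ĥ(n) + μ·e(n)·x(n) / (‖x(n)‖² + ε), where e(n) is the mic sample minus the echo estimate, x(n) here stands for the vector of the N most recent reference samples, and ε is a small constant that prevents division by zero. The normalization keeps the effective step size stable as the far-end level varies, and the cost is low (O(N) per sample), though convergence slows on highly correlated signals such as speech. μ = 0.2 is typical; stability requires 0 < μ < 2. Filter length should cover the expected echo tail (e.g., 200 ms at 16 kHz = 3200 taps).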
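A minimal NLMS sketch in plain Python, assuming a white-noise reference and a 4-tap toy echo path so it runs quickly. The function name nlms, the regularization constant eps, and the test signal are illustrative choices, not taken from any particular library.

```python
import random

def nlms(x, d, num_taps, mu=0.2, eps=1e-8):
    """Run NLMS over reference x and mic d; return (error signal, final taps)."""
    h_hat = [0.0] * num_taps
    buf = [0.0] * num_taps          # most recent reference samples, newest first
    e_out = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(w * b for w, b in zip(h_hat, buf))   # echo estimate
        e = dn - y                                   # echo-cancelled output
        norm = sum(b * b for b in buf) + eps         # input energy (regularized)
        step = mu * e / norm
        h_hat = [w + step * b for w, b in zip(h_hat, buf)]
        e_out.append(e)
    return e_out, h_hat

# Toy setup (assumed): white-noise far end, short echo path, no near-end speech.
rng = random.Random(1)
h_true = [0.0, 0.5, -0.3, 0.1]
x = [rng.gauss(0.0, 1.0) for _ in range(4000)]
d = [sum(h_true[k] * x[n - k] for k in range(len(h_true)) if n - k >= 0)
     for n in range(len(x))]
e, h_hat = nlms(x, d, num_taps=4)
```

In this noise-free toy case h_hat settles very close to h_true and the error decays toward zero. With real speech and a 3200-tap filter, convergence takes far longer, and the list-based buffer here would be replaced by a ring buffer or a vectorized implementation.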
Double-talk detection (DTD)
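When far-end and near-end talkers speak simultaneously, the near-end speech acts as a strong disturbance in the error signal, and NLMS would diverge and 'unlearn' the room model. A DTD freezes adaptation (or sharply reduces the step size) during double-talk. A common method is to compute the normalized cross-correlation between the mic signal and the reference, or the echo estimate derived from it: a value near 1 means the mic contains mostly echo (no double-talk), while a lower value indicates that near-end speech is also present.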
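The correlation test can be sketched as follows. To keep it short, this version correlates the mic frame with the echo estimate (the reference after filtering by the current ĥ) rather than with the raw reference, and the 0.9 threshold is an arbitrary illustrative value. Both are assumptions, not a standard.

```python
import math
import random

def normalized_xcorr(a, b, eps=1e-12):
    """Zero-lag normalized cross-correlation of two equal-length frames."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b)) + eps
    return num / den

def is_double_talk(echo_est, mic, threshold=0.9):
    """Flag double-talk when the mic is no longer well explained by the echo."""
    return abs(normalized_xcorr(echo_est, mic)) < threshold

# Toy frames (assumed): noise stands in for the echo and for near-end speech.
rng = random.Random(2)
echo = [rng.gauss(0.0, 1.0) for _ in range(512)]
near = [rng.gauss(0.0, 1.0) for _ in range(512)]
single_talk_mic = list(echo)                          # echo only
double_talk_mic = [e + s for e, s in zip(echo, near)]  # echo + near-end speech

freeze_single = is_double_talk(echo, single_talk_mic)
freeze_double = is_double_talk(echo, double_talk_mic)
```

An adaptive filter would skip its coefficient update for any frame where the detector returns True. Note that a sudden echo path change also lowers the correlation, so practical detectors combine this statistic with other cues.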
Frequency-domain implementation
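A time-domain filter with N taps costs O(N) multiply-adds per sample. A frequency-domain implementation (partitioned block convolution) processes samples in blocks with the FFT, which brings the amortized cost down to the order of O(log N) per sample. WebRTC's AEC3 uses a frequency-domain filter split into multiple delay partitions, which makes long reverberation tails affordable while keeping the block size, and hence the added latency, small.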
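To show where the saving comes from, here is a sketch of overlap-save block filtering, the building block that partitioned schemes repeat once per partition. It uses a single partition and a fixed (non-adaptive) filter, and the small recursive FFT is included only to keep the example free of dependencies. It is not how AEC3 is implemented.

```python
import cmath

def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. Inverse is unscaled."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def overlap_save(x, h, block):
    """Filter x with FIR h using FFTs of size 2*block (block a power of two)."""
    if len(h) > block + 1:
        raise ValueError("filter too long for one partition")
    size = 2 * block
    H = fft(list(h) + [0.0] * (size - len(h)))   # filter spectrum, computed once
    prev = [0.0] * block
    y = []
    for start in range(0, len(x), block):
        cur = list(x[start:start + block])
        cur += [0.0] * (block - len(cur))        # zero-pad the final block
        X = fft(prev + cur)
        Y = fft([a * b for a, b in zip(X, H)], inverse=True)
        # The first half is corrupted by circular wrap-around; keep the second.
        y.extend(v.real / size for v in Y[block:])
        prev = cur
    return y[:len(x)]

def direct_fir(x, h):
    """Reference O(N)-per-sample time-domain filter."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

x_demo = [float((7 * i) % 5 - 2) for i in range(20)]
h_demo = [0.5, -0.25, 0.125, 0.0625]
y_fast = overlap_save(x_demo, h_demo, block=4)
y_ref = direct_fir(x_demo, h_demo)
```

Each block of B samples costs two FFTs of size 2B plus 2B complex multiplies, so the per-sample cost grows with log B instead of with the filter length. Both paths produce the same output up to rounding error.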
Residual echo + AES
Linear AEC removes roughly 30-40 dB of echo. The remaining residual (mainly nonlinear distortion from cheap speakers, plus filter misadjustment) goes through an Acoustic Echo Suppressor (AES), which applies per-band gain reduction in the frequency domain while the far end is active. Together they reach about 50-60 dB of echo reduction, enough that the far end typically no longer perceives any echo.
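One simple way to compute the suppressor's per-band gain is a Wiener-style rule, sketched below. It assumes an estimate of the residual echo power per band is already available (obtaining that estimate is the hard part and is not shown), and the function name and the 0.1 gain floor are illustrative choices rather than any product's actual tuning.

```python
import math

def suppression_gains(mic_power, residual_echo_power, min_gain=0.1):
    """Per-band amplitude gain: attenuate bands where residual echo dominates."""
    gains = []
    for p_mic, p_echo in zip(mic_power, residual_echo_power):
        if p_mic <= 0.0:
            gains.append(min_gain)            # nothing useful in this band
            continue
        g = 1.0 - p_echo / p_mic              # fraction of power that is not echo
        gains.append(max(min_gain, min(1.0, g)))
    return gains

def to_db(amplitude_gain):
    """Convert an amplitude gain to decibels."""
    return 20.0 * math.log10(amplitude_gain)

# Three toy bands (assumed values): no echo, half echo, all echo.
band_gains = suppression_gains([1.0, 1.0, 1.0], [0.0, 0.5, 1.0])
floor_db = to_db(0.1)
```

A floor of 0.1 caps the extra attenuation at 20 dB per band, which stacked on 30-40 dB from the linear stage lands in the 50-60 dB range quoted above. The floor also limits how much near-end speech is damaged when a band is misjudged.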