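Acoustic Echo Cancellation (AEC) removes the far-end talker's voice, which is played through your loudspeaker and picked up again by your microphone, from the microphone signal before it is sent back to the far end. Without it, the far-end talker hears a delayed copy of their own voice on every speakerphone call, and when both ends use speakerphones the loop can build into howling feedback. Modern AEC uses adaptive filters that learn the acoustic path from speaker to mic and subtract the estimated echo in real time.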


The signal model

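The far-end signal x(n) plays through the loudspeaker, passes through the room acoustics, and reaches the mic delayed and reverberated. The microphone signal is d(n) = h(n)*x(n) + s(n) + v(n), where * denotes convolution, h(n) is the unknown impulse response of the speaker-room-mic path, s(n) is the near-end talker's speech, and v(n) is background noise. AEC forms an estimate ĥ(n) of h(n), computes the echo estimate ŷ(n) = ĥ(n)*x(n), and sends the difference e(n) = d(n) - ŷ(n) to the far end.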
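To make the model concrete, here is a minimal sketch that builds a microphone signal from a far-end signal and a toy impulse response. The 3-tap "room" and the helper names `convolve` and `mic_signal` are illustrative assumptions, and `v` stands for the noise term; a real room response runs to thousands of taps.

```python
def convolve(h, x):
    """Causal FIR convolution y(n) = sum_k h(k) * x(n - k), same length as x."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(min(len(h), n + 1)):
            acc += h[k] * x[n - k]
        y.append(acc)
    return y

def mic_signal(x, h, s, v):
    """Microphone signal: echo (h convolved with x) + near-end speech s + noise v."""
    echo = convolve(h, x)
    return [e + sn + vn for e, sn, vn in zip(echo, s, v)]

# Toy setup (assumed values): one sample of delay, then two decaying reflections.
h = [0.0, 0.6, 0.3]
x = [1.0, 0.0, 0.0, 0.0, 0.0]   # far end sends a single impulse
s = [0.0] * 5                    # near-end talker is silent
v = [0.0] * 5                    # no background noise
d = mic_signal(x, h, s, v)       # the mic hears the impulse response itself
```

With an impulse as the far-end signal and nothing else in the room, the mic signal is just the impulse response, which is exactly what the canceller has to learn.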

NLMS adaptive filter

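Normalized Least Mean Squares (NLMS) updates the filter coefficients ĥ(n) in proportion to the error times the input, divided by the energy of the recent input: ĥ ← ĥ + μ·e(n)·x / (‖x‖² + ε), where x is the vector of the last N far-end samples and ε guards against division by zero. The normalization keeps the effective step size independent of signal level. Complexity is low (O(N) per sample) and convergence is fast on white-noise-like input, though noticeably slower on speech. μ = 0.2 is a typical starting point; the update is stable for 0 < μ < 2. The filter length should cover the expected echo tail (e.g., 200 ms at 16 kHz = 0.2 × 16000 = 3200 taps).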
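The update can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (white-noise far end, a 4-tap toy room, no near-end speech or noise), not production code; the function name `nlms` and the regularization constant `eps` are choices made for this example.

```python
import random

def nlms(x, d, num_taps, mu=0.2, eps=1e-8):
    """Run NLMS over reference x and mic signal d.

    Returns (h_hat, e): the final filter estimate and the error signal,
    which is also the echo-cancelled output.
    """
    h_hat = [0.0] * num_taps
    buf = [0.0] * num_taps              # buf[k] holds x(n - k)
    e = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]           # shift in the newest reference sample
        y_hat = sum(hk * xk for hk, xk in zip(h_hat, buf))   # echo estimate
        err = dn - y_hat                                     # residual after subtraction
        energy = sum(xk * xk for xk in buf) + eps            # normalization term
        step = mu * err / energy
        h_hat = [hk + step * xk for hk, xk in zip(h_hat, buf)]
        e.append(err)
    return h_hat, e

# Toy check: the mic signal is the far-end noise passed through a known 4-tap room.
random.seed(0)
h_true = [0.0, 0.6, 0.3, -0.1]
x = [random.gauss(0.0, 1.0) for _ in range(2000)]
d = [sum(h_true[k] * x[n - k] for k in range(len(h_true)) if n - k >= 0)
     for n in range(len(x))]
h_hat, e = nlms(x, d, num_taps=4)
```

In this noise-free setup `h_hat` ends up very close to `h_true` and the error decays toward zero. With real speech, thousands of taps, and background noise, convergence is slower and never exact.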


Double-talk detection (DTD)

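When the far-end and near-end talkers speak at the same time, the near-end speech shows up in the error signal. NLMS cannot tell it apart from uncancelled echo, so the update pulls the filter away from the true room response and 'unlearns' the room model. A double-talk detector freezes (or sharply slows) adaptation during double-talk. A common method compares the normalized cross-correlation of the mic and reference signals: high correlation means the mic is mostly echo (no double-talk), while low correlation during far-end activity means near-end speech is present.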
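A simplified version of the correlation test is sketched below. It is an assumption-laden variant rather than a specific published detector: it correlates the filter's echo estimate (standing in for the reference) with the mic signal over one frame, it assumes the filter has already roughly converged, and the 0.8 threshold is an arbitrary illustrative value.

```python
import random

def dtd_statistic(y_hat, d, eps=1e-12):
    """Fraction of mic energy explained by the echo estimate over one frame.

    Close to 1 when the mic contains only echo, lower when near-end
    speech adds energy that the echo estimate cannot account for.
    """
    num = sum(a * b for a, b in zip(y_hat, d))
    den = sum(b * b for b in d) + eps
    return num / den

def is_double_talk(y_hat, d, threshold=0.8):
    """Declare double-talk when the statistic falls below the (assumed) threshold."""
    return dtd_statistic(y_hat, d) < threshold

# Synthetic frames: noise stands in for the echo and for near-end speech.
random.seed(1)
echo = [random.gauss(0.0, 1.0) for _ in range(1000)]
near = [random.gauss(0.0, 1.0) for _ in range(1000)]

far_end_only = is_double_talk(echo, echo)                                  # mic = echo
both_talking = is_double_talk(echo, [a + b for a, b in zip(echo, near)])   # mic = echo + speech
```

In use, the adaptive filter would skip or scale down its coefficient update for any frame where `is_double_talk` returns true, and the test would only be evaluated while the far end is active.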

Frequency-domain implementation

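A time-domain filter with N taps costs O(N) multiply-adds per sample. A frequency-domain implementation processes a block of samples at a time with FFTs, which brings the cost down to roughly O(log N) per sample. The catch is latency: a single block as long as the filter delays the output by a full block. Partitioned block convolution splits the filter into several shorter partitions, each applied to a progressively delayed input block, so latency stays at one short block while the filter still covers a long tail. WebRTC's AEC3 uses a frequency-domain filter with multiple delay partitions, which is what makes long reverberation tails practical.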
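The block-processing idea can be illustrated with overlap-save fast convolution for a fixed filter. This sketch uses a single partition and a small hand-written radix-2 FFT to stay dependency-free; it shows only the filtering half of a frequency-domain canceller and is not how AEC3 is written. The names `fft`, `ifft`, and `overlap_save` are local to this example.

```python
import cmath
import random

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(a):
    """Inverse FFT with 1/n scaling."""
    n = len(a)
    return [v / n for v in fft(a, invert=True)]

def overlap_save(x, h, block):
    """Filter x with h one block at a time, using FFTs of size 2 * block."""
    assert len(h) <= block and (block & (block - 1)) == 0
    H = fft(list(h) + [0.0] * (2 * block - len(h)))   # filter spectrum, computed once
    prev = [0.0] * block
    y = []
    for start in range(0, len(x), block):
        cur = list(x[start:start + block])
        cur += [0.0] * (block - len(cur))              # zero-pad the final block
        X = fft(prev + cur)                            # previous + current block
        yb = ifft([a * b for a, b in zip(X, H)])       # circular convolution
        y.extend(v.real for v in yb[block:])           # keep only the alias-free half
        prev = cur
    return y[:len(x)]

# Compare against direct O(N)-per-sample convolution on random data.
random.seed(2)
h = [random.uniform(-1.0, 1.0) for _ in range(8)]
x = [random.uniform(-1.0, 1.0) for _ in range(50)]
y_fast = overlap_save(x, h, block=8)
y_direct = [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]
```

Both paths produce the same output up to floating-point rounding. The saving comes from paying for two FFTs per block instead of N multiply-adds per sample; a partitioned filter repeats the spectral multiply once per partition and sums the results before the inverse FFT.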

Residual echo + AES

Linear AEC typically removes about 30-40 dB of echo. The remaining residual, mainly nonlinear distortion from cheap speakers that a linear filter cannot model, goes through an Acoustic Echo Suppressor (AES), which applies frequency-domain gain reduction while the far end is active. Together they reach roughly 50-60 dB of echo reduction, at which point the echo is generally inaudible.
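The decibel arithmetic and the gain-reduction step can be sketched as follows. Echo reduction is measured here as the ratio of mic power to output power in dB (often called ERLE), and the suppressor uses a Wiener-style per-band gain with a floor. The gain rule, the 0.1 floor, and the function names are illustrative assumptions, not the rule any particular suppressor uses.

```python
import math

def erle_db(mic, out, eps=1e-20):
    """Echo reduction in dB: mic power over output power."""
    p_mic = sum(v * v for v in mic) / len(mic)
    p_out = sum(v * v for v in out) / len(out)
    return 10.0 * math.log10((p_mic + eps) / (p_out + eps))

def suppression_gains(residual_echo_power, error_power, floor=0.1):
    """Per-band gain in [floor, 1]: attenuate bands dominated by residual echo."""
    gains = []
    for r, e in zip(residual_echo_power, error_power):
        g = 1.0 - r / e if e > 0 else floor
        gains.append(max(floor, min(1.0, g)))
    return gains

# Assumed numbers: the linear stage leaves 1% of the echo amplitude (40 dB down),
# and the suppressor then applies its floor gain of 0.1 (a further 20 dB).
mic = [1.0] * 100
after_linear = [0.01 * v for v in mic]
after_suppressor = [0.1 * v for v in after_linear]

linear_db = erle_db(mic, after_linear)
total_db = erle_db(mic, after_suppressor)

# Three bands: no residual echo, half residual echo, all residual echo.
gains = suppression_gains([0.0, 0.5, 2.0], [1.0, 1.0, 1.0])
```

Attenuations multiply in amplitude and therefore add in dB, which is how a 30-40 dB linear stage plus a suppressor lands in the 50-60 dB range. The floor limits how hard any band is attenuated, trading residual echo against audible damage to near-end speech.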

A practical canceller combines all four pieces: NLMS adaptation, double-talk detection, a frequency-domain implementation, and an AES residual suppressor. WebRTC AEC3 is the open-source reference; don't roll your own from scratch.