Psychoacoustic Echo Cancellation

Psychoacoustic echo cancellation can improve both the computational efficiency and the subjective output quality of acoustic echo cancellation (AEC) by taking into account how humans perceive sound. Acoustic echo cancellers that account for the perceptual abilities of human listeners seem to be few and far between. This is partly because system designers lack data on these abilities for their specific problem domain. Perhaps it is merely that engineers and psychologists rarely talk to each other. Nevertheless, current perceptually motivated acoustic echo cancellation algorithms have improved on the methods used in conventional acoustic echo cancellation systems.

Filter Bank Optimization

Conventional sub-band acoustic echo cancellation typically uses a linear or logarithmic distribution of filter banks. In contrast, the psychoacoustic approach models the filter bank distribution on the perceptual sensitivities of the human auditory system. By examining the Bark scale [1], it can be seen that the perceptually important frequency bands are neither linearly nor logarithmically distributed but instead follow a sigmoidal distribution. This distribution is a result of the physiology of our inner ears, and it seems to be matched to the shape of the power spectral density of a typical speech signal. For both males and females, most of the power in human speech is concentrated in the lower frequency sub-bands, with less power present as frequency increases.
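As a rough illustration of how unevenly critical bands are spaced, the sketch below converts frequency to Bark and locates the band edges up to 4 kHz. The conversion formula is the widely used Zwicker and Terhardt analytic approximation, which is an assumption here and not something taken from this article; the function names and the 4 kHz limit are likewise illustrative.

```python
import math


def hz_to_bark(f_hz):
    """Critical-band rate in Bark (Zwicker-Terhardt approximation, assumed here)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def bark_band_edges(f_max_hz, step_hz=1.0):
    """Frequencies (Hz) where the Bark value first reaches each integer, up to f_max_hz."""
    edges = [0.0]
    next_bark = 1
    f = 0.0
    while f <= f_max_hz:
        if hz_to_bark(f) >= next_bark:
            edges.append(f)
            next_bark += 1
        f += step_hz
    return edges


if __name__ == "__main__":
    edges = bark_band_edges(4000.0)
    # Width of each critical band in Hz: narrow at low frequencies, wide at high ones.
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    for band, (lo, width) in enumerate(zip(edges, widths), start=1):
        print(f"band {band:2d}: starts at {lo:6.0f} Hz, width {width:5.0f} Hz")
```

Sub-band edges chosen this way give the lower frequencies many narrow bands and the upper frequencies a few wide ones, which is the behavior the paragraph above describes.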

Going further, optimal tap profiles can be created by considering this unique distribution of power. With such a spectrum in mind, it seems logical that an optimal tap profile may be achieved by concentrating the computational power of your adaptive echo cancellation filter within the lower frequency sub-bands, instead of uniformly distributing the taps across the filter banks. Unfortunately for the engineer, the human auditory system is not that simple.

For instance, when considering hearing sensitivity, optimal tap profiles may be obtained by concentrating taps in the mid-range frequency bands. To take advantage of both speech power and hearing sensitivity, it might be advantageous to create a weighting profile that has a global maximum in the lower sub-bands but retains a local maximum in the mid-range sub-bands. By considering the canonical power spectral densities of the most frequently occurring speech sounds in a given language, this profile can be further modified for a perceptually optimal solution.
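A minimal sketch of such a weighting profile follows. The shapes are assumptions chosen only for illustration: speech power is modelled as an exponential decay over the sub-band index and hearing sensitivity as a Gaussian bump around a mid-range band. The function name and all parameter values are hypothetical, and a real design would derive the weights from measured speech spectra and hearing thresholds.

```python
import math


def tap_profile(num_bands, total_taps, mid_band, speech_decay=0.25,
                mid_gain=0.6, mid_width=1.5, min_taps=1):
    """Distribute total_taps over sub-bands using an illustrative two-peak weighting."""
    spare = total_taps - min_taps * num_bands
    if spare < 0:
        raise ValueError("total_taps is too small for the requested min_taps")

    weights = []
    for k in range(num_bands):
        # Assumed speech-power term: strongest in the lowest band, decaying upward.
        speech = math.exp(-speech_decay * k)
        # Assumed hearing-sensitivity term: a bump centred on mid_band.
        sensitivity = mid_gain * math.exp(-0.5 * ((k - mid_band) / mid_width) ** 2)
        weights.append(speech + sensitivity)

    total_weight = sum(weights)
    ideal = [spare * w / total_weight for w in weights]
    taps = [min_taps + int(x) for x in ideal]

    # Hand out taps lost to rounding, largest fractional remainder first.
    leftover = total_taps - sum(taps)
    order = sorted(range(num_bands), key=lambda k: ideal[k] - int(ideal[k]), reverse=True)
    for k in order[:leftover]:
        taps[k] += 1
    return taps


if __name__ == "__main__":
    print(tap_profile(num_bands=16, total_taps=512, mid_band=9))
```

With these example parameters the lowest band receives the most taps and a second, smaller peak appears around the chosen mid-range band, while the total tap budget is preserved.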

Post-Filtering via Perceptual Masking

Conventional methods may attempt to reduce the residual echo via a non-linear post-filter. Such a processor may work by attenuating residual echo that surpasses a pre-defined threshold. This conventional approach generally assumes that double-talk is not present, so the post-filtering becomes state dependent. When double-talk is present, however, the conventional approach is simply to freeze the algorithm, thereby implicitly assuming that the near-end speech will mask the residual echo. Problems arise when this assumption fails, but fortunately we can use the very same properties of auditory masking to our advantage. For instance, if we have access, a priori or otherwise, to an accurate representation of the near-end speech signal, we may be able to weight the residual echo such that its spectrum is masked by the spectrum of the near-end speech signal. In this way, the assumption of near-end masking will not fail.
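The weighting idea can be sketched per sub-band as follows. This is a simplified illustration, not a full psychoacoustic model: it assumes separate power estimates for the residual echo and the near-end speech are available, and it uses a fixed offset below the near-end power as a crude stand-in for a real masking threshold. The function name and the -10 dB offset are hypothetical.

```python
import math


def masking_gains(residual_power, near_end_power, mask_offset_db=-10.0):
    """Per-band amplitude gains that push residual echo under an assumed masking threshold."""
    offset = 10.0 ** (mask_offset_db / 10.0)
    gains = []
    for r, s in zip(residual_power, near_end_power):
        threshold = s * offset  # assumed masking threshold for this band
        if r <= threshold:
            gains.append(1.0)  # already masked, leave this band untouched
        else:
            # Amplitude gain g such that g**2 * r equals the threshold.
            gains.append(math.sqrt(threshold / r))
    return gains


if __name__ == "__main__":
    residual = [1.0, 0.01, 0.5]
    near_end = [1.0, 1.0, 2.0]
    print(masking_gains(residual, near_end))
```

Bands where the residual echo already sits below the threshold are left alone, so attenuation is applied only where the echo would otherwise be audible.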

If we do not have access to such information, perhaps we can turn the non-linear processor into a comfort noise generator. By comparing the spectrum of the residual echo with our noise profile of choice, we can create a weighting function that ensures the perceptual masking of the residual echo. By using a pleasant-sounding noise signal, we can create a natural-sounding output. To learn more about perceptual masking, please see our article entitled Psychoacoustic Noise Suppression.

Perceptual Echo Control

Christof Faller and Jingdong Chen developed an approach in [2] that departs completely from traditional acoustic echo cancellation and is motivated by properties of human perception. Instead of estimating the actual echo signal, their algorithm estimates only the envelope (that is, the spectral power) of the echo signal. In combination with a perceptual weighting rule in a perceptually motivated sub-band decomposition, they are able to drastically reduce the precision needed in some frequency sub-bands. Using this estimate, they can suppress the echo in the microphone signal while still maintaining the full quality and loudness of the near-end talker.

This approach has significant advantages over traditional methods. For instance, the need for a non-linear processor can be completely eliminated due to the phase insensitivity and coarse resolution of the spectral envelope. In addition, the envelope is much easier to calculate than the signal itself, and it offers significant stability improvements by eliminating time jitter and improving high-frequency resolution.
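To make the idea of suppressing echo from a power envelope concrete, here is a generic spectral-subtraction style sketch. It is not the weighting rule from [2]: the echo envelope is assumed to be a per-band coupling factor times the far-end power, and the gain rule, oversubtraction factor, and gain floor are all illustrative assumptions.

```python
import math


def envelope_suppression_gains(mic_power, far_end_power, coupling,
                               oversubtraction=1.0, gain_floor=0.1):
    """Per-band amplitude gains from an estimated echo power envelope (illustrative rule)."""
    gains = []
    for y, x, c in zip(mic_power, far_end_power, coupling):
        echo_est = c * x  # assumed echo envelope: coupling factor times far-end power
        if y <= 0.0:
            gains.append(gain_floor)
            continue
        # Fraction of microphone power left after removing the estimated echo power.
        ratio = 1.0 - oversubtraction * echo_est / y
        gains.append(max(math.sqrt(max(ratio, 0.0)), gain_floor))
    return gains


if __name__ == "__main__":
    mic = [1.0, 1.0, 1.0]
    far = [0.0, 1.0, 4.0]
    coupling = [0.5, 0.5, 0.5]
    print(envelope_suppression_gains(mic, far, coupling))
```

Only power values enter the computation, so the sketch needs no phase information; the gains would be applied to the microphone sub-band signals, whose phase is left unchanged.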

Simultaneous Echo and Noise Suppression

Gustafsson et al. [3] used a psychoacoustically motivated weighting rule for an acoustic echo cancellation post-filter to suppress residual echo signals as well as noise. This post-filter weighting was designed as a linear combination of noise and echo attenuation factors that optimally preserved the background noise characteristics, but at a reduced level, while reducing the echo to such an extent that it was masked by this reduced background noise and the near-end speech.

In performing unit and system tests, they found robust echo cancellation, while subjective listening tests clearly showed a strong preference for this perceptually motivated approach over the traditional approach. In addition, they were able to reduce the demands on the adaptive filter leading to computational savings, faster convergence, and a convergence that is more stable in the face of the highly non-stationary noise typically present in mobile telephony applications.


[1] E. Zwicker, “Subdivision of the Audible Frequency Range into Critical Bands”, The Journal of the Acoustical Society of America, vol. 33, no. 2, February 1961, p. 248.

[2] C. Faller, J. Chen, “Suppressing Acoustic Echo in a Spectral Envelope Space”, IEEE Transactions on Speech and Audio Processing, 2005, pp. 1-15.

[3] S. Gustafsson et al., “A Psychoacoustic Approach to Combined Acoustic Echo Cancellation and Noise Reduction”, IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, July 2002, pp. 245-256.

More Information

VOCAL Technologies, Ltd.
520 Lee Entrance, Suite 202
Amherst New York 14228
Phone: +1-716-688-4675
Fax: +1-716-639-0713