
Speech enhancement is a requirement for telecommunication systems that operate in noisy environments. In conference settings where users sit far from the microphone, the perceptual quality of the communication and the effectiveness of the conference are significantly lower than in a handset application. Noise reduction algorithms, such as spectral subtraction, are a popular choice for speech enhancement. These algorithms generally attempt to estimate the noise spectrum without any a priori knowledge and suppress it in the noisy speech signal to improve the signal-to-noise ratio (SNR). They are intended as a universal solution for all types of noise environments. Because noise characteristics vary so widely, many speech enhancement routines fail to significantly improve the overall speech quality and quite often introduce distortions into the voice portions of the signal. These problems can be avoided if speech enhancement routines are modified to include some a priori knowledge.
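As an illustration of the blind approach described above, the following is a minimal sketch of magnitude spectral subtraction on a single frame. It is not a production implementation: the naive DFT, the `floor` parameter, the helper names, and the synthetic tone-plus-tone test signal in `demo` are all assumptions made for the example, and the noise estimate is taken from a frame assumed to contain noise only.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform; adequate for short illustrative frames."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of the reconstructed frame."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def estimate_noise_magnitude(noise_frames):
    """Average magnitude spectrum over frames assumed to contain only noise."""
    spectra = [[abs(c) for c in dft(f)] for f in noise_frames]
    bins = len(spectra[0])
    return [sum(s[k] for s in spectra) / len(spectra) for k in range(bins)]


def spectral_subtract(frame, noise_mag, floor=0.02):
    """Subtract the noise magnitude in each bin and keep the noisy phase.

    The spectral floor (an assumed tuning parameter) prevents negative
    magnitudes after subtraction.
    """
    cleaned = []
    for c, n_mag in zip(dft(frame), noise_mag):
        mag = abs(c)
        new_mag = max(mag - n_mag, floor * mag)
        cleaned.append(cmath.rect(new_mag, cmath.phase(c)))
    return idft(cleaned)


def demo(n=32):
    """Synthetic example: a 'speech' tone at bin 3 corrupted by a tone at bin 10."""
    speech = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
    noise = [0.5 * math.sin(2 * math.pi * 10 * t / n) for t in range(n)]
    noisy = [s + v for s, v in zip(speech, noise)]
    noise_mag = estimate_noise_magnitude([noise])
    enhanced = spectral_subtract(noisy, noise_mag)
    return speech, noisy, enhanced
```

The synthetic noise here is perfectly stationary, which is the favorable case for this technique. With rapidly varying noise the averaged estimate no longer matches the current frame, which is one way the distortions mentioned above can arise.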
There are two methods of using a priori knowledge to improve speech enhancement routines: the classification of noise sources and environments, and speech modeling. The classification of noise sources exploits the unique temporal and spectral features that different types of noise present. For example, engine noise from jets and automobiles has a narrowband spectrum, MRI machines and trains produce noise with an amplitude-modulated shape, and a cocktail party environment has a rapidly varying wideband spectral shape. Each of these sources has signature features that present different challenges to standard noise reduction algorithms. A properly identified listening environment can guide the shape and adaptation of the noise reduction filter. To accomplish this, a speech enhancement routine can contain a maximum a posteriori probability (MAP) decision for each acoustic environment. This decision can be aided by a Gaussian mixture model trained on selected temporal and spectral characteristics.
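The MAP decision described above can be sketched as follows, assuming one diagonal-covariance Gaussian mixture per environment. Everything numerical here is hypothetical: the two features (a normalized bandwidth and an amplitude-modulation depth), the mixture parameters, and the priors are hand-set for illustration and are not trained values from any data set.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_gmm_likelihood(x, components):
    """Log likelihood under a mixture; components are (weight, mean, var) tuples."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    peak = max(logs)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


def map_environment(x, models, priors):
    """Return the environment maximizing log prior + log likelihood, plus all scores."""
    scores = {env: math.log(priors[env]) + log_gmm_likelihood(x, comps)
              for env, comps in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical models over features [normalized bandwidth, modulation depth].
MODELS = {
    "engine": [(0.6, [0.10, 0.10], [0.01, 0.01]),
               (0.4, [0.20, 0.15], [0.01, 0.01])],
    "mri_or_train": [(0.5, [0.40, 0.80], [0.01, 0.01]),
                     (0.5, [0.50, 0.90], [0.01, 0.01])],
    "cocktail_party": [(0.5, [0.80, 0.40], [0.02, 0.02]),
                       (0.5, [0.90, 0.50], [0.02, 0.02])],
}
# Hypothetical prior probabilities of each environment.
PRIORS = {"engine": 0.4, "mri_or_train": 0.2, "cocktail_party": 0.4}
```

A call such as `map_environment([0.12, 0.08], MODELS, PRIORS)` returns the selected environment label together with the unnormalized log posterior scores, which a noise reduction routine could use to choose its filter settings.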
The second method of using a priori knowledge relies on speech and language modeling. Based on the predictability of human speech and language, spectral components of the noisy signal can be removed or preserved. The transition between phonetic units, or phones, can be predicted from the weighted probabilities of the previous phones. In addition, speaker-dependent models can be generated by training on the key phones of a particular speaker. Most commonly, hidden Markov models (HMMs) are used to predict the temporal and spectral shape of the signal and, subsequently, to refine the noise reduction filter.
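The phone prediction described above can be sketched as one step of the HMM forward recursion: the belief over the previous phone is propagated through a transition table, then weighted by how well each phone explains the current noisy frame. The three phones, the transition probabilities, and the observation likelihoods below are hypothetical values chosen for illustration; a real system would estimate them from training data.

```python
def predict_next(belief, transitions):
    """One-step prediction: P(next = j) = sum_i belief[i] * transitions[i][j]."""
    states = list(belief)
    return {j: sum(belief[i] * transitions[i][j] for i in states) for j in states}


def update(predicted, likelihoods):
    """Weight the prediction by the observation likelihoods and normalize."""
    unnormalized = {s: predicted[s] * likelihoods[s] for s in predicted}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical phone transition probabilities (each row sums to 1).
TRANSITIONS = {
    "s": {"s": 0.60, "ih": 0.35, "t": 0.05},
    "ih": {"s": 0.05, "ih": 0.60, "t": 0.35},
    "t": {"s": 0.10, "ih": 0.10, "t": 0.80},
}

# The previous frame was confidently identified as the phone "s".
BELIEF = {"s": 1.0, "ih": 0.0, "t": 0.0}

# Hypothetical likelihoods of the current noisy frame under each phone model.
LIKELIHOODS = {"s": 0.2, "ih": 0.7, "t": 0.1}

PREDICTED = predict_next(BELIEF, TRANSITIONS)
POSTERIOR = update(PREDICTED, LIKELIHOODS)
MOST_LIKELY_PHONE = max(POSTERIOR, key=POSTERIOR.get)
```

In an enhancement system, the posterior over phones could select or blend per-phone spectral templates, so that components consistent with the expected phone are preserved while others are attenuated. That coupling is not shown here.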