Speech modeling and classification of noise sources and environments are two methods for improving speech enhancement. Speech enhancement is essential in telecommunication systems operating in noisy environments, especially in conference settings where users sit far from the microphone and both the perceptual quality of the communication and the effectiveness of the conference suffer markedly compared with a handset application.
Noise reduction algorithms, such as spectral subtraction, are a popular choice for speech enhancement. These algorithms generally estimate the noise spectrum without any a priori knowledge and suppress it from the noisy speech signal to improve the signal-to-noise ratio (SNR). As such, they attempt to be a ubiquitous solution for all types of noise environments. However, given the wide variety of noise characteristics, many speech enhancement routines fail to significantly improve overall speech quality and quite often introduce distortions and other artifacts into the voice portions of the signal. These problems can be avoided if speech enhancement is modified to include some a priori knowledge.
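The blind estimate-and-suppress approach can be sketched as a basic magnitude spectral subtraction. This is a minimal illustration, not a production algorithm: it assumes the first few frames are speech-free for the noise estimate, and omits the windowing, overlap-add, and smoothing a real system would add to control musical noise.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, noise_frames=10, beta=0.02):
    """Basic magnitude spectral subtraction (illustrative sketch)."""
    n_frames = len(noisy) // frame_len
    frames = noisy[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    # Estimate the noise magnitude spectrum from the first (assumed
    # speech-free) frames -- no a priori knowledge of the noise type.
    noise_mag = np.abs(spectra[:noise_frames]).mean(axis=0)
    mag = np.abs(spectra)
    phase = np.angle(spectra)
    # Subtract the noise estimate, flooring each bin at a small fraction
    # (beta) of the noise level to limit musical-noise artifacts.
    clean_mag = np.maximum(mag - noise_mag, beta * noise_mag)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    return clean.reshape(-1)
```

Because the noise estimate is taken blindly from the signal itself, the same routine is applied regardless of whether the interference is an engine hum or babble, which is exactly the limitation the classification approach below addresses.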
The classification of noise sources uses a priori knowledge of the distinct temporal and spectral features that different types of noise present. For example, engine noise from jets and automobiles has a narrowband spectrum; MRI machines and trains produce an amplitude-modulated shape; and a cocktail party has a rapidly varying wideband spectral shape. Each of these sources has signature features that pose different challenges to standard noise reduction algorithms. Properly identifying the listening environment can guide the shape and adaptation of the noise reduction filter. To accomplish this, a speech enhancement routine can apply a maximum a posteriori probability (MAP) decision to select processing tailored for each acoustic environment, aided by a Gaussian mixture model trained on selected temporal and spectral characteristics.
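A minimal sketch of the GMM-based environment decision might look as follows. The features (per-frame spectral centroid and log energy) and the one-GMM-per-environment setup are illustrative assumptions, not a prescribed front end; with equal environment priors the MAP decision reduces to picking the model with the highest likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def frame_features(x, frame_len=256):
    """Per-frame spectral centroid and log energy -- two simple
    temporal/spectral features chosen here for illustration."""
    n = len(x) // frame_len
    frames = x[:n * frame_len].reshape(n, frame_len)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    bins = np.arange(mag.shape[1])
    centroid = (mag * bins).sum(axis=1) / (mag.sum(axis=1) + 1e-12)
    log_energy = np.log((frames ** 2).sum(axis=1) + 1e-12)
    return np.column_stack([centroid, log_energy])

def train_models(labelled_clips, n_components=2):
    """Fit one GMM per labelled acoustic environment."""
    return {env: GaussianMixture(n_components, random_state=0)
                 .fit(frame_features(clip))
            for env, clip in labelled_clips.items()}

def classify(models, clip):
    """MAP decision with equal priors: highest average log-likelihood."""
    feats = frame_features(clip)
    return max(models, key=lambda env: models[env].score(feats))
```

The selected environment label can then index a bank of noise reduction filters, each shaped and adapted for that noise signature.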
Model-based speech enhancement uses a priori knowledge of speech and language modeling. It relies on the predictability of human speech and language, so that spectral components of the noisy signal can be removed or preserved (see Maintaining Harmonic Structures for Speech Enhancement). Transitions between phones can be predicted from the weighted probabilities of the preceding phones. Speaker-dependent models can also be generated by training on the key phones of a particular speaker. Most commonly, hidden Markov models (HMMs) are used to predict the temporal and spectral shape of the signal and, subsequently, to steer the noise reduction filter.
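The phone-prediction step can be illustrated with a standard Viterbi decode over an HMM. The transition and emission probabilities here are toy values; in practice the emission log-likelihoods would come from trained spectral models of each phone, which are assumed to exist elsewhere.

```python
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """Most likely phone sequence given per-frame emission
    log-likelihoods, phone transition log-probabilities, and
    initial log-probabilities (standard HMM Viterbi decoding)."""
    n_frames, n_phones = log_emit.shape
    delta = log_init + log_emit[0]          # best score ending in each phone
    back = np.zeros((n_frames, n_phones), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + log_trans  # score of prev phone -> phone
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    # Backtrack from the best final phone.
    path = [int(delta.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

The decoded phone identity at each frame tells the enhancement stage which spectral components (e.g., expected harmonics) to preserve and which to attenuate.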