A Hidden Markov Model (HMM) approach to speech enhancement can reduce noise in scenarios beyond those handled by Spectral Subtraction, which assumes that the distractors are stationary signals.
The HMM approach reuses audio data for both speech and distractors in order to enhance polluted speech. For English speech, a typical choice is TIMIT, a database (or corpus) of phonemically and lexically transcribed clean speech from male and female speakers of American English. A specific implementation of an HMM-based voice enhancement system may use a small subset of TIMIT; for better performance, some implementations use larger subsets. For narrow-band HMM-based voice enhancement systems, the NTIMIT database (Network TIMIT) is typically used.
A noise database is also required for the HMM-based voice enhancement system. To cover a wider range of polluted-speech scenarios, the noise database is typically updated as the need arises.
A block diagram of one advanced implementation of an HMM-based voice enhancement system is shown in Figure 1. One of the blocks from Figure 1, NOISE MODEL ADAPTATION, is shown in greater detail in Figure 2.
The signal processing flow is as follows (cf. Ref.):
- The noisy speech is first preprocessed: the autocorrelation coefficients of each frame of the noisy speech signal are extracted and fed to the noise adaptation block.
- The non-speech intervals of the noisy speech are detected using a voice activity detection (VAD) algorithm. In one advanced version, the non-speech intervals of the noisy speech signal are detected using a Viterbi forward algorithm that runs several pre-selected types of noise models.
- The likelihood for each noise model is computed and the model associated with the highest likelihood is selected.
- Using the selected HMM parameters and the clean speech model, the pre-processed noisy speech signal is fed to the MMSE forward algorithm block that generates the weights of the Wiener filters.
- Concurrently, the Wiener filters for every combination of state and mixture pairs in the speech and noise models are computed. A single weighted filter is then constructed for each frame of noisy speech from the computed filter weights and the pre-trained Wiener filters.
- The noisy signal is filtered using the weighted filter. The output is the spectral magnitude of the enhanced speech signal. Using these magnitudes together with the noisy speech’s phase information (as in the block SYNTHESIS BY OVERLAP-ADD METHOD), an inverse FFT is performed to obtain the time-domain enhanced speech.
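The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the system shown in the figures: all function names are hypothetical, the HMM forward recursions that produce the model likelihoods and the filter weights are not implemented (both are taken as inputs), each Wiener filter is assumed to have the textbook form S / (S + N) built from a speech and a noise power spectrum, and a periodic Hann window with 50% overlap is assumed for analysis and synthesis.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def autocorr_coeffs(frame, order):
    """Biased autocorrelation estimates r[0..order] of a single frame."""
    n = len(frame)
    return np.array([np.dot(frame[: n - k], frame[k:]) / n
                     for k in range(order + 1)])

def select_noise_model(log_likelihoods):
    """Pick the noise model (dict key) with the highest likelihood."""
    return max(log_likelihoods, key=log_likelihoods.get)

def weighted_wiener_gain(speech_psds, noise_psds, weights):
    """Blend the per-pair Wiener filters into one gain per frequency bin.

    speech_psds: (S, F) power spectra, one per speech state/mixture pair
    noise_psds:  (N, F) power spectra, one per noise state/mixture pair
    weights:     (S, N) weights for the current frame (assumed to sum to 1)
    """
    s = speech_psds[:, None, :]
    n = noise_psds[None, :, :]
    filters = s / (s + n + 1e-12)  # assumed Wiener form, shape (S, N, F)
    return np.tensordot(weights, filters, axes=([0, 1], [0, 1]))

def enhance(noisy, gains, frame_len, hop):
    """Scale each frame's magnitude, reuse the noisy phase, overlap-add."""
    window = np.hanning(frame_len + 1)[:-1]  # periodic Hann window
    frames = frame_signal(noisy, frame_len, hop) * window
    spec = np.fft.rfft(frames, axis=1)
    mag = gains * np.abs(spec)                 # enhanced spectral magnitude
    phase = np.exp(1j * np.angle(spec))        # phase of the noisy speech
    enhanced = np.fft.irfft(mag * phase, n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i, f in enumerate(enhanced):
        out[i * hop : i * hop + frame_len] += f
    return out
```

Here `gains` holds one row per frame, each row being the output of `weighted_wiener_gain` for that frame's weights. With `hop = frame_len // 2`, the shifted Hann windows sum to one, so a gain of 1 in every bin returns the interior of the input unchanged, which is a convenient sanity check before plugging in real model outputs.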
Other advanced model-based methods for noisy speech enhancement include AR model-based Kalman filtering. That method will be the topic of a separate note.
VOCAL’s Voice Enhancement solutions include noise reduction software that has been tested in typical acoustic environments. These solutions can be modified to fit your custom specification and can be used in conjunction with speech-model-based solutions if required. Contact us to discuss your voice application.