Speech recognition is of great interest to both academia and industry. There are two main classes of speech recognition systems: connected speech recognition systems and command and control systems. Connected speech recognition systems work with continuous speech and try to transcribe what was said as exactly as possible for use in various applications. These systems can be expensive and hard to come by in an industrial setting.
Currently, most speech recognition systems actually deployed are for command and control applications. In this type of system, the user has a limited set of spoken commands, and the speech recognizer maps each one to a system action. Because the vocabulary is small and fixed, a command and control system is well suited to established, well-understood vector quantization techniques.
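To make the vector quantization idea concrete, here is a minimal sketch of one common way to use it for a small command set: each command gets its own codebook, and an utterance is assigned to the command whose codebook quantizes its feature frames with the least average distortion. The function names, the two-dimensional features, and the codebook values are all hypothetical and chosen only for illustration; a real system would train the codebooks from recorded examples.

```python
def distance_sq(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_distortion(frames, codebook):
    """Mean squared distance from each frame to its nearest codeword."""
    total = sum(min(distance_sq(f, c) for c in codebook) for f in frames)
    return total / len(frames)


def recognize_command(frames, codebooks):
    """Return the command whose codebook gives the lowest average distortion."""
    return min(codebooks, key=lambda name: average_distortion(frames, codebooks[name]))


# Hypothetical, hand-written codebooks (assumption: 2-D features for readability).
CODEBOOKS = {
    "start": [(0.0, 0.0), (1.0, 0.0)],
    "stop": [(5.0, 5.0), (6.0, 5.0)],
}

utterance = [(0.1, 0.1), (0.9, -0.1)]
print(recognize_command(utterance, CODEBOOKS))
```

With a limited command set, the cost of recognition is one nearest-codeword search per frame per command, which is part of why this approach fits command and control systems.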
Speech Recognition Design
The key to successful speech recognition is the type and quality of the feature vector components you use. Currently, the most commonly used features are the mel frequency cepstral coefficients (MFCCs). These have been shown to be fairly robust to white Gaussian noise and reverberation, and are computationally efficient. In practical settings, there is often colored and correlated interference, which severely degrades the utility of these coefficients. However, by using differential (delta) MFCCs, we can get some of the robustness back.
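As an illustration of the differential MFCCs mentioned above, the sketch below computes delta coefficients with the widely used regression formula over a window of neighboring frames. It assumes the MFCCs have already been computed and are passed in as a list of frames; the function name, the window size of 2, and the edge handling (repeating the first and last frame) are assumptions made for this example.

```python
def delta_features(frames, window=2):
    """Delta (differential) coefficients for a sequence of feature frames.

    frames: list of equal-length lists, one list of coefficients per frame.
    window: number of frames on each side used in the regression.
    Edge frames are handled by repeating the first/last frame (an assumption).
    """
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, num_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


# A coefficient that rises by 1 per frame has a delta of 1 away from the edges.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
print(delta_features(ramp)[2])
```

Because each delta is built from differences between frames, a constant offset added to every frame cancels out, which is one reason delta features can recover some robustness to slowly varying interference.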
Speech recognition systems perform better on a clean speech signal than on a noisy one. Knowing this, we can begin to design front end processors to clean up the speech signal for the speech recognition system. When we know how the recognizer operates, this is an easy task. However, when we don’t know exactly how it operates, we have to design a more robust front end, one that produces what one would call a natural spectrogram. That is, the speech comb and column are left largely intact.
Many speech enhancement systems are focused on audio quality. That is, they are most concerned with what a human would likely rate as pleasing to the ear. As any audio engineer knows, audio quality is not synonymous with a clean spectrogram, though it is correlated. Therefore, while many established methods do indeed produce what we perceive as quality output, the spectrogram is often left in a poor state for a speech recognition system.
In other words, a proper front end for a speech recognition system needs to focus on harmonic preservation, or at least include a module for harmonic reconstruction. There are many promising methods for doing just that, ad hoc and otherwise. One of the most promising is the STFT Phase Reconstruction method, which can be used concurrently with any magnitude-based speech enhancement system.
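As a rough illustration of the idea behind phase reconstruction for harmonics, the sketch below uses the basic sinusoidal relation that a stationary sinusoid at frequency f advances in phase by 2πf·hop/fs from one STFT frame to the next. This is only that one relation, not the full STFT Phase Reconstruction method; the function names are hypothetical, and the fundamental frequency is assumed to be known and constant over the frames.

```python
import math


def wrap_phase(phi):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (phi + math.pi) % (2 * math.pi) - math.pi


def propagate_harmonic_phase(initial_phase, f0_hz, harmonic, hop, sample_rate, num_frames):
    """Phase of one harmonic across frames, assuming a constant f0 (assumption).

    Each frame advances the phase by 2*pi*(harmonic*f0)*hop/sample_rate.
    """
    advance = 2 * math.pi * harmonic * f0_hz * hop / sample_rate
    phases = [wrap_phase(initial_phase)]
    for _ in range(num_frames - 1):
        phases.append(wrap_phase(phases[-1] + advance))
    return phases


# Hypothetical settings: 100 Hz fundamental, 8 kHz sampling, hop of 40 samples.
print(propagate_harmonic_phase(0.0, 100.0, 1, 40, 8000, 4))
```

Since only the phase is being predicted here, the magnitudes can come from any magnitude-based enhancement stage, which is what makes the two approaches combinable.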