Voice activity detectors (VADs) play a major role in telecommunications. There are various methods for determining the presence of voice in these applications:
- energy thresholds (non-adaptive and adaptive)
- waveform and spectrum analysis
- pitch and harmonic detection
- periodicity measures
- zero-crossing rates
- high order statistics of the LPC residuals
- statistical models
The simplest VAD schemes are based on an energy detector. If the energy of the signal rises a threshold amount above the noise floor, the increase in energy is assumed to be associated with voice. Since the noise floor in most applications is not known a priori and is time-varying, it has to be estimated throughout the call. One adaptive method for calculating the noise floor is the dual time constant integrator: if the energy of the signal at instant N is less than the noise floor, the noise floor is lowered quickly, but if the energy is greater than the noise floor, the noise floor rises slowly. This method is useful in situations in which the noise floor varies more slowly than the speech envelope. To further improve the performance of a VAD based on energy thresholds, state machines with varying thresholds can be used. Speech has a strong time correlation, so the states of the previous frames can help determine the probability of speech in the current frame.
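The energy-based scheme can be illustrated with a minimal Python sketch. It is not a production implementation: the class name `EnergyVad`, the smoothing coefficients, and the two threshold ratios are all hypothetical values chosen for illustration. The two-threshold hysteresis stands in for the "state machine with varying thresholds" idea in its simplest form, where the previous frame's decision selects the threshold applied to the current frame.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


class EnergyVad:
    """Energy-threshold VAD with a dual time constant noise-floor tracker.

    All default parameter values are illustrative assumptions.
    """

    def __init__(self, fall_coeff=0.5, rise_coeff=0.02,
                 on_ratio=4.0, off_ratio=2.0):
        self.fall_coeff = fall_coeff  # fast tracking when energy is below the floor
        self.rise_coeff = rise_coeff  # slow tracking when energy is above the floor
        self.on_ratio = on_ratio      # threshold ratio used while in silence
        self.off_ratio = off_ratio    # lower threshold ratio used while in speech
        self.noise_floor = None
        self.in_speech = False

    def update(self, energy):
        """Process one frame energy and return True if the frame is speech."""
        if self.noise_floor is None:
            # Assumption: the first frame is treated as background noise.
            self.noise_floor = energy

        # Two-state machine: the previous decision selects the threshold.
        ratio = self.off_ratio if self.in_speech else self.on_ratio
        self.in_speech = energy > ratio * self.noise_floor

        # Dual time constant integrator: the floor drops quickly, rises slowly.
        if energy < self.noise_floor:
            coeff = self.fall_coeff
        else:
            coeff = self.rise_coeff
        self.noise_floor += coeff * (energy - self.noise_floor)
        return self.in_speech


if __name__ == "__main__":
    vad = EnergyVad()
    energies = [1.0] * 5 + [100.0] * 3 + [1.0] * 5
    print([vad.update(e) for e in energies])
```

Because the floor rises slowly, a short speech burst barely moves the estimate, while the fast fall lets the floor recover within a few frames once the burst ends. A real system would likely also add a minimum floor value and a hangover period, both of which are omitted here.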
Waveform and spectral analysis for voice activity detection makes use of the known characteristics of speech. VADs of this kind are more computationally intensive than energy-based solutions, but they are able to detect voice in non-stationary noise and low-SNR scenarios. For example, voiced speech contains a strong fundamental frequency along with its harmonics. Thus, analysis of the cepstrum of a signal can reveal the source of the signal energy. In other words, if the spectral energy has a periodic nature, the cepstrum will have a peak related to that periodicity and hence to voice. If the cepstrum is flat, the signal energy could be from something like a door slamming or clapping. The kurtosis of the LPC residuals reveals similar characteristics of speech. Clean speech residuals have a large kurtosis. Thus, if the LPC residuals have a low kurtosis (a flatter, more uniform PDF), the signal is less likely to represent voiced speech.
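The kurtosis measure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the LPC coefficients are estimated with the autocorrelation method and the Levinson-Durbin recursion (one common choice), and the "voiced" test signal is a synthetic pulse train passed through an arbitrary second-order all-pole filter rather than real speech.

```python
import random


def lpc_coefficients(x, order):
    """Autocorrelation-method LPC via Levinson-Durbin; returns [1, a1, ..., ap]."""
    n = len(x)
    r = [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep the coefficients found so far
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a


def lpc_residual(x, a):
    """Prediction error e[n] = sum_j a[j] * x[n - j], with zeros before the frame."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


def kurtosis(values):
    """Pearson kurtosis m4 / m2**2 (about 3 for Gaussian data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) if m2 > 0.0 else 0.0


def all_pole_filter(excitation, a):
    """Synthesize x[n] = e[n] - sum_{j>=1} a[j] * x[n - j] (test helper)."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for j in range(1, len(a)):
            if n - j >= 0:
                acc -= a[j] * out[n - j]
        out.append(acc)
    return out


if __name__ == "__main__":
    a_true = [1.0, -1.3, 0.6]  # arbitrary stable filter, for illustration only
    pulses = [1.0 if n % 40 == 0 else 0.0 for n in range(400)]
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(400)]
    for name, excitation in (("pulse train", pulses), ("noise", noise)):
        x = all_pole_filter(excitation, a_true)
        res = lpc_residual(x, lpc_coefficients(x, 2))
        print(name, "residual kurtosis:", round(kurtosis(res), 2))
```

The pulse-excited signal leaves a sparse, spiky residual with a kurtosis far above 3, while the noise-excited signal leaves a residual with a kurtosis near the Gaussian value. A practical detector would compare the per-frame value against a tuned threshold, which is not specified here.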
VOCAL’s engineers are experts in the use of VAD in a wide array of signal processing challenges and environments. VADs are used in the following areas of signal processing:
- Echo cancellers − for control and various estimation routines
- Noise reduction systems − for aiding in the estimations of the noise spectrum and the probability of speech presence
- Vocoders − for determining when silence suppression packets can be sent
- Speech recognition − for removing periods of noise that will lower recognition rates