Voice Activity Detection With Adaptive Thresholding

Speech enhancement algorithms use a voice activity detector (VAD) both to restrict processing to frames that actually contain speech and to track the noise floor dynamically. In an adaptive VAD, the speech-detection threshold is updated continuously.
Consider an energy-based VAD, with the energy computed as an average of the instantaneous temporal energies. Suppose the signals received at the microphones are given by:

$y_i[n] = s[n-\tau_i] + \nu_i[n]$

where $s[n]$ is the desired speech signal, $\tau_i$ is the delay at microphone $i$ relative to microphone 1 (so $\tau_1 = 0$), and $\nu_i[n]$ is i.i.d. zero-mean Gaussian noise. The threshold can then be computed adaptively using the equation:

$\alpha_T[n] = \beta_1 \underset{n}{\operatorname{argmax}} \sum\limits_{m=0}^{M-1} \left(\sum\limits_{i=1}^{N} y_i[n-m]\right)^2 + (1-\beta_1) \underset{n}{\operatorname{argmin}} \sum\limits_{m=0}^{M-1} \left(\sum\limits_{i=1}^{N} y_i[n-m]\right)^2$

where $M$ is the number of samples per frame, $N$ is the number of microphones in the array, and $0 \le \beta_1 \le 1$ is a design parameter. Sample performance of this algorithm is shown in Figure 1 below:

Figure 1: Adaptive VAD thresholding
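
As a rough illustration of the threshold update, the following Python sketch reads the two extremal terms of the threshold equation as the largest and smallest frame energies observed so far, weighted by $\beta_1$ and $1-\beta_1$. That reading is an assumption, as are the non-overlapping frames, the "energy above threshold" decision rule, and the function names `frame_energies`, `adaptive_thresholds` and `detect_speech`, which are hypothetical and not part of any VOCAL implementation.

```python
def frame_energies(channels, frame_len):
    """Energy of the channel-summed signal over non-overlapping frames.

    channels: list of N equal-rate sample lists, one per microphone (y_i[n]).
    frame_len: M, the number of samples per frame.
    """
    n_samples = min(len(ch) for ch in channels)
    # Inner sum over microphones: sum_i y_i[n]
    summed = [sum(ch[n] for ch in channels) for n in range(n_samples)]
    n_frames = n_samples // frame_len
    # Outer sum over the M samples of each frame of the squared channel sum
    return [
        sum(x * x for x in summed[k * frame_len:(k + 1) * frame_len])
        for k in range(n_frames)
    ]


def adaptive_thresholds(energies, beta1):
    """Per-frame threshold: beta1 * (max energy so far) + (1 - beta1) * (min energy so far)."""
    if not 0.0 <= beta1 <= 1.0:
        raise ValueError("beta1 must lie in [0, 1]")
    thresholds = []
    e_max = float("-inf")
    e_min = float("inf")
    for e in energies:
        # Assumption: extremes are tracked over all frames seen so far.
        e_max = max(e_max, e)
        e_min = min(e_min, e)
        thresholds.append(beta1 * e_max + (1.0 - beta1) * e_min)
    return thresholds


def detect_speech(energies, thresholds):
    """Assumed decision rule: a frame is speech if its energy exceeds the threshold."""
    return [e > t for e, t in zip(energies, thresholds)]
```

With $\beta_1$ close to 0 the threshold sits near the estimated noise floor and the detector becomes more permissive; with $\beta_1$ close to 1 it sits near the peak energy and only the loudest frames are flagged. A practical system would likely let the running extremes decay over time so that the threshold can follow a changing noise floor.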

The processed speech is illustrated in Figure 2 below:

Figure 2: Processed speech using adaptive VAD

VOCAL Technologies offers custom-designed solutions for beamforming with a robust voice activity detector, acoustic echo cancellation and noise suppression. Our custom implementations of such systems are meant to deliver optimum performance for your specific beamforming task. Contact us today to discuss your solution!
