# Adaptive voice activity detection using time domain zero crossings

In speech enhancement algorithms, a voice activity detector (VAD) is used not only to restrict the heavier processing to frames that actually contain speech, but also to track the noise floor dynamically. A VAD is essentially designed to distinguish noise frames from non-noise frames, so it can exploit any characteristic of noise that is not present in speech. One such characteristic is the number of zero crossings: on average, speech produces fewer zero crossings than i.i.d. noise. Consider a VAD based on the zero-crossing count, with the count computed by a simple difference and comparator unit that counts the samples in a frame satisfying $y[n] - y[n-1] < y[n]$. Notice that this holds for both positive-to-negative and negative-to-positive transitions. Suppose the received signal at the microphone is given as:

$y[n] = s[n] + \nu_i[n]$

where $s[n]$ is the desired speech signal and $\nu_i[n]$ is i.i.d. zero-mean Gaussian noise. Then, the threshold can be adaptively computed using the equation:

$\alpha_T[n] = \beta_1 \underset{n}{\operatorname{argmax}} \sum\limits_{m=0}^{M-1} ( y[n-m] - y[n-m-1] < y[n-m] ) + (1-\beta_1) \underset{n}{\operatorname{argmin}} \sum\limits_{m=0}^{M-1} ( y[n-m] - y[n-m-1] < y[n-m] )$

where $M$ is the number of samples per frame and $0 \le \beta_1 \le 1$ is a design parameter. A gradual decay of the maximum level and a gradual magnification of the minimum level are also applied, so that the threshold does not get stuck at a spurious point. A sample performance of this algorithm is shown in Figure 1.
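
The adaptive threshold can be illustrated with a minimal Python sketch, which is one possible reading of the scheme and not a reference implementation. Several parts are assumptions: the zero-crossing count uses a standard sign-change test in place of the difference and comparator condition in the text; the two terms of $\alpha_T[n]$ are interpreted as the running maximum and minimum of the per-frame counts; the decay and magnification are modelled as a slow linear drift of the two levels toward each other, controlled by a hypothetical `leak` parameter; and a frame is labelled speech when its count falls below the threshold, following the observation that speech has fewer zero crossings than i.i.d. noise. The function names and default values are likewise illustrative.

```python
import math
import random


def zero_crossings(frame):
    """Count sign changes between consecutive samples (zero counts as non-negative)."""
    return sum(1 for prev, cur in zip(frame, frame[1:]) if (prev < 0) != (cur < 0))


def adaptive_zc_vad(y, frame_len=160, beta1=0.5, leak=0.01):
    """Return one boolean per non-overlapping frame: True means 'speech'.

    beta1 weights the running maximum against the running minimum count.
    leak (assumed, 0 <= leak < 0.5) sets how quickly the two levels relax.
    """
    decisions = []
    zc_max = zc_min = None
    for start in range(0, len(y) - frame_len + 1, frame_len):
        zc = zero_crossings(y[start:start + frame_len])
        if zc_max is None:
            # First frame: both levels start at the observed count.
            zc_max = zc_min = float(zc)
        else:
            # Relax the levels toward each other, then let the new count
            # push them back out if it exceeds either one.
            gap = zc_max - zc_min
            zc_max = max(zc, zc_max - leak * gap)
            zc_min = min(zc, zc_min + leak * gap)
        threshold = beta1 * zc_max + (1.0 - beta1) * zc_min
        # Assumed decision rule: fewer crossings than the threshold -> speech.
        decisions.append(zc < threshold)
    return decisions


if __name__ == "__main__":
    random.seed(0)
    frame_len = 160  # 20 ms at an assumed 8 kHz sampling rate

    def noise(n):
        return [random.gauss(0.0, 0.05) for _ in range(n)]

    # A 200 Hz tone stands in for voiced speech in this toy signal.
    tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(5 * frame_len)]
    speech_like = [t + v for t, v in zip(tone, noise(len(tone)))]
    y = noise(5 * frame_len) + speech_like + noise(5 * frame_len)
    print(adaptive_zc_vad(y, frame_len))
```

Until the detector has seen both a noise-like and a speech-like frame, the maximum and minimum levels sit close together, so the first few decisions are unreliable. A practical design would likely add a warm-up period or a minimum separation between the two levels.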