Voice activity detection (VAD) is a sensitive component of spectral subtraction. Its performance can dramatically influence both the level of noise reduction and the severity of speech distortion.

The power spectrum of a speech segment has several characteristics that can be used to decide whether speech is present or absent. Commonly used measures include short-time averaged energy, zero-crossing rate, and linear prediction coefficients.

Figure 1 shows a basic VAD system. It has three main components. The feature extraction unit analyzes the input and calculates a set of decision metrics (feature vectors) to be used for VAD threshold computation and VAD decision making. Implementations may use application-specific definitions of the feature vectors.

Energy-based approaches are the most commonly used because of their simplicity. The design rationale is that speech bursts have higher transient power than the background noise. A long-term energy average can be used to measure the background noise level, while a short-term average can be used to measure the transient speech power. If the short-term average exceeds the long-term average by a certain threshold, the VAD declares speech presence. The decision mechanism may be implemented by the following logic:

$VAD=\begin{cases}1, & \text{if } E_s>E_l\\ 0, & \text{otherwise}\end{cases}$

where $E_s$ and $E_l$ are the short- and long-term energy values. They can be calculated frame by frame in the time domain. The short-term energy of a frame is

$E_s=\frac{1}{N}\sum_{t=1}^{N}x^2(t)$

where $N$ is the frame length and $x(t)$ is the $t$-th sample of the frame.
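The short-term energy formula above can be sketched in a few lines of Python. This is only an illustration: the function name `frame_energies`, the use of non-overlapping frames, and the choice to drop a trailing partial frame are assumptions, not details given in the text.

```python
def frame_energies(samples, frame_len):
    """Return E_s (mean squared amplitude) for each full frame of `samples`."""
    energies = []
    # Assumption: frames do not overlap, and a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # E_s = (1/N) * sum of x(t)^2 over the N samples of the frame
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


# A silent frame followed by a frame of unit-amplitude samples.
signal = [0.0, 0.0, 0.0, 0.0, 1.0, -1.0, 1.0, -1.0]
print(frame_energies(signal, frame_len=4))  # [0.0, 1.0]
```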

The long-term energy can be obtained from a frame-by-frame moving average with an adjustable decay parameter:

$E_l=(1-\gamma)E_s+\gamma E_l^{OLD}$

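where $E_l^{OLD}$ is the long-term energy from the previous frame and $\gamma$ is the decay parameter, with $0<\gamma<1$. Values of $\gamma$ closer to 1 make the long-term average adapt more slowly.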
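As a rough sketch, the recursive long-term average and the decision rule can be combined into a single loop over per-frame short-term energies. The function name `energy_vad`, the additive `threshold` parameter, and seeding the long-term average with the first frame are assumptions made for illustration; the text does not specify how the threshold is applied or how $E_l$ is initialized.

```python
def energy_vad(short_energies, gamma=0.75, threshold=0.0):
    """Return one 0/1 speech decision per frame from short-term energies E_s."""
    decisions = []
    long_term = None
    for e_s in short_energies:
        if long_term is None:
            # Assumption: seed the long-term average with the first frame.
            long_term = e_s
        # E_l = (1 - gamma) * E_s + gamma * E_l_old
        long_term = (1.0 - gamma) * e_s + gamma * long_term
        # Declare speech when E_s exceeds E_l by the (hypothetical) threshold.
        decisions.append(1 if e_s > long_term + threshold else 0)
    return decisions


# Three quiet frames, two loud frames, then quiet again.
print(energy_vad([1.0, 1.0, 1.0, 16.0, 16.0, 1.0]))  # [0, 0, 0, 1, 1, 0]
```

With `threshold=0.0` the rule reduces to $E_s>E_l$. In practice a positive threshold is likely needed so that small fluctuations in the noise energy do not trigger the detector.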

The performance of the VAD can be improved if a feedback approach is adopted in the long-term energy smoothing.
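The text does not say which feedback scheme is intended. One possible interpretation, shown here purely as an assumption, is to feed the VAD decision back into the smoothing: the long-term average is updated only on frames classified as noise, so that long speech bursts do not inflate the background estimate. The names `energy_vad_with_feedback` and `threshold` are hypothetical.

```python
def energy_vad_with_feedback(short_energies, gamma=0.75, threshold=0.0):
    """Energy VAD whose noise estimate is frozen while speech is declared."""
    decisions = []
    noise_level = None
    for e_s in short_energies:
        if noise_level is None:
            # Assumption: the first frame contains background noise only.
            noise_level = e_s
        # Compare against the estimate from previous frames.
        is_speech = 1 if e_s > noise_level + threshold else 0
        if not is_speech:
            # Feedback: update the long-term average only on noise frames.
            noise_level = (1.0 - gamma) * e_s + gamma * noise_level
        decisions.append(is_speech)
    return decisions


# Two quiet frames, a long loud burst, then quiet again.
print(energy_vad_with_feedback([1.0, 1.0, 16.0, 16.0, 16.0, 16.0, 1.0]))
# [0, 0, 1, 1, 1, 1, 0]
```

A known weakness of this kind of gating is that a sudden, lasting rise in the noise level can be classified as speech indefinitely, because the estimate is never allowed to catch up. A practical design would need some recovery mechanism.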