Voice activity detection (VAD) is a sensitive component of spectral subtraction. Its performance strongly influences both the level of noise reduction and the severity of speech distortion.

The power spectrum of a speech segment has several characteristics that can be used to decide whether speech is present or absent. Commonly used measures include short-time average energy, zero-crossing rate, and linear prediction coefficients.
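As an illustration of one of the measures listed above, the following is a minimal sketch of a zero-crossing rate computation for a single frame. The function name `zero_crossing_rate` and the convention of treating a zero sample as non-negative are assumptions made for this example, not definitions taken from the text.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Assumption for this sketch: a sample equal to zero counts as non-negative.
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)
```

A frame that alternates in sign at every sample gives the maximum rate of 1.0, while a frame that never changes sign gives 0.0.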

Figure 1: A Basic Voice Activity Detection (VAD) System

Figure 1 shows a basic voice activity detection (VAD) system. It has three main components. The feature extraction unit analyzes the signal and calculates a set of decision metrics (feature vectors) to be used for VAD threshold computation and VAD decision making. Different implementations may use application-specific definitions of the feature vectors.

Energy-based approaches are the most widely used because of their simplicity. The design rationale is that speech bursts have higher transient power than the background noise. A long-term energy average can be used to measure the background noise level, while a short-term average can be used to measure the transient speech power. If the short-term average exceeds the long-term average by a certain threshold, the VAD declares that speech is present. The decision mechanism may be implemented by the following logic:

VAD = \begin{cases} 1, & \text{if } E_s > E_l \\ 0, & \text{otherwise} \end{cases}

where E_s and E_l are the short-term and long-term energy values. They can be calculated frame by frame in the time domain. The short-term energy is

E_s = \frac{1}{N} \sum_{t=1}^{N} x^2(t)

where N is the frame length and x(t) is the t-th sample of the frame.
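The short-term energy formula above can be sketched in a few lines. The function names `short_term_energy` and `frame_energies`, and the choice of non-overlapping frames with any trailing partial frame dropped, are assumptions made for this illustration.

```python
def short_term_energy(frame):
    """Mean of the squared samples in one frame: E_s = (1/N) * sum(x(t)^2)."""
    n = len(frame)
    if n == 0:
        raise ValueError("frame must contain at least one sample")
    return sum(x * x for x in frame) / n


def frame_energies(signal, frame_len):
    """Short-term energy of each non-overlapping frame of length frame_len.

    Assumption for this sketch: a trailing partial frame is discarded.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    last_start = len(signal) - frame_len
    return [
        short_term_energy(signal[i:i + frame_len])
        for i in range(0, last_start + 1, frame_len)
    ]
```

Practical systems often use overlapping or windowed frames. Non-overlapping frames are used here only to keep the example short.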

The long-term energy can be obtained from a frame-by-frame moving average with an adjustable decay parameter:

E_l = (1 - \gamma) E_s + \gamma E_l^{OLD}

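where E_l^{OLD} is the long-term energy from the previous frame and \gamma is the decay parameter. A value of \gamma closer to 1 gives more weight to the previous value, so the long-term energy changes more slowly.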
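The decision rule and the long-term update described above can be combined into a short sketch. The function name `energy_vad`, the default `gamma=0.9`, and the initialization of the long-term energy from the first frame are assumptions for this example, since the text does not specify them. The comparison uses the plain rule E_s > E_l with no additional threshold margin.

```python
def energy_vad(short_energies, gamma=0.9):
    """Return one 0/1 decision per frame from a list of short-term energies.

    Long-term energy follows E_l = (1 - gamma) * E_s + gamma * E_l_old.
    Assumption for this sketch: E_l starts at the first frame's energy.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must be strictly between 0 and 1")
    decisions = []
    e_long = None
    for e_short in short_energies:
        if e_long is None:
            e_long = e_short  # assumed initialization, not from the text
        else:
            e_long = (1.0 - gamma) * e_short + gamma * e_long
        # Speech is declared when short-term energy exceeds long-term energy.
        decisions.append(1 if e_short > e_long else 0)
    return decisions
```

Because E_s - E_l = \gamma (E_s - E_l^{OLD}) for 0 < \gamma < 1, comparing the short-term energy against the updated long-term value gives, in exact arithmetic, the same decision as comparing it against the previous frame's value.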

The performance of the VAD can be improved if a feedback approach is adopted for the long-term energy smoothing.