A Basic Voice Activity Detection (VAD)

Voice activity detection (VAD) is a sensitive component of spectrum subtraction. Its performance can strongly influence both the level of noise reduction and the severity of speech distortion.

The power spectrum of a speech segment has several characteristics that can be used to decide whether speech is present or absent. Commonly used measures include short-time averaged energy, zero-crossing rate, and linear prediction coefficients.
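As an illustration of two of these measures, the following is a minimal Python sketch that computes short-time energy and zero-crossing rate per frame. The function names, the non-overlapping framing, and the toy test tones are assumptions made for this example, not part of any particular VAD implementation.

```python
import math

def frame_signal(samples, frame_len):
    """Split a sequence of samples into consecutive non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Toy input (assumed values): a loud 100 Hz tone and a quiet 2000 Hz tone,
# both sampled at 8 kHz, 160 samples (20 ms) each.
fs = 8000
low = [math.sin(2 * math.pi * 100 * n / fs) for n in range(160)]
high = [0.1 * math.sin(2 * math.pi * 2000 * n / fs) for n in range(160)]

for name, frame in (("low", low), ("high", high)):
    print(name, short_time_energy(frame), zero_crossing_rate(frame))
```

In this toy case the low-frequency tone has the higher energy and the high-frequency tone has the higher zero-crossing rate, which is the kind of contrast a feature extraction stage can pass on to the decision logic.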

Figure 1: A Basic Voice Activity Detection (VAD) System

Voice Activity Detection

Figure 1 shows a basic voice activity detection (VAD) system. It has three main components. The feature extraction unit analyzes the signal and calculates a set of decision metrics, called feature vectors, which are used for VAD threshold computation and VAD decision making. Implementations may use application-specific definitions of the feature vectors.

Energy-based approaches are the most widely used because of their simplicity. The design rationale is that speech bursts have higher transient power than the background noise. A long-term energy average can be used to measure the background noise level, while a short-term energy average can be used to measure the transient speech power. If the short-term average exceeds the long-term average by a certain threshold, the VAD declares speech presence. The decision mechanism may be implemented by the following logic:

$VAD = \begin{cases} 1, & \text{if } E_s > E_l \\ 0, & \text{otherwise} \end{cases}$
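The decision rule can be sketched in Python as follows, taking E_s as the short-term energy average and E_l as the long-term one. This is a minimal sketch under stated assumptions, not a reference implementation: the exponential averaging, the smoothing factors `alpha_short` and `alpha_long`, the multiplicative `margin` standing in for the threshold, and the choice to freeze the noise estimate while speech is declared are all illustrative choices that do not come from the text.

```python
def energy_vad(frame_energies, alpha_short=0.5, alpha_long=0.05, margin=2.0):
    """Return one 0/1 decision per frame energy.

    e_short (E_s): fast exponential average, tracks transient speech power.
    e_long  (E_l): slow exponential average, tracks the background noise.
    All parameter values are illustrative assumptions.
    """
    if not frame_energies:
        return []
    decisions = []
    # Assumption: the first frame contains background noise only.
    e_short = e_long = frame_energies[0]
    for energy in frame_energies:
        e_short += alpha_short * (energy - e_short)
        speech = e_short > margin * e_long
        if not speech:
            # Update the noise estimate only when no speech is declared,
            # so that speech bursts do not inflate it (a design assumption).
            e_long += alpha_long * (energy - e_long)
        decisions.append(1 if speech else 0)
    return decisions

# Toy input: noise floor of 1.0 with a burst of 20.0 in the middle.
energies = [1.0] * 10 + [20.0] * 10 + [1.0] * 10
print(energy_vad(energies))
```

With `margin=1.0` the comparison reduces to the plain test E_s > E_l. Because the short-term average decays gradually, the detector in this sketch keeps declaring speech for a few frames after the burst ends.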