VAD | Measuring Detector Performance

VOCAL’s Voice Activity Detection (VAD) is a state-of-the-art, highly optimized, flexible solution that can be tailored to the needs of your application. It can be combined with other algorithms to help them do their job effectively, or used to maximize the performance of our Voice Quality Enhancement (VQE) Stack. Our software is optimized for performance on DSPs and conventional processors from TI, ADI, ARM, AMD, Intel and other leading vendors. Call us today to discuss your application requirements.

The most critical part of any speech enhancement system is the Voice Activity Detector (VAD). Without a robust detector, many of the assumptions made by popular speech enhancement algorithms cannot be satisfied, which ultimately leads to degraded performance. VOCAL’s Voice Quality Enhancement Stack offers a highly optimized VAD for accurately predicting speech across a wide variety of environments. Here we discuss what it means to have a highly optimized Voice Activity Detector.

Obviously, a detector with an error rate of zero is the ultimate goal. However, this is generally not practical given the large number of different scenarios a VAD must handle. Instead, we optimize our VAD carefully by teasing apart just what it means to make an error. Indeed, there are two different kinds of errors that we need to be aware of, false positives (noise flagged as speech) and false negatives (speech flagged as noise):

Table 1: Error Types

We can visualize this table by creating what is called a confusion matrix for our Decision Module. Essentially, a confusion matrix expresses the table as percentages of all the possible occurrences, sometimes referred to as rates. A confusion matrix is an especially useful way to visualize this table in higher dimensions. For our Frost Beamformer, a (poor) confusion matrix would look like:

Figure 1: Frost Confusion Matrix
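As a rough illustration of how the rates in a confusion matrix can be computed, the following minimal sketch counts the four outcomes from per-frame labels and normalizes them by the number of true speech and true noise frames. The function name `confusion_rates`, the 0/1 frame labels, and the example data are assumptions made for illustration; they are not taken from VOCAL’s software.

```python
import numpy as np

def confusion_rates(truth, decision):
    """Return the four confusion-matrix rates for binary per-frame VAD labels.

    truth, decision: sequences of 0/1 values (1 = speech, 0 = noise only).
    Hypothetical helper for illustration only.
    """
    truth = np.asarray(truth, dtype=bool)
    decision = np.asarray(decision, dtype=bool)

    tp = np.sum(truth & decision)      # speech frames flagged as speech
    fn = np.sum(truth & ~decision)     # speech frames flagged as noise
    fp = np.sum(~truth & decision)     # noise frames flagged as speech
    tn = np.sum(~truth & ~decision)    # noise frames flagged as noise

    n_speech = tp + fn                 # frames that truly contain speech
    n_noise = fp + tn                  # frames that truly contain only noise
    nan = float("nan")                 # returned when a class has no frames
    return {
        "TPR": tp / n_speech if n_speech else nan,
        "FNR": fn / n_speech if n_speech else nan,
        "FPR": fp / n_noise if n_noise else nan,
        "TNR": tn / n_noise if n_noise else nan,
    }

# Made-up example: 4 speech frames followed by 6 noise frames.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
decision = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(confusion_rates(truth, decision))
```

Each pair of rates sums to one (TPR + FNR and FPR + TNR), so the matrix is fully described by one rate per row.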

For a specific application, one kind of error may be more critical than the other. This is what is known as risk in a decision, and these risks can be quantified and optimized against. An example of when False Negatives are riskier is the estimation of the noise autocorrelation matrix, since speech frames mislabeled as noise leak into the estimate. This estimate is often of great importance, perhaps most critically in adaptive beamforming. The popular Frost beamformer technically needs the exact autocorrelation matrix of the noise in order to be statistically optimal in the maximum SNR sense. Any signal leakage into this estimate severely degrades its performance relative to optimum, in the sense that the taps take orders of magnitude longer to converge depending on the degree of leakage.
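One common way to quantify such a risk is to weight each error rate by a cost and by how often its class occurs. The sketch below assumes this simple cost-weighted form; the function `expected_risk`, the cost values, the speech prior, and the two operating points are all hypothetical numbers chosen for illustration, not measurements of any real detector.

```python
def expected_risk(fnr, fpr, p_speech, cost_fn, cost_fp):
    """Cost-weighted error risk of one operating point (illustrative form).

    fnr, fpr : false negative and false positive rates
    p_speech : assumed prior probability that a frame contains speech
    cost_fn, cost_fp : assumed relative costs of each error type
    """
    return cost_fn * p_speech * fnr + cost_fp * (1.0 - p_speech) * fpr

# Two hypothetical operating points of the same detector.
point_a = {"fnr": 0.05, "fpr": 0.30}   # rarely misses speech, many false alarms
point_b = {"fnr": 0.20, "fpr": 0.05}   # misses more speech, few false alarms

# Equal costs versus a case where false negatives are ten times as costly.
for cost_fn in (1.0, 10.0):
    risk_a = expected_risk(p_speech=0.4, cost_fn=cost_fn, cost_fp=1.0, **point_a)
    risk_b = expected_risk(p_speech=0.4, cost_fn=cost_fn, cost_fp=1.0, **point_b)
    better = "A" if risk_a < risk_b else "B"
    print(f"cost_fn={cost_fn}: risk A={risk_a:.2f}, risk B={risk_b:.2f}, prefer {better}")
```

With equal costs the second point has the lower risk, but once false negatives are weighted heavily, as in the noise-estimation case, the first point becomes the better choice.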

We can visualize our VAD’s ability to mitigate such risks by examining the Receiver Operating Characteristic (ROC) of the Decision Module. This is a curve that illustrates the tradeoff between any two of these four rates, be they correct-detection rates or error rates. For the example of the Frost beamformer, we would be interested in the tradeoff between the true and false negative rates:

Figure 2: Frost Receiver Operating Characteristics are optimal for True Negative Rate of ~80%

For this application, we would want to select the corner point, where the True Negative Rate is just above 80%. Moving to the right from this point yields only marginal further improvement in the False Negative Rate, while moving to the left causes substantial degradation in True Negative detection. Therefore, this corner is the optimal point for this application’s decision module.
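The corner-point selection can be sketched in code by sweeping a decision threshold, recording the True Negative Rate and False Negative Rate at each setting, and picking the point where the gap between them is largest. Everything here is an assumption for illustration: the per-frame scores are synthetic Gaussian draws, the detector is a plain score threshold, and the "largest TNR minus FNR" rule (equivalent to Youden's J statistic) is just one common heuristic for locating a corner, not necessarily the criterion used for the figure above.

```python
import numpy as np

def roc_points(scores, truth, thresholds):
    """Return (TNR, FNR) arrays for a simple score-threshold detector."""
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    n_noise = np.sum(~truth)
    n_speech = np.sum(truth)
    tnr, fnr = [], []
    for t in thresholds:
        decision = scores >= t                 # True = frame declared speech
        tnr.append(np.sum(~truth & ~decision) / n_noise)
        fnr.append(np.sum(truth & ~decision) / n_speech)
    return np.array(tnr), np.array(fnr)

def corner_index(tnr, fnr):
    """Index of the point with the largest TNR - FNR gap (heuristic corner)."""
    return int(np.argmax(tnr - fnr))

# Synthetic per-frame scores: noise frames centered at 0, speech frames at 2.5.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 1.0, 5000), rng.normal(2.5, 1.0, 5000)])
truth = np.concatenate([np.zeros(5000, dtype=bool), np.ones(5000, dtype=bool)])

thresholds = np.linspace(-3.0, 6.0, 181)
tnr, fnr = roc_points(scores, truth, thresholds)
k = corner_index(tnr, fnr)
print(f"threshold={thresholds[k]:.2f}  TNR={tnr[k]:.3f}  FNR={fnr[k]:.3f}")
```

When one error type carries more risk than the other, the unweighted gap can be replaced by a cost-weighted criterion so that the selected point shifts toward the cheaper error.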
