Max Likelihood Classification of Speech & Background Noise

The influx of mobile services with their attendant speech-enabled technology has placed speech enhancement at the forefront of algorithm design for real-time operating systems (RTOS). The most fundamental task of any speech enhancement algorithm is to discriminate between speech and noise and to suppress the noise as much as possible without affecting the performance of any dependent processes. The natural first step is therefore to decide whether samples are speech, noise, or a combination of both. Some processes, such as automatic gain controllers (AGC), depend only on determining whether speech is present or not, so a binary decision on whether samples are noise only is sufficient. A novel approach to this classification is to consider the peak signal levels and assume that they are Gaussian distributed, with the background noise also assumed to be Gaussian distributed. The peak levels are then classified frame by frame, or over aggregated groups of frames.

Consider the peak signal levels of speech and noise, denoted $l_s$ and $l_n$ respectively, such that:

$l_s \sim \mathbf{N}(\mu_s, \sigma_s^2)$

and

$l_n \sim \mathbf{N}(\mu_n, \sigma_n^2)$

Then the probability density function (pdf) is:

$p(l_q | q) = \frac{1}{\sqrt{2\pi \sigma_q^2}} e^{-\frac{(l_q -\mu_q)^2}{2\sigma_q^2}}, q \in \{s,n\}$

Define the standard score, or z-score, as $z_q = \frac{(l_q -\mu_q)}{\sigma_q}$; the pdf can then be written as:

$p(l_q | q) = \frac{1}{\sqrt{2\pi \sigma_q^2}} e^{-\frac{z_q^2}{2}}, q \in \{s,n\}$

To classify an observed peak level $l_x$ as speech or noise, a likelihood ratio hypothesis test is performed, with $z_s$ and $z_n$ evaluated at $l_x$ and the decision based on:

$\frac{p(l_x | s) }{p(l_x | n) } = \frac{\sigma_n}{\sigma_s} e^{-\frac{z_s^2 - z_n^2}{2}} \underset{noise}{\overset{speech}{\gtrless}} 1$

Taking the natural logarithm of both sides and rearranging gives:

$z_s^2 \overset{noise}{\underset{speech}{\gtrless}} z_n^2-2 \ln{\left(\frac{\sigma_s}{\sigma_n}\right)}$
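The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_peak`, the choice to assign ties to noise, and the parameter values in the usage note are assumptions, not part of the original method description. The means and standard deviations are assumed to be known or estimated elsewhere.

```python
import math


def classify_peak(level, mu_s, sigma_s, mu_n, sigma_n):
    """Classify one peak level as "speech" or "noise".

    Implements z_s^2 < z_n^2 - 2*ln(sigma_s / sigma_n)  =>  speech.
    Ties are assigned to noise (an arbitrary choice for this sketch).
    """
    z_s = (level - mu_s) / sigma_s  # z-score under the speech model
    z_n = (level - mu_n) / sigma_n  # z-score under the noise model
    threshold = z_n ** 2 - 2.0 * math.log(sigma_s / sigma_n)
    return "speech" if z_s ** 2 < threshold else "noise"
```

For example, with hypothetical peak-level statistics in dB of $\mu_s = -20$, $\sigma_s = 6$, $\mu_n = -50$, $\sigma_n = 3$, a peak level of $-22$ is classified as speech and a peak level of $-49$ as noise.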

This classification approach works well if the speech and noise are well separated in parameter space. To reduce the effect of spurious noise, the decisions can be low-pass filtered (LPF), or multiple previous frames can be averaged with the current frame. The LPF approach is more desirable in an RTOS because there is no need for extra space to store previous estimates. It is also easy to see that the decision is a function of the peak-signal-level signal-to-noise ratio ($SNR_l$), where $SNR_l = \frac{\sigma_s^2}{\sigma_n^2}$, since $2\ln{\left(\frac{\sigma_s}{\sigma_n}\right)} = \ln{(SNR_l)}$. Note that this is not the actual signal-to-noise ratio, because the variances here are those of the peak signal levels and not of the speech or noise signals. So the decision equation becomes:

$z_s^2 \overset{noise}{\underset{speech}{\gtrless}} z_n^2-\ln{(SNR_l)}$
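As a rough sketch of how the low-pass filtered decision might look, the example below applies the $SNR_l$ form of the rule to each frame and smooths the raw binary decisions with a one-pole recursive filter. The one-pole filter, the smoothing factor `alpha`, the 0.5 decision threshold, and the function name `smoothed_decisions` are all assumptions made for illustration; the text above does not specify a particular filter. Only a single state variable is carried between frames, which reflects the storage advantage mentioned above.

```python
import math


def smoothed_decisions(levels, mu_s, sigma_s, mu_n, sigma_n,
                       alpha=0.3, threshold=0.5):
    """Return one boolean per frame: True where smoothed output says speech.

    alpha and threshold are hypothetical tuning values for this sketch.
    """
    log_snr_l = math.log((sigma_s / sigma_n) ** 2)  # ln(SNR_l)
    state = 0.0  # the only value stored between frames
    flags = []
    for level in levels:
        z_s = (level - mu_s) / sigma_s
        z_n = (level - mu_n) / sigma_n
        # Raw per-frame decision: 1.0 for speech, 0.0 for noise
        raw = 1.0 if z_s ** 2 < z_n ** 2 - log_snr_l else 0.0
        # One-pole low-pass filter on the raw decisions
        state += alpha * (raw - state)
        flags.append(state > threshold)
    return flags
```

With these assumed settings, a single isolated frame classified as speech does not flip the smoothed output, while a run of consecutive speech frames does. The price is a delay of a few frames at speech onsets and offsets, which grows as `alpha` is reduced.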

VOCAL Technologies offers custom designed solutions for beamforming with a robust voice activity detector, acoustic echo cancellation and noise suppression. Our custom implementations of such systems are meant to deliver optimum performance for your specific beamforming task. Contact us today to discuss your solution!
