Frame Based Max Likelihood Classification of Speech & Noise

The influx of mobile services, with their attendant speech-enabled technology, has placed speech enhancement algorithms at the forefront of algorithm design for real-time operating systems (RTOS). The most fundamental task of any speech enhancement algorithm is to discriminate between speech and noise and to suppress noise as much as possible without affecting the performance of any dependent processes. The natural first step is to decide whether samples are speech, noise, or a combination of both. Some processes, such as a speech activity detector (SAD) or an automatic gain controller (AGC), depend only on determining whether speech is present or not, which is a binary decision on whether the samples are noise only. Samples are most often analyzed in frames rather than individually, so the decision has to be made on a frame of samples. A novel approach to this classification is to consider the standardized scores, or z-scores, of the samples instead of the samples themselves.

Consider speech and background noise samples modeled such that:

$s[k] \sim \mathbf{N}(\mu_s, \sigma_s^2), \forall k \in \{1,\cdots, \infty\}$

and

$\nu[k] \sim \mathbf{N}(\mu_{\nu}, \sigma_{\nu}^2) , \forall k \in \{1,\cdots, \infty\}$

where $s[k]$ is speech and $\nu[k]$ is background noise. Suppose we have estimates of $\{\mu_s, \sigma_s^2, \mu_{\nu}, \sigma_{\nu}^2\}$; then we can synthesize the standardized scores for any received sample $x[k]$ as:

$z_q[k] = \frac{x[k]-\mu_q}{\sigma_q}, q \in \{s, \nu\}$
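The standardization above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `z_scores` and the example parameter values are assumptions introduced here, and the estimates of the means and standard deviations are taken as given.

```python
def z_scores(frame, mu, sigma):
    """Standardize each sample against one class model (speech or noise).

    frame: iterable of received samples x[k]
    mu, sigma: estimated mean and standard deviation of the class
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return [(x - mu) / sigma for x in frame]


# Hypothetical estimates, chosen only for illustration.
x = [0.5, -1.0, 0.0]
z_s = z_scores(x, mu=0.0, sigma=0.5)    # scores under the speech model
z_nu = z_scores(x, mu=0.0, sigma=0.25)  # scores under the noise model
```

The same received samples produce two different score sequences, one per hypothesis, which is what the frame statistics that follow are built from.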

Now consider a frame of length $L$ and the synthesized signals $y_s$ and $y_{\nu}$ per frame such that:

$y_q = \sum\limits_{i=1}^L z_q^2[i], q \in \{s, \nu\}$

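If the samples in the frame are independent and drawn from class $q$, then $y_q$ is a sum of $L$ squared standard normal variables, so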

$y_q \sim \mathbf{\chi}^2_L, q \in \{s, \nu\}$
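A quick numerical check of the chi-square claim is sketched below, assuming independent Gaussian samples and exactly known parameters. The helper names `frame_statistic` and `mean_statistic` are hypothetical. A chi-square variable with $L$ degrees of freedom has mean $L$, so the average of the frame statistic over many simulated frames should land close to $L$.

```python
import random


def frame_statistic(frame, mu, sigma):
    """Sum of squared z-scores over one frame (the quantity y_q)."""
    return sum(((x - mu) / sigma) ** 2 for x in frame)


def mean_statistic(mu, sigma, L, trials, seed=0):
    """Average y_q over simulated frames drawn from N(mu, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        frame = [rng.gauss(mu, sigma) for _ in range(L)]
        total += frame_statistic(frame, mu, sigma)
    return total / trials


# Illustrative values only: frames of 16 samples, 4000 simulated frames.
estimate = mean_statistic(mu=0.3, sigma=2.0, L=16, trials=4000)
```

With these settings `estimate` should be close to 16. When the frame is scored against the wrong class model, the statistic is no longer chi-square distributed, which is what the classifier exploits.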

The probability density function (pdf) of $y_q$ is then:

$p(y_q | q) = \frac{1}{2^{\frac{L}{2}} \Gamma{(\frac{L}{2})}} y_q^{\frac{L}{2} -1} e^{-\frac{y_q}{2}}, q \in \{s,\nu\}$
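For numerical work it is usually safer to evaluate this pdf in the log domain, since $y_q^{L/2-1}$ and $\Gamma(L/2)$ grow quickly with $L$. The sketch below is one way to do that with the Python standard library; the function name `chi2_logpdf` is an assumption made for this example.

```python
import math


def chi2_logpdf(y, L):
    """Natural log of the chi-square pdf with L degrees of freedom at y > 0."""
    if y <= 0:
        raise ValueError("y must be positive")
    half = L / 2.0
    return (-half * math.log(2.0)       # 1 / 2^(L/2)
            - math.lgamma(half)         # 1 / Gamma(L/2)
            + (half - 1.0) * math.log(y)  # y^(L/2 - 1)
            - y / 2.0)                  # exp(-y/2)


# For L = 2 the pdf reduces to 0.5 * exp(-y/2), which gives a simple check.
value = chi2_logpdf(2.0, 2)
```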

To classify speech and noise, a hypothesis test is performed, with the decision based on the likelihood ratio:

$\frac{p(y_s | s) }{p(y_{\nu} | \nu) } = \left(\frac{y_s}{y_{\nu}}\right)^{\frac{L}{2}-1} e^{-\frac{y_s-y_{\nu}}{2}} \underset{noise}{\overset{speech}{\gtrless}} 1$
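The ratio test can be sketched as a per-frame classifier. This is an illustrative sketch under the Gaussian assumptions above, not a production detector: the names `log_likelihood_ratio` and `classify_frame` and the example parameters are hypothetical, and the test is evaluated in the log domain (ratio greater than 1 is the same as log ratio greater than 0) to avoid overflow for large $L$.

```python
import math


def log_likelihood_ratio(y_s, y_nu, L):
    """Log of the ratio test: (L/2 - 1) * ln(y_s / y_nu) - (y_s - y_nu) / 2."""
    if y_s <= 0 or y_nu <= 0:
        raise ValueError("frame statistics must be positive")
    return (L / 2.0 - 1.0) * math.log(y_s / y_nu) - (y_s - y_nu) / 2.0


def classify_frame(frame, mu_s, sigma_s, mu_nu, sigma_nu):
    """Return 'speech' if the log ratio is positive, otherwise 'noise'."""
    L = len(frame)
    y_s = sum(((x - mu_s) / sigma_s) ** 2 for x in frame)
    y_nu = sum(((x - mu_nu) / sigma_nu) ** 2 for x in frame)
    llr = log_likelihood_ratio(y_s, y_nu, L)
    return "speech" if llr > 0 else "noise"


# Hypothetical models: speech with sigma 1.0, noise with sigma 0.1.
loud = [1.0, -1.0] * 8    # amplitude matches the speech model
quiet = [0.1, -0.1] * 8   # amplitude matches the noise model
print(classify_frame(loud, 0.0, 1.0, 0.0, 0.1))
print(classify_frame(quiet, 0.0, 1.0, 0.0, 0.1))
```

An all-zero frame would make both statistics zero when the means are zero, so the sketch rejects it explicitly; a real implementation would need its own policy for that case.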

Taking the logarithm of both sides of the ratio test and rearranging gives:

$y_s \overset{noise}{\underset{speech}{\gtrless}} y_{\nu}-(L-2) \ln{\left(\frac{y_{\nu}}{y_s}\right)}$
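The intermediate steps are not spelled out in the text. One way to fill them in, using only the quantities already defined, is as follows. The normalizing constant $\frac{1}{2^{L/2} \Gamma(L/2)}$ is the same under both hypotheses and cancels in the ratio. Taking the natural logarithm of the ratio test gives:

$\left(\frac{L}{2}-1\right) \ln{\left(\frac{y_s}{y_{\nu}}\right)} - \frac{y_s-y_{\nu}}{2} \underset{noise}{\overset{speech}{\gtrless}} 0$

Multiplying by 2 and moving $y_s - y_{\nu}$ to the right-hand side:

$(L-2) \ln{\left(\frac{y_s}{y_{\nu}}\right)} \underset{noise}{\overset{speech}{\gtrless}} y_s - y_{\nu}$

Isolating $y_s$ on the left reverses the direction of the comparison, and $\ln{\left(\frac{y_s}{y_{\nu}}\right)} = -\ln{\left(\frac{y_{\nu}}{y_s}\right)}$ recovers the stated form:

$y_s \overset{noise}{\underset{speech}{\gtrless}} y_{\nu}+(L-2) \ln{\left(\frac{y_s}{y_{\nu}}\right)} = y_{\nu}-(L-2) \ln{\left(\frac{y_{\nu}}{y_s}\right)}$

Note that $y_s$ still appears on both sides, so the threshold is implicit rather than a fixed constant.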

This classification approach works well if the signal and noise are well separated in parameter space. To reduce the effect of spurious noise, the decisions can be low-pass filtered (LPF), or multiple previous frames can be averaged with the current frame. The LPF approach is more desirable in an RTOS because there is no need for extra space to store previous estimates. It is also easy to see that the decision is a function of the signal-to-noise ratio ($SNR$) if $L$ is sufficiently large. The decision equation then becomes:

$y_s \overset{noise}{\underset{speech}{\gtrless}} y_{\nu}+(L-2)\ln{\left(SNR\right)}$
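The low-pass filtering of decisions mentioned above can be sketched with a one-pole recursive filter, which keeps a single state value instead of a history of frames. The filter form, the function name `smooth_decisions`, the smoothing factor `alpha`, and the 0.5 threshold are all assumptions made for this illustration; the text does not specify a particular filter.

```python
def smooth_decisions(raw, alpha=0.3, threshold=0.5):
    """Low-pass filter a stream of binary frame decisions (1 = speech).

    state[n] = state[n-1] + alpha * (raw[n] - state[n-1])
    A frame is reported as speech when the filtered state exceeds threshold.
    """
    state = 0.0  # the only memory the filter needs
    smoothed = []
    for decision in raw:
        state += alpha * (decision - state)
        smoothed.append(1 if state > threshold else 0)
    return smoothed


# An isolated spurious detection (index 2) followed by a sustained run.
raw = [0, 0, 1, 0, 0, 1, 1, 1, 1, 0]
print(smooth_decisions(raw))  # prints [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

In this example the isolated detection is suppressed, the sustained run is reported after a delay of one frame, and the decision holds for one frame after the run ends. A smaller `alpha` smooths more at the cost of a longer delay.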

