Autoregressive Model For Speech Processing

A well-known model for speech processing is the linear prediction model, which is usually formulated in the time domain. A dual of this approach exists in the frequency domain and may be preferred for real-time systems because of its potential savings in computational burden. The main advantage of linear predictive models is that they smooth the received signals, with the aim of achieving minimal distortion in nonlinear noise reduction.
We present a brief description of frequency domain linear prediction (FDLP).

Consider an acoustic signal impinging on a microphone, and suppose the signal at the microphone, $y$, can be written as:

$y[n] = s[n] + \sum\limits_{q=1}^{Q} a_q y[n-q]+\nu[n]$

where both the nonlinear attenuation and the delay have been subsumed, without loss of generality, in the source signal $s[n]$, and $\nu[n]$ is the noise, a zero-mean ergodic process. The $a_q, q \in \{1,\cdots, Q\}$, are the prediction coefficients. The Z-transform of this relation is:

$Y(z) = S(z) + \sum\limits_{q=1}^{Q} a_q Y(z)z^{-q}+\nu(z)$

Define the error signal as:

$E(z) = S(z) = Y(z)\left(1 - \sum\limits_{q=1}^{Q} a_q z^{-q} \right)-\nu(z)$
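
The time-domain recursion and the error signal defined above can be checked numerically. The following is a minimal sketch, assuming zero initial conditions and Gaussian noise; the function names `simulate_received` and `prediction_residual` and the coefficient values are hypothetical and chosen only for illustration.

```python
import random


def simulate_received(s, a, noise_std=0.0, seed=0):
    """Run y[n] = s[n] + sum_q a_q * y[n-q] + nu[n] with zero initial conditions.

    a[0] multiplies y[n-1], a[1] multiplies y[n-2], and so on.
    The noise nu[n] is assumed Gaussian with zero mean (an assumption of this sketch).
    """
    rng = random.Random(seed)
    y = []
    for n, s_n in enumerate(s):
        # Feedback from the Q most recent received samples.
        feedback = sum(a_q * y[n - q]
                       for q, a_q in enumerate(a, start=1) if n - q >= 0)
        nu_n = rng.gauss(0.0, noise_std)
        y.append(s_n + feedback + nu_n)
    return y


def prediction_residual(y, a):
    """Return e[n] = y[n] - sum_q a_q * y[n-q], which equals s[n] + nu[n]."""
    return [y_n - sum(a_q * y[n - q]
                      for q, a_q in enumerate(a, start=1) if n - q >= 0)
            for n, y_n in enumerate(y)]


# Hypothetical second-order model driven by a unit impulse, with no noise.
a = [0.5, -0.25]
s = [1.0, 0.0, 0.0, 0.0]
y = simulate_received(s, a)
e = prediction_residual(y, a)
```

With no noise, the residual `e` reproduces the source `s`, which is the time-domain counterpart of $E(z) = S(z)$.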

Now suppose the noise is uncorrelated with the received signal. The expectation of the squared error then becomes:

$E^2(z) = |Y(z)|^2 \left(1 + \sum\limits_{q=1}^{Q} a_q z^{-q} \sum\limits_{r=1}^{Q} a_r z^{r} - \sum\limits_{q=1}^{Q} a_q (z^{-q}+ z^{q}) \right)+\sigma_{\nu}^2(z)$

Evaluating the above equation on the unit circle, $z = e^{jw}$, and sampling in frequency gives the discrete Fourier transform form:

$E^2(w) = |Y(w)|^2 \left(1 + \sum\limits_{q=1}^{Q} \sum\limits_{r=1}^{Q} a_q a_r e^{-jw(q-r)} - 2 \sum\limits_{q=1}^{Q} a_q \cos{(wq)} \right)+\sigma_{\nu}^2(w)$
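
As a sanity check on the expansion, the bracketed factor can be compared against the direct form $|1 - \sum_q a_q e^{-jwq}|^2$. This is a small sketch, assuming real-valued coefficients; the function names and the test coefficients are hypothetical, and the noise term is left out.

```python
import cmath
import math


def error_gain_direct(a, w):
    """Evaluate |1 - sum_q a_q e^{-jwq}|^2 directly."""
    A = sum(a_q * cmath.exp(-1j * w * q) for q, a_q in enumerate(a, start=1))
    return abs(1 - A) ** 2


def error_gain_expanded(a, w):
    """Evaluate 1 + sum_q sum_r a_q a_r e^{-jw(q-r)} - 2 sum_q a_q cos(wq)."""
    Q = len(a)
    double_sum = sum(a[q - 1] * a[r - 1] * cmath.exp(-1j * w * (q - r))
                     for q in range(1, Q + 1) for r in range(1, Q + 1))
    cross = 2 * sum(a[q - 1] * math.cos(w * q) for q in range(1, Q + 1))
    # The imaginary parts of the double sum cancel for real coefficients.
    return (1 + double_sum - cross).real


# Compare the two forms on a few frequencies for hypothetical coefficients.
coeffs = [0.5, -0.25, 0.1]
freqs = [0.0, 0.3, 1.0, math.pi / 2, 3.0]
max_gap = max(abs(error_gain_direct(coeffs, w) - error_gain_expanded(coeffs, w))
              for w in freqs)
```

The two forms should agree up to floating-point rounding, so `max_gap` is expected to be negligibly small.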

For real-valued coefficients, the gradient of the squared error with respect to each prediction coefficient $a_k$ is:

$\frac{\partial E^2(w) }{\partial a_k} = |Y(w)|^2 \left( 2a_k +\sum\limits_{q \ne k}^{Q} a_q \left(e^{-jw(q-k)} + e^{jw(q-k)}\right) - 2 \cos{(wk)} \right) = 2|Y(w)|^2 \left( a_k +\sum\limits_{q \ne k}^{Q} a_q \cos{(w(q-k))} - \cos{(wk)} \right)$

Setting the gradient to zero gives:

$\Rightarrow a_k= \cos{(wk)} - \sum\limits_{q \ne k}^{Q} a_q \cos{(w(q-k))}$

The exponentials, whose real parts give the cosine terms, are obtained from the so-called cross coherence. In matrix form, this becomes:

$\begin{bmatrix} 1 & \cos(w) & \cos(2w) & \cdots & \cos((Q-1)w)\\ \cos(w) & 1 & \cos(w) & \cdots & \cos((Q-2)w)\\& & \ddots & & \\ \cos((Q-1)w) & \cos((Q-2)w) & \cos((Q-3)w) & \cdots & 1\\ \end{bmatrix} \begin{bmatrix} a_1\\a_2\\ \vdots\\a_Q \end{bmatrix} = \begin{bmatrix} \cos(w)\\ \cos(2w)\\\vdots\\ \cos{(Qw)}\end{bmatrix}$
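
One way to turn the stationarity condition into a solvable system is to sum the gradient over all DFT bins, weighting each bin by $|Y(w)|^2$ and keeping the real (cosine) part of the exponentials. That summation is an assumption of this sketch and is not stated in the text; it is used here because the cosine system at a single frequency has rank at most two. The function names, the naive DFT and the small Gaussian-elimination solver are all hypothetical helpers for illustration.

```python
import cmath
import math


def power_spectrum(y):
    """Return |Y[m]|^2 for m = 0..N-1 using a naive O(N^2) DFT."""
    N = len(y)
    return [abs(sum(y_n * cmath.exp(-2j * math.pi * m * n / N)
                    for n, y_n in enumerate(y))) ** 2
            for m in range(N)]


def solve_linear(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def estimate_prediction_coefficients(y, Q):
    """Solve sum_w |Y(w)|^2 [sum_q a_q cos(w(q-k)) - cos(wk)] = 0 for k = 1..Q."""
    N = len(y)
    P = power_spectrum(y)
    w = [2 * math.pi * m / N for m in range(N)]

    def weighted_cos(lag):
        # Power-weighted sum of cos(w * lag) over all DFT bins.
        return sum(P_m * math.cos(w_m * lag) for P_m, w_m in zip(P, w))

    R = [[weighted_cos(q - k) for q in range(1, Q + 1)] for k in range(1, Q + 1)]
    r = [weighted_cos(k) for k in range(1, Q + 1)]
    return solve_linear(R, r)


# Noise-free impulse response of a hypothetical second-order model.
true_a = [0.5, -0.25]
y = []
for n in range(64):
    feedback = sum(a_q * y[n - q]
                   for q, a_q in enumerate(true_a, start=1) if n - q >= 0)
    y.append((1.0 if n == 0 else 0.0) + feedback)

a_hat = estimate_prediction_coefficients(y, Q=2)
```

For this noise-free, fast-decaying test signal, `a_hat` should match `true_a` closely. With noisy or short signals the estimate is only approximate, and a production system would normally use an FFT rather than the naive DFT shown here.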

VOCAL Technologies offers custom designed solutions for beamforming with a robust voice activity detector, acoustic echo cancellation and noise suppression. Our custom implementations of such systems are meant to deliver optimum performance for your specific beamforming task. Contact us today to discuss your solution!
