Blind source separation (BSS) of far-field sources recovers an unknown number of source signals from the observed mixture signals. Consider a noise-free instantaneous mixture of the form $\begin{bmatrix} x_1(w,t) \\ x_2(w,t) \\ \vdots \\ x_N(w,t) \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1M}\\ a_{21} & a_{22} & \cdots & a_{2M}\\ \vdots & \vdots & \ddots & \vdots\\a_{N1} & a_{N2} & \cdots & a_{NM} \end{bmatrix} \begin{bmatrix} s_1(w,t) \\ s_2(w,t) \\ \vdots \\ s_M(w,t) \end{bmatrix}$

where there are $M$ sources and $N$ observations. The signals are expressed in the time-frequency domain because speech signals have been shown to be sparser there than in the time domain. Under the assumption that only one source, say the $k^{th}$, is active in each time-frequency bin, which implies that the signal space is sparse, each observation in that bin reduces to $x_i(w,t) = a_{ik}s_k(w,t)$.
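As a minimal sketch of this model, the following Python snippet builds a toy set of sparse sources, mixes them, and checks that each observation in a bin equals $a_{ik}s_k(w,t)$ for the single active source $k$. The mixing matrix `A`, the sizes, and the helper names `sparse_sources` and `mix` are illustrative assumptions, not values from a real recording.

```python
import random

random.seed(0)

N, M, BINS = 2, 3, 8  # observations, sources, time-frequency bins (toy sizes)

# Hypothetical real-valued mixing matrix (N x M); A[i][k] plays the role of a_ik.
A = [[1.0, 0.8, 0.5],
     [0.4, 1.0, 0.9]]


def sparse_sources(num_sources, num_bins):
    """Return S[k][b] with exactly one nonzero complex entry per bin b."""
    S = [[0j] * num_bins for _ in range(num_sources)]
    active = []
    for b in range(num_bins):
        k = random.randrange(num_sources)  # index of the active source
        S[k][b] = complex(random.gauss(0, 1), random.gauss(0, 1))
        active.append(k)
    return S, active


def mix(A, S):
    """Instantaneous mixing: X[i][b] = sum over k of A[i][k] * S[k][b]."""
    num_bins = len(S[0])
    return [[sum(A[i][k] * S[k][b] for k in range(len(S)))
             for b in range(num_bins)]
            for i in range(len(A))]


S, active = sparse_sources(M, BINS)
X = mix(A, S)

# Largest deviation between each observation and a_ik * s_k for the active k.
max_error = max(abs(X[i][b] - A[i][k] * S[k][b])
                for b, k in enumerate(active)
                for i in range(N))
print("largest deviation from a_ik * s_k:", max_error)
```

The deviation is at the level of floating-point rounding because every inactive source contributes exactly zero to the sum.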

For the two-microphone case ($N = 2$), dividing the two observations cancels the source term: $\frac{x_1(w,t)}{x_2(w,t)} = \frac{a_{1k}s_k(w,t)}{a_{2k}s_k(w,t)} = \frac{a_{1k}}{a_{2k}}$.
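In the noise-free case the per-bin ratios take only $M$ distinct values, so a clustering step can group them. The sketch below assumes a real-valued, hypothetical mixing matrix `A` (so the ratios are real) and uses a small hand-written one-dimensional k-means, `kmeans_1d`, to keep the example dependency-free; with complex mixing coefficients the clustering would instead operate in the complex plane.

```python
import random

random.seed(1)

# Hypothetical 2 x 3 mixing matrix (two microphones, three sources).
A = [[1.0, 0.8, 0.5],
     [0.4, 1.0, 0.9]]
M = 3
BINS = 12

# One active source per bin; here source k = b % M is active in bin b.
active = [b % M for b in range(BINS)]
s = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(BINS)]
x1 = [A[0][k] * s[b] for b, k in enumerate(active)]
x2 = [A[1][k] * s[b] for b, k in enumerate(active)]

# Per-bin ratio x1/x2: the source value cancels, leaving a_1k / a_2k.
ratios = [(u / v).real for u, v in zip(x1, x2)]


def kmeans_1d(data, k, iters=20):
    """Plain Lloyd iterations on scalars, initialised from sorted quantiles."""
    ordered = sorted(data)
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for value in data:
            nearest = min(range(k), key=lambda j: abs(value - centers[j]))
            groups[nearest].append(value)
        # Keep the previous center if a cluster ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


estimated = kmeans_1d(ratios, M)
expected = sorted(A[0][k] / A[1][k] for k in range(M))
print("estimated ratios:", estimated)
print("true ratios:     ", expected)
```

Only the ratios are recovered, not the individual coefficients, which reflects the scaling ambiguity inherent in this approach.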

Using k-means clustering, the ratios $\frac{a_{1k}}{a_{2k}}$ are used to determine the $M$ mixing parameters up to a scaling term. In real situations, the signals are noisy and the ratios will not be exact. In such scenarios, an alternative approach is used. Consider the cross-covariance matrix, denoted $R_{X(w,t)}$, given as $R_{X(w,t)} = \begin{bmatrix} x_1(w,t) \\ x_2(w,t) \\ \vdots \\ x_N(w,t)\end{bmatrix} \begin{bmatrix} x_1(w,t) & x_2(w,t) & \cdots & x_N(w,t)\end{bmatrix} ^*$

where $*$ denotes the complex conjugate. The SVD of $R_{X(w,t)}$ then yields a single significant singular value (which equals the principal eigenvalue, since the matrix is Hermitian and positive semidefinite) if there is only one active speaker per time-frequency bin. $R_{X(w,t)}$ can be averaged to approximate its expected value using $\mathbb{E}[R_{X(w,t)}] = \frac{1}{T} \sum\limits_{w} \begin{bmatrix} x_1(w,t) \\ x_2(w,t) \\ \vdots \\ x_N(w,t)\end{bmatrix} \begin{bmatrix} x_1(w,t) & x_2(w,t) & \cdots & x_N(w,t)\end{bmatrix} ^*$

where $T$ is the length of the STFT. Since only one active signal is assumed, the eigenvector corresponding to the principal eigenvalue is chosen as the estimate of the mixing coefficients of the $k^{th}$ active signal, $a_{ik} \; \forall i$. Clustering is subsequently used to separate the desired signals.
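The covariance-based estimate can be sketched as follows for two microphones and a single active source observed in noise. Everything numeric here is an assumption made for illustration: the mixing column `a_k`, the noise level `NOISE`, and the number of averaged bins `T`. Power iteration is swapped in for a full SVD so that the example needs only the standard library; a numerical library's eigendecomposition would normally be used instead.

```python
import random

random.seed(2)

a_k = [1.0, 0.4]  # hypothetical mixing column of the active source
T = 256           # number of bins averaged (assumed)
NOISE = 0.05      # per-component noise standard deviation (assumed)


def cgauss(scale):
    """Complex Gaussian sample with the given per-component std deviation."""
    return complex(random.gauss(0, scale), random.gauss(0, scale))


def averaged_covariance(bins):
    """Return (1/T) * sum of x x^H over the given observation vectors."""
    n = len(bins[0])
    R = [[0j] * n for _ in range(n)]
    for x in bins:
        for i in range(n):
            for j in range(n):
                R[i][j] += x[i] * x[j].conjugate()
    return [[R[i][j] / len(bins) for j in range(n)] for i in range(n)]


def matvec(R, v):
    return [sum(R[i][j] * v[j] for j in range(len(v))) for i in range(len(R))]


def principal_eigenpair(R, iters=100):
    """Power iteration: principal eigenvalue and eigenvector of Hermitian R."""
    v = [1 + 0j] * len(R)
    for _ in range(iters):
        w = matvec(R, v)
        norm = sum(abs(c) ** 2 for c in w) ** 0.5
        v = [c / norm for c in w]
    Rv = matvec(R, v)
    lam = sum(c.conjugate() * d for c, d in zip(v, Rv)).real  # Rayleigh quotient
    return lam, v


# Noisy observations x = a_k * s + noise, one active source in every bin.
bins = []
for _ in range(T):
    s = cgauss(1.0)
    bins.append([a * s + cgauss(NOISE) for a in a_k])

R = averaged_covariance(bins)
lam1, v = principal_eigenpair(R)
lam2 = (R[0][0] + R[1][1]).real - lam1  # 2 x 2 case: trace minus principal
dominance = lam1 / lam2

# Resolve the scaling ambiguity by normalising the first component to 1.
a_hat = [c / v[0] for c in v]
print("eigenvalue ratio:", dominance)
print("estimated mixing column (scaled):", a_hat)
```

Normalising by the first component is one possible convention for removing the unknown scale and phase before the estimates from different groups of bins are clustered.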

VOCAL Technologies offers custom designed solutions for BSS and beamforming with a robust voice activity detector, acoustic echo cancellation and noise suppression. Our custom implementations of such systems are meant to deliver optimum performance for your specific beamforming task. Contact us today to discuss your solution!