Estimating the angle of arrival (AoA) of acoustic signals is essential in acoustic beamforming, and there is consequently a substantial body of research on high-resolution AoA estimation algorithms. One such algorithm is the steered response power with phase transform (SRP-PHAT), which is valued for its robustness in reverberant environments. The drawback of SRP-PHAT is the computational cost of its grid-search component, which is akin to simulated annealing. We present a workaround for this computational burden by noting that our interest lies in the AoA and not in the location of the originating source. To be precise, the AoA estimation problem is as follows:

Consider a far-field acoustic signal impinging on a uniform linear array (ULA) of $M$ microphones with separation distance $d$ at an angle of $\theta^{\circ}$. The signal at microphone $i$, $x_i$, can be written as

$x_i(t) = h_i \ast s(t - \tau_i) + \nu_i(t), \quad i \in \{1, \ldots, M\}$

where $h_i$ is the channel impulse response, $\ast$ denotes convolution, $\tau_i$ is the delay at microphone $i$, $s(t)$ is the source signal, and $\nu_i(t)$ is a zero-mean ergodic noise process. The setup is shown in Figure 1.

Figure 1: ULA of $M$ microphones with pairwise distance $d$
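The signal model can be illustrated with a minimal simulation sketch. It makes several simplifying assumptions that are not part of the model above: the channel $h_i$ is a unit impulse, each delay is rounded to a whole number of samples, microphone 0 is the reference, and the speed of sound is taken as 343 m/s. The function name `simulate_ula` and its parameters are hypothetical.

```python
import math
import random


def simulate_ula(source, num_mics, spacing_m, theta_deg, fs_hz,
                 c=343.0, noise_std=0.0, seed=0):
    """Return (signals, delays) for a far-field source hitting a ULA.

    Assumptions (illustrative only): h_i is a unit impulse, delays are
    rounded to integer samples, and microphone 0 is the reference.
    """
    rng = random.Random(seed)
    sin_theta = math.sin(math.radians(theta_deg))
    # Far-field delay of microphone i relative to microphone 0, in samples.
    delays = [round(i * spacing_m * sin_theta * fs_hz / c)
              for i in range(num_mics)]
    n = len(source)
    signals = []
    for tau in delays:
        x = []
        for t in range(n):
            # x_i(t) = s(t - tau_i) + nu_i(t); samples outside the frame are 0.
            s = source[t - tau] if 0 <= t - tau < n else 0.0
            x.append(s + rng.gauss(0.0, noise_std))
        signals.append(x)
    return signals, delays
```

With a spacing chosen so that one inter-microphone hop at 30 degrees is about one sample (for example 0.04288 m at 16 kHz), the three channels are copies of the source shifted by 0, 1 and 2 samples.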

Conventionally, SRP-PHAT is computed in the frequency domain. However, the physical constraint on the microphone separations bounds the delay between any pair of microphones, so for each pair $\mathcal{O}(ND_{i,j})$ time-domain computations suffice, as opposed to $\mathcal{O}(\hat{N}\log_2{\hat{N}})$ with the fast Fourier transform approach. Here $N$ is the frame size, $\hat{N} = \min\{2^{n} : n \in \mathbb{Z},\ 2^{n} \ge N\}$ is the smallest power of two not less than $N$, and $D_{i,j} \ll N$ is the maximum sample delay between microphones $i$ and $j$. Thus, we leverage the time-domain analytic expression for SRP-PHAT

$\underset{\theta \in \left[-\frac{\pi}{2}, \frac{\pi}{2}\right]}{\mathrm{argmax}}~ P(\theta), \qquad P(\theta)=2 \pi \sum\limits_{i=1}^M \sum\limits_{j=1}^M R_{i,j} (\tau_i(\theta) -\tau_j(\theta) )$

where $R_{i,j} (\tau_i(\theta) -\tau_j(\theta) )$ is the cross-correlation between the data from microphones $i$ and $j$, with $i, j \in \{1, \ldots, M\}$, $i\neq j$, and $\tau_i(\theta) \in \mathbb{Z}~ \forall i$ is the integer delay corresponding to the look angle $\theta$. The main idea lies in the details of the look direction. The sampling frequency and the microphone separations restrict the look angles to a finite set, with each look angle corresponding to a set of delays across all microphones. Suppose the maximum delay, in samples, between the reference microphone and microphone $i$ is $D_i \in \mathbb{Z}$. Then the cardinality of the set of look directions, $|\theta|$, with each look direction being an $M$-tuple of delays, one for each microphone, is given by

$|\theta|=2 \prod\limits_{i=1}^{M-1} \left(D_{i+1} - D_i +1\right)$

where the factor of two accounts for positive and negative angles of arrival.
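The count can be checked with a short sketch that also enumerates the delay tuples. The enumeration is one possible reading of the product formula, not necessarily the authors' construction: it assumes the reference microphone has $D_1 = 0$, that the delay between consecutive microphones $i$ and $i+1$ takes any integer value from $0$ to $D_{i+1} - D_i$, and that the sign is applied to the whole tuple. The function names are hypothetical.

```python
from itertools import product
from math import prod


def count_look_directions(max_delays):
    """Evaluate 2 * prod(D[i+1] - D[i] + 1) for max_delays = [D_1, ..., D_M]."""
    return 2 * prod(b - a + 1 for a, b in zip(max_delays, max_delays[1:]))


def enumerate_look_directions(max_delays):
    """Yield M-tuples of integer delays relative to the reference microphone.

    Assumed construction: accumulate one increment per consecutive pair,
    once with positive sign and once with negative sign.
    """
    steps = [range(b - a + 1) for a, b in zip(max_delays, max_delays[1:])]
    for sign in (1, -1):
        for increments in product(*steps):
            delays = [0]
            for inc in increments:
                delays.append(delays[-1] + sign * inc)
            yield tuple(delays)
```

For instance, `count_look_directions([0, 2, 4])` evaluates $2 \cdot 3 \cdot 3 = 18$, and the generator yields the same number of tuples. Under this construction the all-zero tuple appears once for each sign, so a memoized table could drop one copy.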

For example, for a three-microphone uniform linear array in which the distance between consecutive microphones corresponds to $k$ delay samples, there are $2(k+1)^2$ look directions, each a 3-tuple of delays, which can be memoized if needed. The $M$-tuple corresponding to the angle that maximizes $P(\theta)$ is returned, with the AoA given by the least squares expression

$\hat{\theta} = \arcsin{\left(\frac{c}{F_s} \frac{\sum\limits_{i=1}^{M-1}\sum\limits_{j=i+1}^M (d_j - d_i)(\tau_j(\theta)-\tau_i(\theta))}{\sum\limits_{i=1}^{M-1}\sum\limits_{j=i+1}^M (d_j - d_i)^2}\right)}$

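with the corresponding time differences of arrival being $\tau_j(\theta) -\tau_i(\theta)$ for the $\theta$ that maximizes $P(\theta)$. Here $d_i$ is the position of microphone $i$ along the array, $c$ is the speed of sound, and $F_s$ is the sampling frequency.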
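The least squares expression translates directly into a few lines of code. The sketch below assumes microphone positions in metres along the array axis, delays in samples, and a speed of sound of 343 m/s. The clamp on the arcsine argument is an added safeguard against integer rounding of the delays and is not part of the formula. The function name `aoa_least_squares` is hypothetical.

```python
import math


def aoa_least_squares(positions_m, delays, fs_hz, c=343.0):
    """Least squares AoA in degrees from per-microphone integer delays."""
    num = 0.0
    den = 0.0
    m = len(positions_m)
    for i in range(m - 1):
        for j in range(i + 1, m):
            baseline = positions_m[j] - positions_m[i]   # d_j - d_i
            num += baseline * (delays[j] - delays[i])    # weighted TDOA
            den += baseline ** 2
    arg = (c / fs_hz) * num / den
    # Added safeguard: rounded delays can push |arg| slightly above 1.
    arg = max(-1.0, min(1.0, arg))
    return math.degrees(math.asin(arg))
```

As a usage example, positions `[0.0, 0.04288, 0.08576]` with delays `(0, 1, 2)` at 16 kHz give an estimate within a small fraction of a degree of 30 degrees, and negating the delays flips the sign of the estimate.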

The computational complexity for an $M$-microphone ULA with a frame size of $N$ is upper bounded by

$\mathcal{O}\left(M^2 N D_M + \prod\limits_{i=1}^{M-1} \left(D_{i+1} - D_i +1\right) \right)$
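The two terms of the bound correspond to two stages, which the following sketch makes explicit: first tabulate every pairwise cross-correlation for lags up to the largest delay, then score each candidate delay tuple by table lookup. This is an illustration under stated assumptions rather than a reference implementation: it uses plain, unweighted time-domain cross-correlation (any PHAT whitening is assumed to have been applied to the inputs beforehand), it omits the constant factor $2\pi$ since that does not change the maximizer, and the names `correlation_table` and `srp_search` are hypothetical.

```python
def correlation_table(signals, max_lag):
    """table[i][j][lag + max_lag] = sum_t x_i[t] * x_j[t - lag], |lag| <= max_lag.

    Cost is on the order of M^2 * N * max_lag multiply-adds.
    """
    m = len(signals)
    n = len(signals[0])
    table = [[None] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            row = []
            for lag in range(-max_lag, max_lag + 1):
                # Keep both t and t - lag inside the frame [0, n).
                lo, hi = max(0, lag), min(n, n + lag)
                row.append(sum(signals[i][t] * signals[j][t - lag]
                               for t in range(lo, hi)))
            table[i][j] = row
    return table


def srp_search(signals, candidates, max_lag):
    """Return (best_delays, best_power) over the candidate delay tuples.

    max_lag must be at least the largest |delays[i] - delays[j]| that
    occurs in any candidate.
    """
    table = correlation_table(signals, max_lag)
    m = len(signals)
    best, best_power = None, float("-inf")
    for delays in candidates:
        # Sum R_ij(tau_i - tau_j) over all ordered pairs with i != j.
        power = sum(table[i][j][delays[i] - delays[j] + max_lag]
                    for i in range(m) for j in range(m) if i != j)
        if power > best_power:
            best, best_power = delays, power
    return best, best_power
```

Because the table is built once per frame, the search stage never touches the raw samples again. The candidate list can itself be precomputed and reused across frames.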

VOCAL Technologies offers custom-designed AoA estimation solutions for beamforming with a robust voice activity detector. Our custom implementations of such systems are meant to deliver optimum performance for your specific beamforming task. Contact us today to discuss your solution!