Voice activity detectors (VADs) based on received signal energy are widely employed in broadband acoustic systems. They have a potential drawback in scenarios with high-energy ambient noise. Cross-correlation-based VADs also have a drawback: because the microphones are close to each other, their noise samples are correlated to some degree. An alternative approach is to use a combination of eigenvalues and coherence. The case of two microphones is discussed below.
Consider a far-field acoustic signal impinging on two microphones, separated by a distance $d$, at an angle of $\theta^{\circ}$. The signal at microphone $i$, $x_i$, can be written as $x_i(t) = s(t - \tau_i) + \nu_i(t-\hat{\tau}_i), \quad i \in \{1, 2\}$

where $\tau_i = \frac{d}{c} \sin{\theta}$ is the delay of the desired signal at microphone $i$ (if present), $\hat{\tau}_i = \frac{d}{c} \sin{\beta}$ is the delay of the noise signal at microphone $i$, with expectation $\mathrm{E}[\beta] = 0$, $s(t)$ is the source signal, $\nu_i(t)$ is the noise and $c$ is the speed of sound.
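
As a rough illustration of the signal model above, the following sketch simulates the two microphone signals with NumPy. It rests on several assumptions that are not in the text: microphone 1 is taken as the zero-delay reference, the noise at each microphone is independent white Gaussian noise (the noise delay $\hat{\tau}_i$ is not modelled), $c = 343$ m/s, and the delay is applied as a circular phase shift in the frequency domain. The function names are hypothetical.

```python
import numpy as np

def tdoa(d, theta_deg, c=343.0):
    """Inter-microphone delay (seconds): tau = (d / c) * sin(theta)."""
    return d / c * np.sin(np.deg2rad(theta_deg))

def fractional_delay(x, tau, fs):
    """Delay x by tau seconds using a linear phase shift (circular delay)."""
    n = len(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum = np.fft.rfft(x)
    return np.fft.irfft(spectrum * np.exp(-2j * np.pi * freqs * tau), n=n)

def simulate_two_mics(s, fs, d, theta_deg, noise_std=0.1, c=343.0, seed=0):
    """Return (x1, x2, tau) for the model x_i(t) = s(t - tau_i) + nu_i(t).

    Assumption: mic 1 is the reference (tau_1 = 0) and the noise terms
    are independent white Gaussian sequences.
    """
    rng = np.random.default_rng(seed)
    tau = tdoa(d, theta_deg, c)
    x1 = s + noise_std * rng.standard_normal(len(s))
    x2 = fractional_delay(s, tau, fs) + noise_std * rng.standard_normal(len(s))
    return x1, x2, tau
```

For example, with $d = 0.1$ m and $\theta = 30^{\circ}$ the delay is about $0.146$ ms, which is a little over two samples at a 16 kHz sampling rate.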

Both $s(t)$ and $\nu_i(t)$ are zero-mean ergodic processes. The decision problem is whether a frame contains the signal or only noise.

We use the imaginary part of the coherence, $i\Gamma_{x_1,x_2}$, given by $i\Gamma_{x_1,x_2} = \alpha_o(\omega) (\alpha_1(\omega) \sin{(\omega\tau)}+\sin{(\omega \hat{\tau})})$

and note that for a pure noise signal, $i\Gamma_{x_1,x_2} \approx 0$. Further, the largest eigenvalue of the frame covariance matrix will be orders of magnitude larger than the smallest eigenvalue. The largest eigenvalue corresponds to the speech signal, if present. Denote the eigenvalues as $\lambda_{max}$ and $\lambda_{min}$. We form a metric $M_{x_1,x_2} = \lambda_{max} i\Gamma_{x_1,x_2}$

and compare the metric to the noise floor, which is a function of the $\lambda_{min}$ values of previous noise-only frames. A sample of the performance of this VAD is shown in the figure below.
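
The procedure described above can be sketched roughly as follows. This is only one possible reading, not a definitive implementation, and several choices are assumptions that the text does not specify: the imaginary coherence is estimated by averaging over short Hann-windowed sub-frames, it is reduced to a scalar by averaging its magnitude over all frequency bins, the noise floor is a scaled mean of recent $\lambda_{min}$ values, and the very first frame is assumed to be noise only. The names `EigenCoherenceVAD`, `scale`, `history` and `nfft` are hypothetical.

```python
from collections import deque

import numpy as np

def imag_coherence(x1, x2, nfft=64):
    """Imaginary part of the complex coherence, averaged over sub-frames."""
    hop = nfft // 2
    win = np.hanning(nfft)
    starts = range(0, len(x1) - nfft + 1, hop)
    X1 = np.array([np.fft.rfft(win * x1[i:i + nfft]) for i in starts])
    X2 = np.array([np.fft.rfft(win * x2[i:i + nfft]) for i in starts])
    p12 = np.mean(X1 * np.conj(X2), axis=0)      # cross-spectrum
    p11 = np.mean(np.abs(X1) ** 2, axis=0)       # auto-spectra
    p22 = np.mean(np.abs(X2) ** 2, axis=0)
    return np.imag(p12 / np.sqrt(p11 * p22 + 1e-20))

def frame_features(x1, x2, nfft=64):
    """Return (lambda_max, lambda_min, scalar imaginary coherence)."""
    cov = np.cov(np.vstack([x1, x2]))            # 2x2 frame covariance
    lam_min, lam_max = np.linalg.eigvalsh(cov)   # eigenvalues, ascending
    # Assumption: summarise the coherence by its mean magnitude over bins.
    gamma_im = np.mean(np.abs(imag_coherence(x1, x2, nfft)))
    return lam_max, lam_min, gamma_im

class EigenCoherenceVAD:
    def __init__(self, scale=3.0, history=20):
        self.scale = scale                       # assumed safety margin
        self.lam_min_history = deque(maxlen=history)

    def is_speech(self, x1, x2):
        lam_max, lam_min, gamma_im = frame_features(x1, x2)
        metric = lam_max * gamma_im
        if not self.lam_min_history:
            # Assumption: the first frame is noise only and seeds the floor.
            self.lam_min_history.append(lam_min)
            return False
        floor = self.scale * np.mean(list(self.lam_min_history))
        speech = bool(metric > floor)
        if not speech:
            # Only noise-only frames update the noise floor.
            self.lam_min_history.append(lam_min)
        return speech
```

A detector built this way would be called once per frame, for instance `vad.is_speech(x1_frame, x2_frame)` with frames of a few hundred samples. Note that for a source at broadside ($\theta = 0$) the imaginary coherence is close to zero, so the metric stays small regardless of the signal level.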