Menu

Early reflections are the echoes of a signal that arrive at the microphone within a stretch of about 30ms after the direct sound. Early reflections are direct copies of the direct sound source, rather than diffuse mixtures as are present in the late reflections, or reverberation, or a sound source. They are often visualized as shown in the figure.

In the room impulse response (RIR), the early reflections have a distinct character when compared to the other components. In particular, early reflections are represented by stronger coefficients than late reflections with more space between them. An illustration is shown below:

Early reflections add color to the sound source, and are critical for our understanding of distance in a 3D sound field. A signal with strong early reflections will still sound distant, warm, and generally “roomy” despite the complete elimination of late reflection. Therefore, it is imperative that any dereverberation algorithm focusing on voice quality enhancement deal with these distinct early reflections separately.

Early reflections show up clearly in the autocorrelation of a speech signal, while they are all but hidden in the spectrogram. The autocorrelation measures how similar a signal is to a time shifted version of itself. Since a speech signal with early reflections contains the direct sound source as well as slightly delayed copies as shown in Figure 1, we will see artificially high sidelobes in the autocorrelation. Consider the autocorrelations of a clean and reverberant stationary voiced segment:

Figure 3: Autocorrelation Difference

As the figure shows, higher side lobes are exactly what we see. Since early reflections are just that, early, the effect becomes less pronounced as time lag is increased. The first sidelobe is the most affected, as expected, while this particular sample has a relatively strong reflection in the third sidelobe. This translates into reverberant autocorrelations having higher energy than clean ones, regardless of the presence of noise.

In a dereverberation algorithm for voice quality enhancement, it is important that early reflections are handled. Here we will discuss the three most widely used methods for dealing with early reflections and thereby reducing coloration and the perception of distance.

Autocorrelation shaping [2] is a method that adaptively modifies the autocorrelation of reverberant speech with early reflections to mold it into something more like clean speech without early reflections. For a given microphone, the autocorrelation of a stationary reverberant speech sample is computed as:

Where our block processing notation allows us to use the inner product (•|•) for linear convolution. The correlation shaper will take the form of a filter such that the shaped time domain output will take the form:

With the autocorrelation of the desired output denoted as *R _{dd}*, we compute the error signal to minimize:

Where W[*τ*] is a weight to emphasize the importance of the first few peaks which have the greatest distortion due to early reflections. We then obtain the iterative update equation:

Where *l* represents the update step, and *n* represents discrete time. The gradient ∇ is given by:

Where *R _{xy}* denotes the cross correlation between the reverberant speech and the updating

output. We want to only select lags in a set

Early reflection cancellation via kurtosis maximization was developed in [3]. Its low computational complexity and easy to implement algorithm make it one of the most attractive options for dealing with early reflections. It works by adaptively computing filter taps that maximize the kurtosis of the incoming frame’s linear prediction residual. By assuring quasi-stationarity, this filter can be directly applied to the incoming time domain frame thus avoiding the linear prediction synthesis filter. A block diagram is shown below:

Figure 4: Kurtosis Maximization [3]

We can create the linear prediction analysis filter *A(z)* via Leroux-Gueguen Lattice Filter Recursion as explained in another article on our website. We are building an adaptive filter, therefore we need a reward function to converge towards. As expected, this function is the kurtosis, given by:

Where is the output of the adaptive filter applied to the linear prediction analysis filter output , and represents the sample average. For long term adaptation, these averages should come from an exponential smoother. Since , we can take the approximate gradient with respect to via:

The update equation for the kurtosis maximization filter is then given by:

To speed up convergence, we typically use a frequency domain implementation of the above discussion. The block diagram remains the same, except each vector is now in the frequency domain. To illustrate how easily this isomorphism is implemented, we consider the frequency domain update equation:

Where m represents the short time frame index, and k represents the discrete frequency bin index. [3] reports that good results were obtained with *μ = 0.0004* and *H[0, : ] = [1,…,0] ^{Τ}* . Also in [3], the authors used a Modulated Complex Lapped Transform in place of an FFT. As [4] illustrates, the the MCLT can be implemented by first performing an FFT and then performing a simple subband weighting. Actually, any invertible Gabor transform will work, for instance the Invertible Constant Q, Wavelet Transforms, and the Bark and Mel Transforms.

Early reflections being picked up by a microphone are the principle cause of the perception of distance in an audio signal. Early reflections are modeled as sparse high energy impluses in the room impulse response, and have been shown to appear as extra energy in the first few peaks of the signals autocorrelation. Two methods for early reflection cancellation are autocorrelation shaping, and kurtosis maximization, which can both be implemented adaptively. When combined with a method for dealing with late reflections, a complete dereverberation method is created.

[1] A. Oetzmann, D. Mazzoni, Reverb.” Internet: http://audacity.sourceforge.net//manual – 1.2//effects reverb.html,” Jan.28, 2005 [May. 3, 2013].

[2] B.W. Gillespie, L. Atlas, “Strategies for improving audible quality and speech recognition accuracy of reverberant speech,” Proc. IEEE ICASSP, 2003

[3] B.W. Gillespie, H.S. Malvar, and D.A.F. Florencio, Speech dereverberation via maximum-kurtosis subband adaptive filtering, in Proc. Int. Conf. Acoust., Speech, Signal Processing, vol. 6, pp. 3701-3704, 2001.

[4] H.S. Malvar, Fast Algorithm for the Modulated Complex Lapped Transform, Microsoft Corp., Redmond, WA, MSR-TR-2005-2 , 2005.

VOCAL Technologies, Ltd.

520 Lee Entrance, Suite 202

Buffalo, NY 14228

Phone: +1 716-688-4675

Fax: +1 716-639-0713

Email: [email protected]

© 2022 VOCAL Technologies