Automatic Speech Recognition (ASR) has penetrated everyday life, and although the limitations of the technology are well recognized, the progress observed in this area over the last several years is unquestionable (cf. Refs. [1, 2]).
Many aspects of ASR are being pursued vigorously, and one of them is Distant Speech Recognition (DSR). DSR is considered a part of Automatic Speech Recognition (although the word “Automatic” was dropped from the new term), with the emphasis on “Distant”. In essence, DSR is a variant of ASR performed under severe acoustic and environmental conditions that degrade the quality of the speech signal (or signals) observed at the input port(s) of the ASR engine (or the DSR engine).
Successful DSR would offer the most natural human-computer interface: instead of relying on intrusive microphones mounted on the speaker’s body (such as head-mounted microphones), the user would be free of body-mounted devices while still being able to conduct a dialog with a machine at the same speech recognition accuracy as with microphones located very close to the speaker’s mouth.
Distant speech is polluted with uncorrelated noise (or, in general, uncorrelated distractors) and with correlated undesired signals, such as reflections of the desired signal: discrete early reflections and late reflections that produce a reverberation effect. In addition, distant speech is often affected by acoustic wave effects such as diffraction, particularly when the line of sight between the sound source (the speaker’s mouth) and the sensor (i.e., the microphone) passes through physical obstacles. This typically causes linear distortions at higher frequencies (i.e., essentially a muffling effect), Ref.
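The distinction between correlated and uncorrelated distractors can be illustrated with a minimal sketch. It assumes a deliberately simplified model that is not taken from the references: each reflection is a delayed, attenuated copy of the clean signal, and the noise is Gaussian and independent of the speech. The function name, the delays, and the gains are hypothetical, and diffraction effects are not modeled.

```python
import random


def simulate_distant_signal(clean, reflections, noise_std, seed=0):
    """Return the clean signal plus delayed copies and additive noise.

    clean       -- list of float samples of the desired (close-talk) signal
    reflections -- list of (delay_in_samples, gain) pairs; illustrative values
    noise_std   -- standard deviation of the uncorrelated Gaussian noise
    """
    rng = random.Random(seed)
    observed = list(clean)  # direct path from the mouth to the microphone
    # Correlated distractors: every reflection is a scaled, delayed copy
    # of the desired signal itself.
    for delay, gain in reflections:
        for n in range(delay, len(clean)):
            observed[n] += gain * clean[n - delay]
    # Uncorrelated distractor: noise drawn independently of the speech.
    return [x + rng.gauss(0.0, noise_std) for x in observed]


# A unit impulse makes the reflection pattern directly visible.
impulse = [1.0] + [0.0] * 7
print(simulate_distant_signal(impulse, [(2, 0.5), (5, 0.25)], noise_std=0.0))
```

With the noise switched off, the printed sequence is the toy room response itself: the direct path followed by the two attenuated reflections. A dense tail of many such late copies is what is perceived as reverberation.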
As such, developing applications that recognize distant speech robustly remains a challenge. Part of this challenge is reliable beamforming for direction-of-arrival (DOA) estimation (cf. Refs. [4, 5]). In this note we highlight some aspects of DSR that appear promising from the viewpoint of improving Word Error Rate (WER). Recent work on acoustic beamforming (as reported in Ref., among other things) indicates that, starting from a WER of 14% when a single microphone is used, a state-of-the-art DSR system (or rather its acoustic front end, with the ASR engine’s functionality remaining unchanged) can achieve a WER of 5% (both figures rounded to the nearest integer), which is comparable to the WER of 4% obtained with a lapel microphone (ibid.).
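For readers unfamiliar with the metric, WER is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the scoring procedure used in the cited work; the function names and the example sentences are hypothetical. It also puts the figures quoted above in relative terms: going from 14% to 5% removes roughly 64% of the errors.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline, improved):
    """Fraction of the baseline errors removed by the improved system."""
    return (baseline - improved) / baseline


# One deleted word and one substituted word out of four reference words.
print(word_error_rate("turn the lights on", "turn lights off"))
# Single microphone (14%) versus the beamforming front end (5%).
print(round(relative_wer_reduction(0.14, 0.05), 3))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.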
Figure 1 depicts a block diagram of a typical DSR system. The system consists of several typical blocks: (a) microphone array, (b) speaker tracker (or beamsteering), (c) beamformer (BF), (d) postfilter, and (e) speech recognizer. Once the speaker tracker has completed the position estimate, the BF emphasizes sound waves arriving from the direction of interest, or the “look direction”. The beamformed signal is further enhanced by the postfilter, and the final enhanced signal is fed to the speech recognizer (which is typically a kind of ASR engine).
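How a beamformer emphasizes the look direction can be sketched with the simplest textbook design, delay-and-sum. This is an assumption chosen for illustration only; it is not the adaptive beamformer discussed in this note or in the references. The function name is hypothetical, the steering delays are taken to be whole numbers of samples, and they are assumed to come from the speaker tracker.

```python
def delay_and_sum(channels, delays):
    """Time-align each microphone channel and average the aligned copies.

    channels -- list of equal-length sample lists, one per microphone
    delays   -- integer arrival delay (in samples) of the look-direction
                wavefront at each microphone, relative to the earliest one
    """
    if len(channels) != len(delays):
        raise ValueError("one steering delay is required per channel")
    length = len(channels[0])
    if any(len(samples) != length for samples in channels):
        raise ValueError("all channels must have the same length")
    output = [0.0] * length
    for samples, delay in zip(channels, delays):
        for n in range(length - delay):
            # Advance the channel by its delay so that sound from the look
            # direction lines up across microphones before summation.
            output[n] += samples[n + delay]
    return [value / len(channels) for value in output]


# The same pulse reaches three microphones 0, 1 and 2 samples late.
mics = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
]
print(delay_and_sum(mics, [0, 1, 2]))  # steered at the source
print(delay_and_sum(mics, [0, 0, 0]))  # steered elsewhere
```

When the steering delays match the true arrival delays, the pulse adds coherently and keeps its full amplitude. With mismatched delays the copies are smeared over several samples and the peak drops, which is the sense in which sound from outside the look direction is attenuated.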
Other aspects of beamforming, such as speaker tracking/beamsteering and sidelobe cancellation, are described in VOCAL documents posted at the VOCAL portal (e.g., Ref.).
Vocal Technologies Ltd has developed adaptive beamforming for noise reduction. This solution can be used to enhance ASR and DSR. Contact us to discuss your speech application with our engineering staff.
- ON AUTOMATIC SPEECH RECOGNITION AND VOICE ENHANCEMENT
- Microphone Arrays: Signal Processing Techniques and Applications, M. Brandstein and D. Ward (editors), Springer-Verlag, 2001; (Signal and Communication Technology series), Patrick A. Naylor and Nikolay D. Gaubitch (editors), Springer-Verlag London Limited, 2010.
- Modeling Acoustics in Virtual Environments Using the Uniform Theory of Diffraction, Tsingos, N., et al., Proceedings of ACM SIGGRAPH 2001, August 2001.
- Microphone Array Signal Processing, J. Benesty et al., Springer Verlag, 2008.
- Beamforming for Direction-of-Arrival (DOA) Estimation - A Survey, Krishnaveni, V., et al., Int. Journ. of Comp. Appl., Vol. 61, No. 11, January 2013.
- Microphone Array Processing for Distant Speech Recognition (from close-talking microphones to far-field sensors), Kumatani, K., et al., IEEE SP Mag., Nov. 2012; pp.127-140.