Replay attacks on Automatic Speaker Verification (ASV) systems occur when the speech of an authorized user is covertly recorded and then played back by an impostor to fool the authentication system. Unlike impersonation, speech synthesis, and voice conversion attacks, a replay attack requires no specific knowledge of speech processing.

One assumption that can often be made about ASV systems and replay attacks is that the genuine speech source is physically closer to the ASV device than the covert recorder is to the user. Therefore, detecting whether the captured speech was recorded in the far field and/or played out of a loudspeaker is key to determining whether the speech is valid.

The distance effect states that the direct-path sound pressure level (SPL) decreases by about 6 dB with each doubling of the distance from the source. The reverberation level is largely independent of distance, so as the distance increases the direct-path SPL decreases while the reverberation level remains roughly constant. Thus, a far-field recording will have a low SNR and a low direct-to-reverberation ratio (DRR), resulting in a flatter frequency response and higher local minima of the spectral envelope. Compared with natural human speech, audio played through smartphone loudspeakers has much lower signal energy between 100 and 300 Hz.[1] A classifier that evaluates the spectral flatness, the low-frequency ratio, and the modulation index of the spectral envelope of the signal under test can improve the resilience of an ASV system to replay attacks.
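As a rough illustration of the quantities described above, the following Python sketch computes the free-field direct-path level drop and two of the named features, spectral flatness and the low-frequency ratio. It is a minimal sketch under stated assumptions, not a description of any particular product or of the method in [1]: the function names, the Hann-windowed single-frame spectrum, and the definition of the low-frequency ratio as the fraction of total energy in the 100 to 300 Hz band are all illustrative choices. The modulation index of the spectral envelope is left out because the text does not define it.

```python
import numpy as np

def direct_path_drop_db(d_ref, d):
    """Drop in direct-path level (dB) when moving from d_ref to d.

    Assumes free-field (inverse-square law) propagation, which gives
    about 6 dB per doubling of distance.
    """
    return 20.0 * np.log10(d / d_ref)

def power_spectrum(x, fs):
    """One-sided power spectrum of a single Hann-windowed frame."""
    x = np.asarray(x, dtype=float)
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs, spec

def spectral_flatness(spec, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum.

    Values near 1 indicate a flat, noise-like spectrum; values near 0
    indicate a peaky spectrum.
    """
    spec = np.asarray(spec, dtype=float) + eps
    return float(np.exp(np.mean(np.log(spec))) / np.mean(spec))

def low_frequency_ratio(freqs, spec, band=(100.0, 300.0), eps=1e-12):
    """Fraction of total energy inside `band` (an assumed definition)."""
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return float(spec[mask].sum() / (spec.sum() + eps))

if __name__ == "__main__":
    fs = 16000  # assumed sample rate in Hz
    t = np.arange(fs) / fs
    tone = np.sin(2 * np.pi * 200.0 * t)                  # energy at 200 Hz only
    noise = np.random.default_rng(0).standard_normal(fs)  # flat spectrum

    print("drop from 1 m to 2 m (dB):", direct_path_drop_db(1.0, 2.0))
    for name, sig in (("tone", tone), ("noise", noise)):
        freqs, spec = power_spectrum(sig, fs)
        print(name,
              "flatness:", spectral_flatness(spec),
              "LF ratio:", low_frequency_ratio(freqs, spec))
```

In a practical detector these features would typically be computed per frame over voiced speech and passed to a trained classifier; any decision thresholds would have to be learned from labeled genuine and replayed recordings rather than fixed by hand.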

VOCAL Technologies has been doing speech signal processing for over 35 years. Spoofing detection software can be used as a front end to an ASV system to help detect when ASV spoofing attacks occur.

More Information