Detecting synthesized speech is an important capability for Automatic Speaker Verification (ASV) systems. As discussed in Speaker Verification Spoofing Attacks, Text-to-Speech (TTS) speech synthesis is one type of malicious spoofing attack on speech-based biometric authentication. The challenge these attacks pose, compared with other spoofing methods, is that speech synthesis is a highly active area of research, and the number of effective methods for generating artificial speech keeps growing. Modern deepfake audio can deceive trained human listeners, so sophisticated detection algorithms are required to protect against these attacks. [1]

To detect synthesized speech from a variety of impostor sources, both short-term and long-term speech features need to be extracted from the time-domain audio signal. Fake speech often fails to recreate the time variance of natural speech: the changing physiological and emotional state of a genuine talker creates slight modifications to the speech features. In addition, quantization in the models that generate the spoofing signal can cause unnatural jumps in the speech features that are imperceptible to human listeners but detectable by machines. [2]
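As a rough illustration of combining short-term and long-term analysis, the sketch below frames a time-domain signal, computes one short-term feature per frame, and then summarizes how that feature varies over the whole utterance. It is a minimal sketch under stated assumptions, not a detector: log energy stands in for the richer speech features, and the function names, the 400-sample frame, and the 160-sample hop (25 ms and 10 ms at an assumed 16 kHz sampling rate) are all illustrative choices.

```python
import math


def frame_signal(samples, frame_len=400, hop=160):
    """Split a time-domain signal into overlapping short-term frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def log_energy(frame):
    """Short-term feature: log of the mean squared amplitude of one frame."""
    power = sum(x * x for x in frame) / len(frame)
    return math.log(power + 1e-12)  # small offset avoids log(0) on silence


def trajectory_stats(samples, frame_len=400, hop=160):
    """Long-term statistics of the short-term feature trajectory."""
    track = [log_energy(f) for f in frame_signal(samples, frame_len, hop)]
    mean = sum(track) / len(track)
    variance = sum((v - mean) ** 2 for v in track) / len(track)
    # Largest frame-to-frame change; an abrupt jump shows up here.
    deltas = [abs(b - a) for a, b in zip(track, track[1:])]
    return {
        "frames": len(track),
        "variance": variance,
        "max_jump": max(deltas) if deltas else 0.0,
    }


if __name__ == "__main__":
    # Synthetic test tone (assumed 16 kHz), with an abrupt level change halfway.
    tone = [math.sin(2 * math.pi * 200 * n / 16000) for n in range(16000)]
    stepped = [(1.0 if n < 8000 else 0.1) * s for n, s in enumerate(tone)]
    print(trajectory_stats(tone))
    print(trajectory_stats(stepped))
```

The variance captures how much the feature moves over the utterance, while the maximum frame-to-frame jump flags abrupt changes. In practice the same summary would be applied to each extracted speech feature rather than to energy alone.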

Examples of features that can be extracted include bicoherence, Mel-frequency cepstral coefficient (MFCC) variance, fundamental frequency (F0) statistics, group delay, spectrum modulation, and residual log magnitude spectrum. These features are evaluated on an intra-frame basis and can be fed to a spoofing classifier, which can be trained to determine the validity of the signal.
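The feature-to-classifier flow described above can be sketched as follows. This is only an illustration under several assumptions: the per-frame MFCCs and F0 estimates are taken as already computed by some front end, only two of the listed features (MFCC variance and F0 statistics) are used, and a small logistic regression stands in for the spoofing classifier, whose actual type the text does not specify. All function names are hypothetical, and the training data is synthetic toy data generated purely to exercise the code.

```python
import math
import random


def utterance_features(mfcc_frames, f0_track):
    """Summarize per-frame features as [MFCC variances..., F0 mean, F0 std].

    mfcc_frames: list of frames, each a list of MFCC values (assumed precomputed).
    f0_track: per-frame F0 estimates in Hz, with 0 marking unvoiced frames.
    """
    n, dims = len(mfcc_frames), len(mfcc_frames[0])
    means = [sum(f[d] for f in mfcc_frames) / n for d in range(dims)]
    mfcc_var = [sum((f[d] - means[d]) ** 2 for f in mfcc_frames) / n
                for d in range(dims)]
    voiced = [f for f in f0_track if f > 0]
    if not voiced:
        return mfcc_var + [0.0, 0.0]
    f0_mean = sum(voiced) / len(voiced)
    f0_std = math.sqrt(sum((f - f0_mean) ** 2 for f in voiced) / len(voiced))
    return mfcc_var + [f0_mean, f0_std]


def _sigmoid(score):
    score = max(-30.0, min(30.0, score))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-score))


def train_logistic(X, y, epochs=200, lr=0.1):
    """Fit a logistic regression by SGD; label 1 = genuine, 0 = spoofed."""
    dims = len(X[0])
    mu = [sum(r[d] for r in X) / len(X) for d in range(dims)]
    sd = [math.sqrt(sum((r[d] - mu[d]) ** 2 for r in X) / len(X)) or 1.0
          for d in range(dims)]

    def scale(row):
        # Standardize so features on different scales (Hz vs. variance) mix.
        return [(row[d] - mu[d]) / sd[d] for d in range(dims)]

    Z = [scale(r) for r in X]
    w, b = [0.0] * dims, 0.0
    for _ in range(epochs):
        for z, label in zip(Z, y):
            err = _sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) - label
            w = [wi - lr * err * zi for wi, zi in zip(w, z)]
            b -= lr * err

    def predict(row):
        """Return the estimated probability that the utterance is genuine."""
        return _sigmoid(sum(wi * zi for wi, zi in zip(w, scale(row))) + b)

    return predict


def toy_utterance(rng, spread, frames=50, dims=3):
    """Synthetic stand-in for real features; spread sets the time variance."""
    mfcc = [[rng.gauss(0.0, spread) for _ in range(dims)] for _ in range(frames)]
    f0 = [120.0 + rng.gauss(0.0, 10.0 * spread) for _ in range(frames)]
    return utterance_features(mfcc, f0)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy assumption: genuine speech varies more over time than spoofed speech.
    X = ([toy_utterance(rng, 1.0) for _ in range(20)]
         + [toy_utterance(rng, 0.3) for _ in range(20)])
    y = [1] * 20 + [0] * 20
    predict = train_logistic(X, y)
    print("genuine-like:", round(predict(toy_utterance(rng, 1.0)), 3))
    print("spoof-like:", round(predict(toy_utterance(rng, 0.3)), 3))
```

A real system would train on labeled genuine and spoofed recordings and would likely use a more capable classifier. The point of the sketch is only the shape of the pipeline: per-frame features, utterance-level statistics, then a trained decision.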

Figure: Speech spoof classifier

VOCAL Technologies has been doing speech signal processing for over 35 years. Spoofing detection software can be used as a front end to an ASV system to help detect when ASV spoofing attacks occur.

More Information