When you call: fake!
Today, just a few seconds of recorded speech are enough to clone a voice and make it say anything you want. The rapid progress of voice cloning technology is likely to have serious consequences for the economy and society. The most important antidotes: education, a trained ear, and even better trained detection systems.

An audio clip circulating on the X platform, in which British Prime Minister Keir Starmer allegedly confessed to defrauding his electorate, was viewed 1.4 million times. The carmaker Ferrari narrowly escaped a deepfake scam last summer: a prudent executive reacted cleverly to a suspicious call from what sounded deceptively like the company's CEO by asking a question that only the real boss could answer.
Cases of disinformation, fraud and industrial espionage using fake voices are on the rise worldwide. According to the Entrust Cybersecurity Institute's Identity Fraud Report, a deepfake fraud attempt occurred every five minutes in 2024. The security provider Signicat registered a 2,137 percent increase in such attacks on European banks, insurers and payment providers within three years. At the same time, AI-generated speech brings not only risks but also opportunities: reconstructing the voices of people with speech impairments, new approaches to film dubbing, or even digitally preserving the voices of the deceased.
One thing is certain: deepfake technologies will increasingly change our media reality. This is the conclusion of the study «Deepfakes and manipulated realities» by the Fraunhofer Institute for Systems and Innovation Research ISI. The authors recommend that, in addition to government efforts to regulate platforms, the personal responsibility of each individual be strengthened through appropriate educational programs. By upholding high journalistic standards, the media can help the public recognize manipulation and stay informed. Companies and organizations, in turn, should prepare for the growing spread of deepfakes with internal risk assessments as well as preventive and reactive measures.
Human vs. AI: Who is better at recognizing counterfeits?
In contrast to the still quite complex creation of deepfake videos, high-quality audio content can be manipulated with comparatively little effort. At the same time, audio fakes are harder to identify because there are no visual clues. How good are people at spotting such manipulated audio tracks? Dr. Nicolas Müller from the Fraunhofer Institute for Applied and Integrated Security AISEC investigated this in an experiment, pitting 472 participants against an AI algorithm in a game of distinguishing real from fake audio samples. Both the humans and the AI heard an audio track and had to decide whether it was a real voice or a deepfake.
«Without training, people fall for one in three fake voices.»
The result, after almost 15,000 files had been listened to: «Humans recognize around two thirds of counterfeits without training, but can work their way up to 80 percent with a little practice,» says the researcher. «The AI's success rate is well over 95 percent, depending on the level of difficulty.» The game also yielded other valuable findings: older people are more likely to be fooled by deepfakes than younger people, and native speakers have a clear advantage over non-native speakers, whereas IT professionals perform no better than laypeople. «These findings can be helpful in developing more effective cybersecurity training programs and improving detection algorithms,» explains Müller. Because practice is such an important factor in recognizing AI-generated audio fakes, he and his team have published the interactive game «Spot the Deepfake» on their platform Deepfake Total, making it accessible to everyone.
Audio fake detection: diversity wins
Nicolas Müller and his team developed the Deepfake Total platform as a public detection tool for audio fakes. Anyone can upload suspicious audio tracks there and have them analyzed by an AI. Unlike commercial detection tools on the market, the Fraunhofer platform is free of charge and hosted in Germany. The researchers train their AI model on both public and self-generated datasets containing examples of genuine and fake audio tracks. The reliability of the detection depends on the quality of this training data. The aim is not only to collect as much data as possible, but also to combine it cleverly and process it in a balanced way so that the model picks up no undesirable learning effects.
«The only distinguishing feature in a good training dataset should be whether the audio track is genuine or fake,» explains Müller. «It is therefore important to prevent the AI from learning, for example, that men's voices are more often fake than women's, or from separating the datasets by background noise, accent, length or volume.» Because the data comes from such different sources, this is not easy. «You have to understand what individual pieces of information these audio tracks contain and then arrange them so that the irrelevant characteristics are as balanced as possible. While it is relatively easy to analyze which part of an image an AI uses to classify a video, this is somewhat more difficult with audio.»
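What such balancing can look like in practice is sketched below. This is a minimal illustration of the principle Müller describes, not the Fraunhofer AISEC pipeline; the metadata file and its columns (label, gender) are hypothetical placeholders.

```python
# Minimal sketch of attribute balancing in a training set, so that the only
# signal correlated with the label is genuine vs. fake. Not the Fraunhofer
# AISEC pipeline; "train_metadata.csv" and its columns are hypothetical.
import pandas as pd

def balance_attribute(df: pd.DataFrame, label: str, attr: str,
                      seed: int = 0) -> pd.DataFrame:
    """Downsample so every attribute value is equally frequent per label.

    Afterwards, knowing e.g. a speaker's gender tells the model nothing
    about whether a clip is genuine or fake.
    """
    n = df.groupby([label, attr]).size().min()  # smallest cell decides
    parts = [group.sample(n=n, random_state=seed)
             for _, group in df.groupby([label, attr])]
    return pd.concat(parts).reset_index(drop=True)

meta = pd.read_csv("train_metadata.csv")  # e.g. columns: path, label, gender
balanced = balance_attribute(meta, label="label", attr="gender")

# Sanity check: the attribute distribution now matches across both labels.
print(pd.crosstab(balanced["label"], balanced["gender"]))
```

In a real pipeline, the same balancing would have to be applied jointly across several attributes at once (accent, background noise, clip length, volume), which is exactly what makes the curation work Müller describes so laborious.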
With the Multi-Language Audio Anti-Spoofing Dataset (MLAAD), the researchers at Fraunhofer AISEC are continuously developing such a dataset. It forms the basis for training their AI detection model, but is also publicly available to the research community. The challenge: there is a wide variety of text-to-speech systems for manipulating audio tracks, each with its own characteristics. While some are good at generating emotional speech, others create an almost perfect vocal resemblance to the target person. To cover as many of these characteristics as possible, the MLAAD dataset currently comprises over 90 different systems and is constantly being expanded to include the latest ones. This enables the tool to achieve high detection rates even on new, as yet unknown audio deepfakes. In addition to this technological diversity, the dataset covers over 35 languages, the widest linguistic range of any publicly available dataset to date; most alternatives contain only English or Chinese audio tracks.
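Whether such system diversity actually buys generalization can be checked with a leave-one-system-out evaluation: train the detector on fakes from all but one text-to-speech system and measure performance on the held-out one. The sketch below only illustrates the idea and is not the MLAAD evaluation protocol; the random feature vectors stand in for real audio embeddings, and the three named systems are placeholders for the 90-plus systems in the dataset.

```python
# Leave-one-system-out sketch: train a detector on clips from all but one
# TTS system, then test on the held-out system to estimate how well it
# handles fakes it has never seen. Illustrative only: random vectors stand
# in for audio embeddings, and for simplicity genuine clips are also drawn
# once per system.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
systems = ["tts_a", "tts_b", "tts_c"]  # stand-ins for the 90+ systems
n = 200                                # clips per system and per class

for held_out in systems:
    train_X, train_y, test_X, test_y = [], [], [], []
    for sys_name in systems:
        fakes = rng.normal(loc=0.5, size=(n, 16))  # fake-clip "embeddings"
        reals = rng.normal(loc=0.0, size=(n, 16))  # genuine-clip "embeddings"
        if sys_name == held_out:
            test_X += [fakes, reals]
            test_y += [np.ones(n), np.zeros(n)]
        else:
            train_X += [fakes, reals]
            train_y += [np.ones(n), np.zeros(n)]
    clf = LogisticRegression(max_iter=1000).fit(
        np.vstack(train_X), np.concatenate(train_y))
    auc = roc_auc_score(np.concatenate(test_y),
                        clf.predict_proba(np.vstack(test_X))[:, 1])
    print(f"held-out {held_out}: AUC = {auc:.2f}")
```

With real embeddings, the score on the held-out system is usually the weak spot, and that is precisely the gap that the breadth of a dataset like MLAAD is meant to close.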
Diversity and balance are the key to success, not only for the training data of AI detection tools, but also in the broader fight against the negative consequences of audio fakes. «We will only be able to counter the emerging deepfake era with a combination of better technical detection, education and increased awareness among the population as a whole,» Nicolas Müller is convinced.
Recognize audio fakes for free on the Deepfake Total platform, including the training game
Mandy Bartel
Press Officer, Fraunhofer-Gesellschaft
fraunhofer.de
This text first appeared in Fraunhofer Magazine 2/25; the entire issue is devoted to security and infrastructure.

