Dangerous copies: how AI learned to fake voices

The development of artificial intelligence technologies has led to the emergence of sophisticated speech synthesis systems. Today, they are used in a variety of areas, from the film industry to medicine, but at the same time they create new challenges for fact-checkers.

Digital voice-faking technology has evolved alongside generative AI. Early experiments began with text-to-speech (TTS) systems, but the quality remained robotic, and each voice required hours of professional studio recording. The real breakthrough came with neural networks: the time needed to create a convincing copy of a voice has dropped from several hours in the 2010s to a few seconds today, and the quality has reached a level where even experts cannot always distinguish synthesized speech from natural speech. Let’s consider the prospects of this technology.

What are the benefits of voice deepfakes?

Voice deepfakes are a tool, and like any powerful tool, they can be used in many ways. Modern voice synthesis capabilities present vast opportunities in many socially significant areas.

  • Film industry

– Dubbing of films and TV series — AI can dub actors in different languages, preserving their original intonation.
– Voice rejuvenation — for example, for flashbacks where a character must sound younger (as with Luke Skywalker’s voice in The Mandalorian).
– Completing unfinished work — if an actor has died or is unable to record lines (as with Anthony Bourdain’s voice in the documentary about him).
– Personalized content — imagine an audio guide in a museum with the voice of your favorite historian or actor.

  • Voice assistants and technologies

– Personalization of assistants — you can create an AI voice that sounds like a friend, a family member, or even yourself.

– Accessibility — people with speech impairments can use a synthesized voice that sounds like their own (like Stephen Hawking’s, but far more natural sounding).

  • Medicine and rehabilitation

– Voice restoration — services such as Voice Keeper are designed for patients at risk of losing their voice. The technology lets a patient record their voice in advance and create a personalized synthesized version, which can then be used with communication devices.

  • Education and content

– Learning languages — you can hear the correct pronunciation from a native speaker or even “talk” with historical figures.
– Lectures and podcasts — teachers and bloggers can scale content without recording each video manually.

  • Business and services

– Automation — instead of recording hundreds of audio messages for advertising, you can synthesize them in minutes.

– Localization — international projects adapt to new markets faster without hiring voice actors.


What are the dangers of voice fakes?

Imagine that your boss, a relative, or a close friend calls and asks you to urgently transfer money, share personal data, or help in a critical situation. You hear that person’s voice, their usual intonation and manner of speaking, and you have no reason to doubt that the call is genuine. Some time later, however, it turns out that the person you spoke with knew nothing about the call. Artificial intelligence had faked the voice completely, reproducing its timbre, emotions, and speech patterns. This is already possible today, and such cases are becoming more and more frequent.

A voice deepfake is not just a recording that reproduces someone’s voice; it is a tool that can generate phrases the person never actually said. Imagine: an audio recording appears online in which a well-known politician makes provocative statements, and people begin to believe it. Such technologies can be used to fabricate incriminating evidence, false confessions, or set-ups. We live in a world where information spreads at lightning speed, and verifying its veracity is becoming increasingly difficult.

Another example: users of a well-known web forum created deepfake audio recordings in which celebrities allegedly uttered offensive and provocative phrases. These recordings quickly spread across the Internet, and even after they were exposed, many people continued to believe them.

To understand how dangerous voice deepfakes are, you need to understand how they work. Artificial intelligence is trained on speech samples, analyzing timbre, rhythm, pauses, and even individual pronunciation errors. The more data it receives, the more accurate the model becomes at reproducing a person’s voice.
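The analysis described above starts with splitting speech into short overlapping frames and measuring the frequency content of each one. A minimal sketch of that first step in Python, using only NumPy — the function name and parameters here are illustrative, not taken from any real voice-cloning tool:

```python
import numpy as np

def spectral_features(signal, frame=512, hop=256):
    """Split a mono signal into overlapping windowed frames and return
    the magnitude spectrum of each frame — the raw material from which
    timbre-like features of a voice are derived."""
    n = 1 + (len(signal) - frame) // hop
    window = np.hanning(frame)
    frames = np.stack([signal[i * hop : i * hop + frame] * window
                       for i in range(n)])
    # one row of spectral magnitudes per frame
    return np.abs(np.fft.rfft(frames, axis=1))

# one second of a synthetic 220 Hz tone standing in for a voice sample
sr = 16000
t = np.arange(sr) / sr
spec = spectral_features(np.sin(2 * np.pi * 220 * t))
print(spec.shape)  # (n_frames, frame // 2 + 1)
```

A real system would go on to aggregate these spectra into compact speaker embeddings; the point here is only that a few seconds of audio already yield hundreds of frames for a model to learn from.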

Today, to create a deepfake, all you need is a few seconds of audio recording, such as a voice message or an interview.

Systems such as Microsoft’s VALL-E and ElevenLabs’ voice-cloning models can create a voice model that doesn’t just repeat words, but adapts to emotion and context.

Generative adversarial networks (GANs) go further: one neural network (the generator) creates fakes, while a second (the discriminator) tries to tell them apart from genuine recordings. Trained against each other, the two networks push the fakes to become increasingly realistic — until the synthesized voice is hard to distinguish from the original, even when you listen carefully.
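The adversarial loop can be shown with a deliberately tiny example. In the sketch below, the "real voices" are just one-dimensional Gaussian samples, the generator is a single shift parameter, and the discriminator is logistic regression — none of this comes from an actual deepfake system, but the alternating train-to-fool / train-to-detect structure is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

theta = 0.0       # generator parameter: fake samples are z + theta
w, b = 0.1, 0.0   # discriminator: D(x) = sigmoid(w * x + b)
lr, batch = 0.05, 64

for step in range(3000):
    real = rng.normal(4.0, 1.0, batch)           # "real voice" data, mean 4
    fake = rng.normal(0.0, 1.0, batch) + theta   # generator output

    # discriminator step: ascend log D(real) + log(1 - D(fake))
    d_real = sigmoid(w * real + b)
    d_fake = sigmoid(w * fake + b)
    w += lr * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
    b += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # generator step: ascend log D(fake) (non-saturating loss),
    # i.e. nudge theta so the fakes fool the discriminator
    fake = rng.normal(0.0, 1.0, batch) + theta
    d_fake = sigmoid(w * fake + b)
    theta += lr * np.mean((1 - d_fake) * w)

print(theta)  # theta has moved from 0 toward the real mean of 4
```

The generator starts producing samples around 0 and, step by step, shifts them toward the real distribution until the discriminator can no longer tell the two apart — the same pressure that makes a GAN-trained voice converge toward the target speaker.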

There are already algorithms that can automatically create text and voice it in the voice of a famous person, adjusting the speech to the context of the conversation. This means that criminals can not only forge existing recordings, but also generate new dialogues that never existed.

But that’s not all. One of the most worrying directions for voice deepfakes is their integration with chatbots and voice assistants. Such systems are already being tested; their official goal is to create more natural voice interfaces, but they can also be used in fraudulent schemes. What if a voice assistant could not only reproduce someone’s voice, but also conduct a dialogue, adapting to the interlocutor’s answers? That is no longer just a recorded voice, but a dynamic fake that can persuade, confuse, and even manipulate. Read about how exactly attackers use synthesized voices, and how to counter this, in the following educational materials from GFCN.