Deepfake technology has been a major privacy concern ever since it entered the mainstream public consciousness. In case you didn’t know, a deepfake is a video in which a computer-generated version of one person’s face is mapped onto footage of another person. Advanced machine learning methods make the swap look realistic. Here’s one famous example:
The quality of deepfakes is already very high, and they are becoming more realistic by the day, which raises the issue of truth and propaganda. After all, if you can make the President of the United States say anything at all, you can sow plenty of confusion. For example, have a look at this Barack Obama deepfake, voiced by Jordan Peele.
This works so well because Jordan Peele is a skilled impressionist, but what if anyone could fake a perfect voice? That’s the promise of vocal deepfakes, and it could make this new privacy threat far more serious.
A Tale of Two Siris
When I first got to try out Apple’s Siri digital assistant, I was blown away by how natural its voice sounded. That’s not to say that the first-generation Siri’s voice was perfect. Anyone could tell that it was not a human voice, but it was close enough to change expectations of how natural voice synthesis could feel.
Today, in 2020, Siri’s voices do feel flawless. In fact, just about every high-end synthesized voice I’ve heard feels indistinguishable from a real person. Google’s Duplex system sounds human enough that it can fool people over the phone.
This is the first part of the vocal deepfake technology puzzle. We can now create artificial voices that sound real enough that the average person can’t tell them apart from natural human voices. Now the question is whether we can replicate a real person’s voice to the same degree.
Can You Hear Me Now?
The short answer is yes.
Using machine learning and state-of-the-art vocal synthesis, software can now clone a real person’s voice and make it say anything you want it to.
Consider Lyrebird, which is in private beta as I write this. It’s an AI-based tool that can create a clone of your voice based on a relatively small sample.
That’s one commercial tool, but the overall rise of voice cloning technology is already a cause for concern. Speaking to The Verge, the Federal Trade Commission indicated that it is worried about the potential misuse of this technology. Why? Well, if you can’t imagine the damage voice cloning can do, here are some prime examples.
Using the Naughty Voice
Imagine you get a phone call from your boss, demanding that you urgently transfer money into an account. You recognize their voice over the phone, so you don’t think twice about complying. Later, it turns out your boss never actually phoned you, and it was all a scam.
Now, this sort of scam already happens through email spoofing, and people do fall for it, but with good enough voice cloning, the number of people who’ll be duped is likely to be much greater. That’s one example of a fairly benign scam, albeit a costly one. What about a phone call from your superior officer? A phone call from the President? If you can impersonate anyone, how can anyone trust their chain of command?
The flip side is that anyone can now claim that a recording of something they said has been faked, which means audio and video evidence could become worthless as these technologies improve. The combination of deepfake visuals and audio promises to make fraud the status quo.
Vocal Deepfake Technology: The Good News
It’s not all bad news, however. There are many positive ways this sort of vocal cloning can be used. With a little imagination, it could be world-changing.
Let’s start with medical applications. Think of someone like the late Professor Stephen Hawking, who relied on a crude vocal synthesizer to be his voice. With this new technology, prior recordings of his voice could be used to reconstruct what he sounded like before he lost the ability to speak. It could be a way to preserve a voice as a digital prosthesis.
Another use, as intended by Lyrebird, is for people who use their voices for creative purposes, such as podcasters or video creators who do voiceovers. If you can simply feed text and performance instructions to a synthesized voice, the scope for content creation is massive. The written words of authors could be read aloud in their own voices, as long as sample recordings exist.
Think about interactive systems and entertainment, like video games where character voices don’t have to be recorded. Instead, text-to-speech can create dynamic conversations and spoken lines. The same could be applied to AI chatbots.
As with any technology, whether it’s harmful or helpful comes down to how we use it. The profound fact to keep in mind is that the human voice may soon no longer be the sole domain of human bodies.