AI-generated speech: The Sound of Dystopia
When it comes to AI-generated speech, I find myself continually struck by one particular gap in the conversation, which can be simply summarised:
I don’t want to live in a sci-fi dystopia. And I don’t think you do either.
Our voices are about as unique as fingerprints: similar on the surface, but with endless variations. As humans, we each produce sound in essentially the same physiological way, but no two larynges are the same. Layer on top regional inflection, cultural influence, education, age, gender, and life paths, and it’s easy to see how our vocal makeup is easily one of the most distinctive and honest things about us.
Fake voice. Fake message.
This is why it feels so dystopian when companies and creators use synthetic voices. Companies will expend inordinate effort on marketing and branding, yet when given the opportunity to select the literal voice they want to put out in the world to represent them, they choose something fake.
It doesn’t take a lot of subtextual analysis to see why that might be a problem for a business trying to grow and retain positive recognition. As an audience member myself, I question the integrity of the messages coming from a faceless, voiceless organisation. If the “person” speaking is not actually real, what reason do I have to trust anything they are saying?
Perhaps this scepticism derives, in part, from the awareness that not a single human being has had to sit down and believe the script they’re reading. Not one. I can’t help but envision the writing process behind this auto-generated noise: was it originally written by a copywriter who was handed a rigid sales strategy? Was it picked apart by a board of directors until it was deemed adequately persuasive? When the speech rings hollow, so do the words.
Homogenising into one shared voice
While many of us are now embracing our own individuality, AI is doing the exact opposite. It takes a combination of everything that already exists and copies this content into a requested format, producing something exceptionally average. When it comes to voice generation, this could present larger problems than creating rubbish work. Let me explain.
The more frequently we use AI-generated speech, the higher the percentage of online voiceover that will be generated by AI. Pretty obvious. However, this means that at some point in the future there could be a tipping point, whereby instead of learning from human speech sounds alone, new AI-generated speech may learn from previous AI-generated speech. As a result, the original voices used by AI would take up a disproportionate share of the online voice space – eventually homogenising the sound of the internet as a whole.
Stunted evolution offline
Even offline, I predict that we will eventually start to emulate these voices. This digital globalisation is already happening in other ways: Americanisms are being adopted throughout the world by people who watch US television. “Internet speak”. Memes.
As of 2022, internet users worldwide were spending an average of almost 2 hours a day on social media. This is likely more time than most people spend socialising with people offline, and so it is reasonable to assume our speech will be more affected by the voices online than by those around us.
If this is true, we will likely all gradually devolve into using one shared accent – largely informed by AI-generated speech. And the greater the share we allow computers to contribute towards our average modes of spoken expression, the more humanity we will lose in how we speak.
Evolution of humanity requires humans.
Ultimately, this is the true price of AI-generated speech. Dystopia is often characterised by restriction of language and communication. For example, in George Orwell’s 1984, the Party enforces “Newspeak” to oversimplify vocabulary, thereby hindering society’s aptitude for critical discussion. In H. G. Wells’ novel The Time Machine, the Morlocks’ mouths are smaller, rendering them incapable of effective speech. In Ray Bradbury’s Fahrenheit 451, they burn books. And today, we are severing small collections of human sounds to replace all real voices.
I appreciate that AI-generated speech is improving and becoming more realistic, but this entirely misses the point. It is not real. A voice is trusted because it represents something. It comes from somewhere. It carries with it the resonance of a shared understanding of what it is to live. Speech is just speech. So for the sake of our shared humanity, I hope you agree.