Meta’s Voicebox AI is a Dall-E for text-to-speech




Andrew Tarantola




Our reverence towards stars and celebrities was not born of the 19th century’s cinematic revolution; it has been a resilient aspect of our culture for millennia. Ancient tales of immortal gods rising again and again after fatal injury, the veneration and deification of social and political leaders, Madame Tussauds’ wax museums and the Academy Awards’ annual In Memoriam segment: all are facets of the human compulsion to put well-known thought leaders, tastemakers and trendsetters up on pedestals. And with a new, startlingly lifelike generation of generative artificial intelligence (gen-AI) at our disposal, today’s celebrities could potentially remain with us long after their natural deaths. Like ghosts, but still on TV, touting Bitcoin and Metaverse apps. Probably.


Fame is the name of the game


American historian Daniel Boorstin once quipped, “to be famous is to be well known for being well-known.” With the rise of social media, achieving celebrity is now easier than ever, for better or worse.


“Whereas stars are often associated with a kind of meritocracy,” said Dr. Claire Sisco King, Associate Professor of Communication Studies and Chair of the Cinema and Media Arts program at Vanderbilt University. “Celebrity can be acquired through all kinds of means, and of course, the advent of digital media has, in many ways, changed the contours of celebrity because so-called ordinary people can achieve fame in ways that were not accessible to them prior to social media.”


What’s more, social media provides a degree of access and intimacy between celebrities and their fans unmatched even at the peak of the paparazzi era. “We develop these imagined intimacies with celebrities and think about them as friends and loved ones,” King continued. “I think that those kinds of relationships illustrate the longing that people have for senses of connectedness and interrelatedness.”






For as vapid as modern celebrity existence is portrayed in popular media, famous people have long served important roles in society as trendsetters and cultural guides. During the Victorian era, for example, British subjects would wear miniature portraits of Queen Victoria to signal their fealty, and her choice to wear a white wedding gown in 1840 started the modern tradition. In the US, that manifests with celebrities as personifications of the American Dream — each and every one having pulled themselves up by their bootstraps and sworn off avocado toast to achieve greatness, despite humble beginnings presumably in a suburban garage of some sort.


“The narratives that we return to,” King said, “can become comforts for making sense of that inevitable part of the human experience: our finiteness.” But what if our cultural heroes didn’t die? At least not entirely? What if, even after Tom Hanks shuffles off this mortal coil, his likeness and personality were digitally preserved in perpetuity? We’re already sending long-dead recording artists like Roy Orbison, Tupac Shakur and Whitney Houston back out on tour as holographic performers. The large language models (LLMs) that power popular chatbots like ChatGPT, Bing Chat and Bard are already capable of mimicking the writing styles of whichever authors they’ve been trained on. What’s to stop us from smashing these technologies together into an interactive Tucker-Dolcetto amalgamation of synthesized content? Turns out, not much beyond the threat of a bad news cycle.


How to build a 21st century puppet


Cheating death has been an aspirational goal of humanity since prehistory. The themes of resurrection, youthful preservation and outright immortality are common tropes throughout our collective imagination — notions that have founded religions, instigated wars, and launched billion dollar beauty and skin care empires. If a society’s elites weren’t mummifying themselves ahead of a glorious afterlife, bits and pieces of their bodies and possessions were collected and revered as holy relics, cultural artifacts to be cherished and treasured as a physical connection to the great figures and deeds of yore.


Technological advances since the Middle Ages have, thankfully, by and large eliminated the need to carry desiccated bits of your heroes in a coat pocket. Today, fans can connect with their favorite celebrities — whether still alive or long-since passed — through the star’s available catalog of work. You can watch Robin Williams’ movies, stand-up specials and Mork and Mindy, and read his books, arguably more easily now than when he was alive. Nobody’s toting scraps of hallowed rainbow suspenders when they can rent Jumanji from YouTube on their phone for $2.99. The same is true for William Shakespeare, whose collected works you can read on a Kindle as you wait in line at the DMV.


At this point, it doesn’t really matter how long a beloved celebrity has been gone — so long as sufficiently large archives of their work remain, digital avatars can be constructed in their stead using today’s projection technologies, generative AI systems, and deepfake audio/video. Take the recent fad of deceased singers and entertainers “going back out on tour” as holographic projections of themselves for example.


The projection systems developed by BASE Hologram and the now-defunct HologramUSA, which made headlines in the middle of the last decade for their spectral representations of famously deceased celebrities, used a well-known projection effect called Pepper’s Ghost. In the technique, popularized in the 1860s by British scientist John Henry Pepper, the image of an off-stage performer is reflected onto a transparent sheet of glass interposed between the stage and audience, producing a translucent, ethereal effect ideal for depicting the untethered spirits that routinely haunted theatrical protagonists at the time.
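The underlying geometry is just mirror reflection: the audience sees a virtual image of the hidden performer reflected across the plane of the tilted glass. Here is a minimal 2D sketch in Python; the coordinates and the 45-degree layout are illustrative assumptions, not any vendor’s actual rig.

```python
import math

# Toy 2D model of the Pepper's Ghost illusion: a pane of glass tilted
# at 45 degrees acts as a partial mirror, so a brightly lit performer
# hidden below the stage appears as a virtual image standing on it.

def reflect(point, normal):
    """Reflect a 2D point across a line through the origin whose unit
    normal is `normal`: p' = p - 2 (p . n) n."""
    px, py = point
    nx, ny = normal
    d = px * nx + py * ny
    return (px - 2 * d * nx, py - 2 * d * ny)

# The glass runs along the direction (1, 1) -- a 45-degree tilt -- so
# its unit normal is (1, -1) / sqrt(2).
n = (1 / math.sqrt(2), -1 / math.sqrt(2))

performer = (0.0, -2.0)  # performer hidden 2 units below stage level
# round() and "+ 0.0" tidy floating-point noise (and normalize -0.0)
ghost = tuple(round(c, 9) + 0.0 for c in reflect(performer, n))
print(ghost)             # (-2.0, 0.0): the image appears at stage height
```

The performer standing below the stage maps to an upright virtual image at stage height, which is exactly the “ghost” the audience perceives through the glass.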


Turns out, the technique works just as well with high-definition video feeds and LED light sources as it did with people wiggling in bedsheets by candlelight. The modern equivalent, called the Musion Eyeliner, uses a thin metalized film set at a 45-degree angle towards the audience rather than a transparent sheet of glass. It’s how the Gorillaz played “live” at the 2006 Grammy Awards and how Tupac posthumously performed at Coachella in 2012, but the technology is limited by the size of the transparent sheet. If we’re ever going to get the Jaws 19 signage Back to the Future II promised us, we’re likely going to use arrays of fan projectors like those developed by London-based holographic startup Hypervsn to do so.


“Holographic fans are types of displays that produce a 3-dimensional image seemingly floating in the air using the principle of POV (Persistence of Vision), using strips of RGB LEDs attached to the blades of the fan and a control-unit lighting up the pixels,” Dr Priya C, Associate Professor at Sri Sairam Engineering College, and team wrote in a 2020 study on the technology. “As the fan rotates, the display produces a full picture.”


Dr Priya C goes on to say: “Generally complex data can be interpreted more effectively when displayed in three dimensions. In the information display industry, three dimensional (3D) imaging, display, and visualization are therefore considered to be one of the key technology developments that will enter our daily life in the near future.”
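The persistence-of-vision trick described above is straightforward to model: the controller maps the blade’s current angle to a column of the source image and lights the LED strip accordingly. A minimal sketch in Python, with made-up numbers rather than any real fan’s specs:

```python
# Toy model of a POV fan display controller: as the blade sweeps, light
# the LED strip with whichever image column matches the current angle.
# RPM and column count are illustrative, not a real product's specs.

RPM = 3000            # fan rotation speed
COLUMNS = 360         # angular resolution: one image column per degree

def column_at(t):
    """Image column the LED strip should display at time t (seconds)."""
    revolutions = t * RPM / 60.0     # how far the blade has turned
    fraction = revolutions % 1.0     # position within the current turn
    return int(fraction * COLUMNS)

# One revolution takes 60 / 3000 = 0.02 s -- faster than the eye's
# roughly 1/20 s persistence, so the swept columns fuse into one image.
print(column_at(0.0))      # 0
print(column_at(0.01))     # 180: halfway through a revolution
```

Because each full sweep completes faster than the eye’s persistence threshold, the rapidly flashed columns blend into what appears to be a single stable picture hanging in mid-air.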


“From a technical standpoint, the size [of a display] is just a matter of how many devices you are using and how you actually combine them,” Hypervsn Lead Product Manager, Anastasia Sheluto, told Engadget. “The biggest wall we have ever considered was around 400 devices, that was actually a facade of one building. A wall of 12 or 15 [projectors] will get you up to 4k resolution.” While the fan arrays need to be enclosed to protect them from the elements and the rest of us from getting whacked by a piece of plastic revolving at a few thousand RPMs, these displays are already finding use in museums and malls, trade shows and industry showcases.
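Sheluto’s scaling math is easy to sanity-check: total resolution grows linearly with the grid of devices. A back-of-the-envelope sketch follows; the per-device pixel count is an assumption chosen only to make the arithmetic concrete, not a published Hypervsn figure.

```python
# Rough sketch of how tiling holographic fan units scales resolution.
# DEVICE_PX is an assumed pixel diameter per unit, for illustration
# only -- not a published Hypervsn specification.

DEVICE_PX = 720

def wall_resolution(cols, rows, device_px=DEVICE_PX):
    """Approximate pixel dimensions of a cols x rows wall of fan units."""
    return cols * device_px, rows * device_px

# Fifteen devices in a 5 x 3 grid lands in 4K territory, in line with
# the "12 or 15 projectors will get you up to 4k" estimate.
print(wall_resolution(5, 3))    # (3600, 2160)
```

Under that assumption, a 400-device facade like the one Sheluto describes would simply extend the same grid arithmetic to building scale.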



What’s more, these projector systems are rapidly gaining streaming capabilities, allowing them to project live interactions rather than merely pre-recorded messages. Finally, Steven Van Zandt’s avatar in the ARHT Media Holographic Cube at Newark International will do more than stare like he’s not mad, just disappointed, and the digital TSA assistants of tomorrow may do more than repeat rote instructions for passing travelers as the human ones do today.


Getting Avatar Van Zandt to sound like the man he’s based on is no longer a particularly difficult feat, either. Advances in deepfake audio (more formally known as speech synthesis) and in text-to-speech AI such as Amazon Polly or Speech Services by Google have led to the commercialization of synthesized celebrity voiceovers.


Where once a choice between Morgan Freeman and Darth Vader reading our TomTom directions was considered bleeding-edge cool, today companies like Speechify offer voice models from Snoop Dogg, Gwyneth Paltrow and other celebs who (or whose estates) have licensed their voices for use. Even recording artists who haven’t given express permission are finding deepfakes of their work popping up across the internet.


In Speechify’s case at least, “our celebrity voices are strictly limited to personal consumption and exclusively part of our non-commercial text-to-speech (TTS) reader,” Tyler Weitzman, Speechify Co-Founder and Head of AI, told Engadget via email. “They’re not part of our Voice Over Studio. If a customer wants to turn their own voice into a synthetic AI voice for their own use, we’re open to conversations.”


“Text-to-speech is one of the most important technologies in the world to advance humanity,” Weitzman continued. “[It] has the potential to dramatically increase literacy rates, spread human knowledge, and break cultural barriers.”


ElevenLabs’ Prime Voice AI software can similarly recreate near-perfect vocal clones from uploaded voice samples. The entry-level Instant Voice Cloning service requires only around a minute of audio, though it skips actual AI model training (limiting its range of speech), while the enterprise version can only be accessed after showing proof that the voice being cloned is licensed for that specific use. What’s more, “Cloning features are limited to paid accounts so if any content created using ElevenLabs is shared or used in a way that contravenes the law, we can help trace it back to the content creator,” ElevenLabs added.


The enterprise-grade service also requires nearly three hours of input data to properly train the voice model, but company reps assure Engadget that “the results are almost indistinguishable from the original person’s voice.” Surely Steven Van Zandt was onscreen for that long over the course of Lilyhammer’s three-season run.
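As a back-of-the-envelope check on that three-hour bar, even a modest scripted-TV back catalog yields plenty of raw speech. The episode count, runtime and screen-time share below are rough assumptions for illustration, not measured figures:

```python
# Rough estimate of usable training audio from a TV back catalog.
# All inputs are ballpark assumptions, not measured figures.

episodes = 24             # e.g. a few seasons of a scripted drama
minutes_per_episode = 45
speaking_share = 0.20     # fraction of runtime the lead is speaking

total_minutes = episodes * minutes_per_episode        # 1080
speech_hours = total_minutes * speaking_share / 60    # 3.6

print(total_minutes, speech_hours)
```

Even at a conservative one-fifth of screen time spent speaking, a run of that length comfortably clears the roughly three hours of audio the training process demands.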


Unfortunately, the current need for expansive, preferably high-quality audio recordings on which to train an AI TTS model severely limits which celebrity personalities we’d be able to bring back. Stars and public figures from the second half of the 20th century obviously stand a far better chance of having three hours of tape available for training than, say, Presidents Jefferson or Lincoln. Sure, a user could conceivably reverse engineer a voiceprint from historical records — ElevenLabs Voice Design allows users to generate unique voices with adjustable qualities like age, gender, or accent — and potentially recreate Theodore Roosevelt’s signature squeaky voice, but it’ll never be quite the same as hearing the 26th President himself.


Providing something for the synthesized voices to say is proving a significant challenge — at least, providing something historically accurate, as the GPT-3-powered iOS app Historical Figures Chat has shown. Riding the excitement around ChatGPT, the app was billed as able to impersonate any of 20,000 famous folks from the annals of history. Despite its viral popularity in January, the app has been criticized by historians for the numerous factual and characteristic inaccuracies in its figure models. Genocidal Cambodian dictator Pol Pot at no point in his reign showed remorse for his nation’s Killing Fields, nor did SS chief and Holocaust architect Heinrich Himmler, yet even gentle prodding was enough to have their digital recreations spouting mea culpas.


“It’s as if all of the ghosts of all of these people have hired the same PR consultants and are parroting the same PR nonsense,” Zane Cooper, a researcher at the University of Pennsylvania, remarked to the Washington Post.


We can, but should we?


Accuracy issues aren’t the only challenges generative AI “ghosts” currently face, as apparently, even death itself will not save us from copyright and trademark litigation. “There’s already a lot of issues emerging,” Dan Schwartz, partner and IP trial lawyer at Nixon Peabody, told Engadget. “Especially for things like ChatGPT and generative AI tools, there will be questions regarding ownership of any intellectual property on the resulting output.


“Whether it’s artwork, whether it’s a journalistic piece, whether it’s a literary piece, whether it is an academic piece, there will be issues over the ownership of what comes out of that,” he continued. “That issue has really yet to be defined and I think we’re still a ways away from intellectual property laws fully having an opportunity to address it. I think these technologies have to percolate and develop a little bit and there will be some growing pains before we get to meaningful regulation on them.”


The US Copyright Office in March announced that AI-generated art cannot be copyrighted by the user under US law, equating the act of prompting the computer to produce a desired output with asking a human artist the same. “When an AI technology receives solely a prompt from a human and produces complex written, visual, or musical works in response, the ‘traditional elements of authorship’ are determined and executed by the technology — not the human user,” the office stated.


That stance echoes the one already taken by a federal appeals court on the patent side. “[Patent law regarding AI] for the most part, is pretty well settled here in the US,” Schwartz said, “that an AI system cannot be an inventor of a new, patentable invention. It’s got to be a human, so that will impact how people apply for patents that come out of generative AI tools.”


Output-based infringement aside, the training methods used by firms like OpenAI and Stability AI, which rely on trawling the public web for data with which to teach their models, have proven problematic as well, having repeatedly caught lawsuits for getting handsy with other people’s licensed artwork. What’s more, generative AI has already shown tremendous capacity and capability in creating illegal content. Deepfake porn ads featuring the synthetic likenesses of Emma Watson and Scarlett Johansson ran on Facebook for more than two days in March before being flagged and removed, for example.


Until the wheels of government can turn enough to catch up to these emerging technologies, we’ll have to rely on market forces to keep companies from disrupting the rest of us back into the stone age. So far, such forces have proved quick and efficient. When Google’s new Bard system confidently fumbled basic facts about the James Webb Space Telescope, that little whoopsie-doodle promptly wiped $100 billion off the company’s stock value. The Historical Figures Chat app, similarly, is no longer available for download on the App Store, despite reportedly receiving multiple investment offers in January. It has since been replaced with numerous similarly named clone apps.


“I think what is better for society is to have a system of liability in place so that people understand what the risks are,” Schwartz argued. “So that if you put something out there that creates racist, homophobic, anti-any protected class, inappropriate content, whoever’s responsible for making that tool available, will likely end up facing the potential of liability. And I think that’s going to be pretty well played out over the course of the next year or two.”


Celebrity as an American industry


While the term “celebrity” has been around since being coined in 18th century France, during the days of Jean-Jacques Rousseau, it was the Americans of the 20th century who first built the concept into a commercial enterprise.


By the late 1920s, with the advent of talkies, the auxiliary industry of fandom was already in full swing. “You [had] fan magazines like Motion Picture Story Magazine or Photoplay that would have pictures of celebrities on the cover, have stories about celebrities behind the scenes, stories about what happened on the film set,” King explained. “So, as the film industry develops alongside this, you start to get Hollywood Studios.” And with the Hollywood studios came the star system.


“Celebrity has always been about manufacturing images, creating stories,” King said. The star system existed in the 1930s and ‘40s and did to young actors and actresses what Crypton Future Media did to Hatsune Miku: it assembled them into products, constructing synthetic personalities for them from the ground up.


Actors, along with the screenwriters, directors and studio executives of the era, would coordinate to craft specific personas for their stars. “You have the ingénue or the bombshell,” King said. “The studios worked really closely with fan magazines, with their own publicity arms and with gossip columnists to tell very calculated stories about who the actors were.” This diverted focus from the film itself and placed it squarely on the constructed, steerable personas crafted by the studio — another mask for actors to wear, publicly and even after the cameras were turned off.


“Celebrity has existed for centuries and the way it exists now is not fundamentally different from how it used to be,” King added. “But it has been really amplified, intensified and made more ubiquitous because of changing industry and technological norms that have developed in the 20th and 21st centuries.”


Even after Tom Hanks is dead, Tom Hanks Prime will live forever


Between the breakneck pace of technological advancement with generative AI (including deepfake audio and video), the promise of future “touchable” plasma displays offering hard light-style tactile feedback through femtosecond laser bursts, and Silicon Valley’s gleeful disregard towards the negative public costs borne from their “disruptive” ideas, the arrival of immortal digitized celebrities hawking eczema creams and comforting lies during commercial breaks is now far more likely a matter of when, rather than if.


But what does that mean for celebrities who are still alive? How will it feel to know that, even after the ravages of time take Tom Hanks from us, at least a lightly interactive likeness might continue to exist digitally? Does the visceral knowledge that we’ll never truly be rid of Jimmy Fallon empower us to loathe him even more?


“This notion of the simulacra of the celebrity, again, is not entirely new,” King explained. “We can point to something like the Madame Tussaud’s wax museum, which is an attempt to give us a version of the celebrity, there are impersonators who dress and perform as them, so I think that people take a certain kind of pleasure in having access to an approximation of the celebrity. But that experience never fully lives up.”


“If you go and visit the Mona Lisa in the Louvre, there’s a kind of aura [to the space],” she continued. “There’s something intangible, almost magical about experiencing that work of art in person versus seeing a print of it on a poster or on a museum tote bag or, you know, coffee mug that it loses some of its kind of ineffable quality.”


 
