If you witnessed the first attempts at Generative AI video a year ago, including ‘Will Smith Eating Spaghetti’, I share your horror. If you haven’t, trust me, don’t Google it. As with all things AI, the technology has experienced a generational leap in fidelity in the year since those early demonstrations.
People who are part of the Hugging Face AI community will have had a heads-up about what was coming, but many members of the general public lost their minds when OpenAI took to X (formerly Twitter) in February to share Sora, its text-to-video Generative AI model.
The set of videos, created from text prompts, showed a selection of scenes, actions, and art styles, ranging from nine seconds to a minute in length. Of course, these demonstrations were cherry-picked, but they were simply astonishing.
Using the prompt: ‘A movie trailer featuring the adventures of a 30-year-old spaceman wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors’, Sora was able to generate a video (above) with multiple scenes, complete with realistic cinematic lighting, camera movements, and human-like characters moving within each scene.
In another video (below), Sora created a night scene in Tokyo, with a lady in a red dress, leather jacket and sunglasses walking down the street as the camera tracks backwards, capturing her movements. The wet street reflects the neon signs and the shadows of passers-by. The main character’s movement is a little off, but it’s an impressive demonstration of something generated from text without any editing or corrections. It’s difficult to communicate exactly how compelling each of these videos is, so check them out via the links to the original X (Twitter) posts.
Since the worldwide reveal of Sora, other leaders and innovators in the generative AI space have begun sharing demonstrations of generative video that go further than OpenAI’s impressive model. The EMO: Emote Portrait Alive audio-to-video diffusion model, created by the Institute for Intelligent Computing at Alibaba Group, is one of the most mind-blowing models I’ve seen.
EMO can take a still image and animate it based on an audio source. This would be impressive enough if the model simply moved the subject’s mouth to produce a realistic lip sync, but EMO goes further. Through a two-stage process that trains the model to faithfully map motion to the subject’s face, EMO understands the context of the subject’s facial expression, preserving its emotion and tone while animating the face to deliver the source audio.
We’re rapidly moving into a world where any image of a person, whether still or in motion, can easily be remixed to produce new, compelling content. The optimist in me wonders what opportunities we may have to revisit old content and bring it to life with audio-to-video and text-to-video AI models. There will, of course, be concerns around authenticity and deception, as content produced using these models and their offspring will soon become indistinguishable from reality.
Are you inspired or terrified? I’m a little bit of both.
Read more of Jon Devo's Scanning Ahead blogs