So what if OpenAI Sora didn’t create the mind-blowing Balloon Head video without assistance – I still think it’s incredible
Sora fans just learned a hard lesson: filmmakers will be filmmakers and will do what’s necessary to make their creations as convincing and eye-popping as possible. But if this made them think less of OpenAI’s generative AI video platform, they’re wrong.
When OpenAI handed an early version of the generative video AI platform to a bunch of creatives, one team – Shy Kids – created an unforgettable video of a man with a yellow balloon for a head. Many declared Air Head to be a weird and powerful breakthrough, but a behind-the-scenes video has put a rather different spin on it. It turns out that as good as Sora is at generating video from text prompts, there were many things that the platform either couldn’t do or didn’t produce just as the filmmakers wanted.
The video’s post-production editor, Patrick Cederberg, offered FxGuide a lengthy list of changes his team made to Sora’s output to create the stunning effects we saw in the final, 1-minute, 22-second Air Head video.
Sora, for instance, was built with no understanding of typical film shots like panning, tracking, and zooming, so the team sometimes had to create a pan and tilt shot out of an existing, more static clip.
Plus, while Sora is capable of outputting lengthy videos based on long text prompts, there is no guarantee that the subjects in each prompt will remain consistent from one output clip to another. It took considerable work and experimentation with prompts to get videos that joined disparate shots into a semi-connected whole.
As Cederberg notes in an Air Head Behind the Scenes video, “What ultimately you’re seeing took work, time, and human hands to get it looking semi-consistent.”
The balloon head sounds particularly challenging, as Sora understands the idea of a balloon but doesn’t base its output on, say, an individual video or photo of a balloon. In Sora’s original idea, every balloon had a string attached; Cederberg’s team had to paint that out of each frame. More frustratingly, Sora often wanted to put the impression, outline, or drawing of a face on the balloons. And while the final video features a yellow balloon in each shot, the Sora output usually had different balloon colors that Shy Kids would adjust in post.
Shy Kids told FxGuide that all the video they used is Sora output; it’s just that if they had used the video untouched, the film would’ve lacked the continuity and cohesion of the final, wistful product.
This is good news
Does this news turn the charming Shy Kids video into Sora’s Milkshake Duck? Not necessarily.
If you look at some of the unretouched videos and images in the Behind the Scenes video, they’re still remarkable. And while post-production was necessary, Shy Kids never shot a single bit of real film to produce the initial images and video.
Even as AI innovation races forward and we see huge generational leaps as often as every three months, AI of almost any stripe is far from perfect. ChatGPT’s responses are usually accurate, but can still miss the context and get basic facts wrong. With text-to-imagery, the results are even more varied because, unlike AI-generated text responses – which can use fact-based sources and mostly predict the right next word – generative image models base their output on a representation of an idea or concept. That’s particularly true of diffusion models that use training information to figure out what something should look like, which means that output can vary wildly from image to image.
“It’s not as easy as a magic trick: type something in and get exactly what you’re hoping for,” Shy Kids producer Sydney Leeder says in the Behind the Scenes video.
These models may have a general idea of what a balloon or person looks like. Asking such a system to imagine a man on a bike six times will get you six different results. They may all look good, but it’s unlikely the man or bicycle will be the same in every image. Video generation likely compounds the issue: the odds of maintaining scene and image consistency across thousands of frames and from clip to clip are extremely low.
With that in mind, Shy Kids’ accomplishment is even more noteworthy. Air Head manages to maintain both the otherworldliness of an AI video and a cinematic essence.
This is how AI should work
Automation doesn’t mean the complete removal of human intervention. This is as true for videos as it is on the factory floor, where the introduction of robots has not meant people-free production. I vividly recall Elon Musk’s efforts to automate as much of the Tesla Model 3’s production as possible. It was a near disaster and production went more smoothly when he added back the humanity.
A creative process such as filmmaking or production will always require the human touch. Shy Kids needed an idea before they could start feeding it to Sora. And when Sora didn’t understand their intentions, they had to adjust the output by hand. As most creative endeavors do, it became a partnership, one where the accomplished Sora AI provided a tremendous shortcut, but one that still didn’t take the project to completion.
Instead of bursting Air Head’s bubble, these revelations remind us that the marriage of traditional media and AI still requires a human’s guiding hand, and that’s unlikely to change – at least for the time being.