Sora Roundup

Comparisons, capabilities and rundown


On Thursday, Feb 15th, OpenAI unveiled its text-to-video model, Sora, and the world moved forward again. It was not completely unexpected, as many, many teams across the industry were working on individual aspects of video generation. But still… it was a great leap forward. It is very hard to generate coherent video for more than two seconds, let alone up to a minute, without weird morphing artifacts and features that go missing or disappear.

Comparisons

Capabilities

I just want to take a moment to explore the capabilities of the Sora model. It shows:

  • Clear signs of having been trained on the output of a 3D engine

  • The ability to generate multiple videos in the same “world” at the same time, meaning you could eventually view a scene from every possible angle without needing cameras everywhere

  • Sequential scene changes within the same story world

  • Storytelling

  • Worryingly realistic-looking humans

  • Video-to-video editing

Same Data Source

Comparisons between Sora and Midjourney revealed that they seem to have been trained on much of the same data. When we dream in latent space, we have similar dreams.
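
One rough way to put a number on that intuition, purely as a sketch: embed a still from each model with CLIP and check how close the vectors sit. The file names below are hypothetical placeholders, not artifacts from the actual comparison.

```python
# A minimal sketch of quantifying "convergence in latent space":
# embed one still from each model with CLIP and compare cosine similarity.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(path: str) -> torch.Tensor:
    """Return a unit-normalized CLIP image embedding."""
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Hypothetical stills of the same prompt rendered by each model.
sora_vec = embed("sora_frame.png")
midjourney_vec = embed("midjourney_frame.png")

# A cosine similarity near 1.0 suggests the two outputs occupy the same
# neighborhood of the shared embedding space.
print(f"cosine similarity: {(sora_vec @ midjourney_vec.T).item():.3f}")
```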

In effect, the similarity in training data causes convergence to the same region of latent space. Another example below:

We Don’t Know How To Do This

Meanwhile, Yann LeCun, Meta’s chief AI scientist, declared in the Middle East just days prior that generative AI would never reach this milestone:

Yann was out and about on Twitter defending his statements, and to be honest, he may yet be right in the end, but the juxtaposition is a tad embarrassing.

In any case, there was an incredible amount of cope among real-world animators.

Though everyone should know better at this point.

Build Alpha

The best information on the Sora build came from Saining Xie, co-author of the underlying diffusion transformer (DiT) paper:

He goes on to speculate that Sora might be only a 3-billion-parameter model, which would imply:

  • not that many GPUs utilized for generation

  • fast inference

  • cheap

  • lots more runway to improve

  • and quickly
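
Some back-of-envelope arithmetic on why a ~3B-parameter model would imply all of the above. Every figure here is an assumption built on that speculation, not a disclosed Sora spec:

```python
# Rough math for the "~3B parameters" speculation.
# All numbers are assumptions, not confirmed Sora specs.
params = 3e9                 # speculated parameter count
bytes_per_param = 2          # fp16 weights

weights_gb = params * bytes_per_param / 1e9
print(f"weights at fp16: ~{weights_gb:.0f} GB")  # ~6 GB

# Even with generous headroom for activations during the diffusion
# sampling loop, a ~6 GB model fits on a single modern accelerator,
# which is what makes "fast", "cheap", and "lots of runway" plausible.
```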

There are real questions about how closely Sora is simulating reality, with some converting Sora videos into navigable 3D representations known as neural radiance fields:
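
As a sketch of how those conversions typically start (not a documented Sora workflow): dump the clip to individual frames, then hand the frames to a camera-pose and radiance-field pipeline such as COLMAP plus nerfstudio. The paths and sampling rate below are placeholders.

```python
# Step one of a typical video-to-radiance-field pipeline: extract frames.
# The resulting images would then go to COLMAP for camera poses and a
# NeRF/Gaussian-splatting trainer. Paths here are illustrative.
import os
import cv2

def extract_frames(video_path: str, out_dir: str, every_n: int = 2) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:  # subsample to keep pose estimation tractable
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

print(extract_frames("sora_clip.mp4", "frames/"))
```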

OpenAI’s first intern, Dr. Jim Fan, was roundly shouted down, but persisted in arguing that Sora must be performing both world and physics modeling.

Poor Google

Meanwhile, poor Google achieved 5-second videos with Lumiere in late January and has still not released the model to the public. Compare:

The final Sora rundown

  • Leveraging spacetime patches, Sora offers a unified representation for large-scale training across various durations, resolutions, and aspect ratios (see the sketch after this list).

  • It generates high-definition content, showcasing its prowess in handling videos and images with dynamic aspect ratios.

  • It excels in framing and composition, outperforming traditional square-cropped training methods.

  • Utilizing descriptive video captions, Sora achieves higher text fidelity, making it adept at following detailed user prompts for video generation.

  • From animating static images to extending videos, Sora showcases a wide range of editing capabilities.

  • Sora's training reveals emergent properties like 3D consistency and long-range coherence, hinting at its potential as a simulator for the physical and digital world.
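
To make the spacetime-patch bullet concrete, here is a minimal sketch of the idea: carve a video tensor into small time-by-height-by-width blocks and flatten each block into a token, the way a ViT patchifies an image. The sizes are illustrative only; per the technical report, Sora patchifies a compressed latent rather than raw pixels.

```python
# A minimal sketch of "spacetime patches": cut a video tensor into
# (time x height x width) blocks and flatten each block into a token.
# Sizes are illustrative, not Sora's actual configuration.
import torch

T, H, W, C = 16, 256, 256, 3   # frames, height, width, channels
pt, ph, pw = 2, 16, 16         # patch size along time, height, width

video = torch.randn(T, H, W, C)

# (T, H, W, C) -> (T/pt, pt, H/ph, ph, W/pw, pw, C)
patches = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
# Group the grid dimensions together, then flatten each patch into a token.
patches = patches.permute(0, 2, 4, 1, 3, 5, 6)
tokens = patches.reshape(-1, pt * ph * pw * C)

print(tokens.shape)  # (2048, 1536): a sequence of spacetime tokens
```

Because any duration, resolution, or aspect ratio just changes the number of tokens in the sequence, the same transformer can train on all of them, which is the "unified representation" the bullet refers to.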

All Known Soras

A supercut of all known and confirmed Sora videos with their associated prompts.
