Nvidia Researcher: OpenAI’s Sora is Amazing, but not because of the Patches
As we’ve seen, OpenAI’s newly released Sora is remarkably good at generating high-quality, long videos. In this article, we’ll focus on the technical side, look at its training details, and unpack what Sora’s real technical novelty is.
Previous approaches
AI researchers have explored video data generation through various methods.
- Recurrent networks (RNNs)
- Generative adversarial networks (GANs)
- Autoregressive transformers
- Diffusion models
Previous methods typically target narrow visual categories or short, fixed-size videos. RNN- and GAN-based approaches have largely fallen out of favor because their outputs lag far behind in quality and length.
LLM for Videos?
Shortly after the announcement, OpenAI published a technical report. They mention that, inspired by large language models that operate on text tokens, Sora represents videos with analogous units called visual patches.
The report simplifies the methodology considerably, though. Using ViTs for video generation and the idea of “patches” or tokenization is not new.
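To make the “patches” idea concrete, here is a minimal sketch of how a video tensor can be cut into spacetime patches, much like ViT patchifies a single image. The function name, tensor shapes, and patch sizes are illustrative assumptions, not Sora’s actual implementation.

```python
import torch

def patchify_video(video, patch_size=16, tube_len=4):
    """Split a video tensor into flattened spacetime patches.
    A minimal sketch, not Sora's actual code; sizes are illustrative.

    video: (T, C, H, W) -- frames, channels, height, width
    returns: (num_patches, tube_len * C * patch_size * patch_size)
    """
    T, C, H, W = video.shape
    assert T % tube_len == 0 and H % patch_size == 0 and W % patch_size == 0
    # Group frames into short "tubes" and cut each tube into non-overlapping
    # spatial patches, mirroring how ViT patchifies an image.
    patches = (
        video
        .reshape(T // tube_len, tube_len, C,
                 H // patch_size, patch_size, W // patch_size, patch_size)
        .permute(0, 3, 5, 1, 2, 4, 6)   # (t, h, w, tube, C, ph, pw)
        .reshape(-1, tube_len * C * patch_size * patch_size)
    )
    return patches

# Example: a 16-frame 256x256 RGB clip becomes a sequence of patch tokens.
tokens = patchify_video(torch.randn(16, 3, 256, 256))
print(tokens.shape)  # torch.Size([1024, 3072])
```

Each row of the output is one “token” the transformer can attend over, which is what makes variable-length videos look, to the model, like variable-length token sequences.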
So I asked an Nvidia researcher about Sora’s novelty. He pointed me to the academic context, where researchers have long debated the relative merits of ViTs and diffusion-based models.
In recent years, ViT has become THE mainstream visual architecture, except in diffusion models, where UNet still dominates. From DALL·E 2 to Stable Diffusion, UNet is the backbone of most text-to-image models.
In 2023, Google proposed MAGVIT, “a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary.”
The MAGVIT v2 paper is titled Language Model Beats Diffusion: Tokenizer is Key to Visual Generation. The title itself captures the “war” between the two hotly debated approaches. The proposed transformer-based model was later integrated into Google’s VideoPoet, a large language model for zero-shot video generation.
In that sense, yes, Sora proved that a diffusion transformer (DiT) model can achieve surprisingly good results. We’ll explain this in more detail below.
Architecture
As proposed in the paper Scalable Diffusion Models with Transformers, DiT is a neural network structure that combines the benefits of both ViT and diffusion models.
DiT = VAE encoder + ViT + DDPM + VAE decoder
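To see how those four pieces fit together, here is a hedged sketch of one DiT-style training step. The `vae`, `dit`, and `text_emb` objects are assumed stand-ins, and the noise schedule is a toy one; this is not the actual DiT or Sora code.

```python
import torch

def dit_training_step(video, vae, dit, text_emb, num_timesteps=1000):
    """High-level sketch of a DiT-style training step (illustrative only:
    vae, dit, and text_emb are assumed stand-ins, not real interfaces)."""
    # 1. VAE encoder: compress raw pixels into a compact latent.
    z0 = vae.encode(video)                               # (N, D) latent patches

    # 2. DDPM forward process: add Gaussian noise at a random timestep.
    t = torch.randint(0, num_timesteps, (1,))
    noise = torch.randn_like(z0)
    alpha_bar = torch.cos(t / num_timesteps * torch.pi / 2) ** 2  # toy schedule
    zt = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * noise

    # 3. ViT backbone: a transformer over latent patches predicts the noise,
    #    conditioned on the timestep and the text embedding.
    pred_noise = dit(zt, t, text_emb)

    # 4. Standard denoising objective; at sampling time the process runs in
    #    reverse and the VAE decoder maps the final latent back to pixels.
    return torch.nn.functional.mse_loss(pred_noise, noise)
```

The key design choice is that the denoiser is a transformer over latent patch tokens rather than a UNet over pixel grids, which is what lets the same recipe scale with sequence length.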
Saining Xie, the second author of the DiT paper, speculates in his tweet:
- Sora might have ~3B parameters, deduced from calculations on batch size. “Training the Sora model might not require as many GPUs as one would anticipate; I would expect very fast iterations going forward.”
- Sora “might also use Patch n’ Pack (NaViT) from Google, to make DiTs adaptable to variable resolutions/durations/aspect ratios” (see the sketch of the packing idea below).
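For intuition, here is a minimal sketch of the Patch n’ Pack idea: patch tokens from inputs of different sizes are packed into one fixed-length sequence, with a mask that keeps attention inside each example. This mirrors NaViT in spirit only; whether Sora actually uses it is, as noted, speculation, and all names and sizes here are illustrative.

```python
import torch

def pack_examples(patch_seqs, max_len):
    """Pack variable-length patch-token sequences into one fixed-length
    sequence, plus an example-id attention mask (a rough sketch of the
    Patch n' Pack idea, not NaViT's or Sora's actual implementation)."""
    dim = patch_seqs[0].shape[-1]
    packed = torch.zeros(max_len, dim)
    example_ids = torch.full((max_len,), -1, dtype=torch.long)  # -1 = padding
    cursor = 0
    for i, seq in enumerate(patch_seqs):
        n = seq.shape[0]
        if cursor + n > max_len:   # real systems use token dropping / bin packing
            break
        packed[cursor:cursor + n] = seq
        example_ids[cursor:cursor + n] = i
        cursor += n
    # Token j may attend to token k only within the same example
    # (padding positions would also need masking in a real implementation).
    attn_mask = example_ids[:, None] == example_ids[None, :]
    return packed, attn_mask

# Three clips of different lengths/resolutions share one 1024-token sequence.
seqs = [torch.randn(300, 768), torch.randn(512, 768), torch.randn(128, 768)]
packed, mask = pack_examples(seqs, max_len=1024)
```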
Video Compression
According to the technical report, raw videos are compressed into a compact latent representation in both time and space, and a separate decoder model maps the generated latents back to pixel space.
Saining comments that, “Looks like it’s just a VAE but trained on raw video data.”
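Here is a rough sketch of what such a spatio-temporal compressor could look like, assuming (as Saining suggests) a VAE-style autoencoder trained on raw video. The layer sizes are illustrative, and the reparameterization/KL terms of a real VAE are omitted for brevity.

```python
import torch
import torch.nn as nn

class VideoVAE(nn.Module):
    """Minimal sketch of a spatio-temporal video compressor; the architecture
    and sizes are assumptions, not Sora's actual encoder/decoder."""
    def __init__(self, channels=3, latent_dim=8):
        super().__init__()
        # 3D convolutions downsample in both time (x2) and space (x8).
        self.encoder = nn.Sequential(
            nn.Conv3d(channels, 64, 3, stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(64, 128, 3, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(128, latent_dim, 3, stride=(1, 2, 2), padding=1),
        )
        # The decoder mirrors the encoder to map latents back to pixel space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_dim, 128, (3, 4, 4), stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.ConvTranspose3d(128, 64, 4, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.ConvTranspose3d(64, channels, (3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    def forward(self, video):                # video: (B, C, T, H, W)
        z = self.encoder(video)              # compact latent in time and space
        return self.decoder(z), z

vae = VideoVAE()
recon, z = vae(torch.randn(1, 3, 8, 64, 64))
print(z.shape, recon.shape)                  # latent is far smaller than the input
```

The diffusion transformer then operates entirely on these compact latents, which is what keeps minute-long videos computationally tractable.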
Text understanding
- The team reveals that they trained a descriptive captioner model to generate captions for the training videos, a re-captioning technique first introduced in the DALL·E 3 paper. This improved Sora’s text understanding and overall video quality.
- They also leveraged GPT to transform short text prompts from users into detailed captions. Rewriting prompts is almost standard practice for AI products today, bridging the gap between user instructions and model behavior (a minimal sketch of this step follows below).
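A minimal sketch of this prompt-rewriting step, using the public OpenAI chat API. The model name and system instruction here are assumptions for illustration, not the actual pipeline behind Sora.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_prompt(user_prompt: str) -> str:
    """Rewrite a short user prompt into a detailed video caption
    (a hedged sketch of the general prompt-rewriting practice)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice for illustration
        messages=[
            {"role": "system",
             "content": "Expand the user's short prompt into a richly detailed "
                        "video caption: describe the subject, setting, lighting, "
                        "camera motion, and style. Do not add unrelated content."},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

print(expand_prompt("a corgi surfing at sunset"))
```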
Training data
- Sora was trained on videos at their native aspect ratios for better composition and framing.
- People also speculate that Sora’s training involves synthetic visual data rendered from game engines. If so, synthetic data likely played an important role in Sora’s training.
Takeaways
As we can all see, Sora moves the camera smoothly, keeps objects looking consistent even when they are far away, remembers where objects are, and makes the elements in a scene appear to interact with each other and with the background.
Diffusion model wins
Sora showed that scaling laws also hold for diffusion-based video generation. By combining diffusion models with the idea of tokenization (spacetime patches), OpenAI achieved remarkable generation results.
Long general videos
A major advancement in Sora is its ability to create very long videos; the methodologies for making a 5-second video and a minute-long one are fundamentally different.
Before Sora, researchers wondered if generating long consistent videos would require methods tailored to specific categories or even with complex physics simulators. Sora’s results tell us that this can be achieved with end-to-end, general-purpose model training if done properly.
More implications
Video generation will also benefit many other machine learning tasks, such as 3D generation, autonomous driving, and robotics, and may eventually be able to simulate the physical world.
Challenges
As video generation continues to advance, the next frontier lies in tackling the issue of error accumulation and ensuring sustained quality and consistency over time.
Let’s test out how Sora performs when it launches, and hopefully we can conclude more from there!