Nvidia Researcher: OpenAI’s Sora is Amazing, but not because of the Patches
As we’ve seen, OpenAI’s newly released Sora is remarkably good at generating high-quality, long videos. In this article, we’ll focus on the technical side, look at its training details, and unpack what Sora’s real technical novelty is.
Previous approaches
AI researchers have explored video data generation through various methods.
- Recurrent networks (RNNs)
- Generative adversarial networks (GANs)
- Autoregressive transformers
- Diffusion models
Previous methods typically target narrow visual categories or short, fixed-size videos. RNN- and GAN-based approaches have largely fallen out of favor because their outputs lag far behind in quality and length.
LLM for Videos?
Shortly after the announcement, OpenAI published a technical report. They mention that, inspired by large language models that operate on text tokens, Sora represents videos with analogous units called visual patches.
The report simplifies the methodology considerably, though. Using ViTs for video generation and the idea of “patches” or tokenization is not new.
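To make the “patches” idea concrete, here is a minimal sketch of how a video tensor can be cut into spacetime patches, much like ViT patchifies a single image. The function name, tensor shapes, and patch sizes are illustrative assumptions, not Sora’s actual implementation.

```python
import torch

def patchify_video(video, patch_size=16, tube_len=4):
    """Split a video tensor into flattened spacetime patches.
    A minimal sketch, not Sora's actual code; sizes are illustrative.

    video: (T, C, H, W) -- frames, channels, height, width
    returns: (num_patches, tube_len * C * patch_size * patch_size)
    """
    T, C, H, W = video.shape
    assert T % tube_len == 0 and H % patch_size == 0 and W % patch_size == 0
    # Group frames into short "tubes" and cut each tube into non-overlapping
    # spatial patches, mirroring how ViT patchifies an image.
    patches = (
        video
        .reshape(T // tube_len, tube_len, C,
                 H // patch_size, patch_size, W // patch_size, patch_size)
        .permute(0, 3, 5, 1, 2, 4, 6)   # (t, h, w, tube, C, ph, pw)
        .reshape(-1, tube_len * C * patch_size * patch_size)
    )
    return patches

# Example: a 16-frame 256x256 RGB clip becomes a sequence of patch tokens.
tokens = patchify_video(torch.randn(16, 3, 256, 256))
print(tokens.shape)  # torch.Size([1024, 3072])
```

Each row of the output is one “token” the transformer can attend over, which is what makes variable-length videos look, to the model, like variable-length token sequences.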
So I asked an Nvidia researcher about Sora’s novelty. He pointed me to the academic context, where researchers have long debated the relative merits of ViTs and diffusion-based models.
In recent years, ViT has become THE mainstream visual architecture, except in diffusion models, where UNet still dominates. From DALL·E 2 to Stable Diffusion, UNet is the backbone of most text-to-image models.
In 2023, Google proposed MAGVIT, “a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary.”
The MAGVIT v2 paper is titled Language Model Beats Diffusion: Tokenizer is Key to Visual Generation. The title itself captures the “war” between the two hotly debated approaches. The proposed transformer-based model was later integrated into Google’s VideoPoet, a large language model for zero-shot video generation.
In that sense, yes, Sora proved that a diffusion transformer (DiT) model can achieve surprisingly good results. We’ll explain this in more detail below.
Architecture
As proposed in the paper Scalable Diffusion Models with Transformers, DiT is a neural network structure that combines the benefits of both ViT and diffusion models.
DiT = VAE encoder + ViT + DDPM + VAE decoder
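To see how those four pieces fit together, here is a hedged sketch of one DiT-style training step. The `vae`, `dit`, and `text_emb` objects are assumed stand-ins, and the noise schedule is a toy one; this is not the actual DiT or Sora code.

```python
import torch

def dit_training_step(video, vae, dit, text_emb, num_timesteps=1000):
    """High-level sketch of a DiT-style training step (illustrative only:
    vae, dit, and text_emb are assumed stand-ins, not real interfaces)."""
    # 1. VAE encoder: compress raw pixels into a compact latent.
    z0 = vae.encode(video)                               # (N, D) latent patches

    # 2. DDPM forward process: add Gaussian noise at a random timestep.
    t = torch.randint(0, num_timesteps, (1,))
    noise = torch.randn_like(z0)
    alpha_bar = torch.cos(t / num_timesteps * torch.pi / 2) ** 2  # toy schedule
    zt = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * noise

    # 3. ViT backbone: a transformer over latent patches predicts the noise,
    #    conditioned on the timestep and the text embedding.
    pred_noise = dit(zt, t, text_emb)

    # 4. Standard denoising objective; at sampling time the process runs in
    #    reverse and the VAE decoder maps the final latent back to pixels.
    return torch.nn.functional.mse_loss(pred_noise, noise)
```

The key design choice is that the denoiser is a transformer over latent patch tokens rather than a UNet over pixel grids, which is what lets the same recipe scale with sequence length.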
Saining Xie, the second author of the DiT paper, speculates in his tweet:
- Sora might have ~3B parameters, deduced from calculations on batch size. “Training the Sora model might not require as many GPUs as one would anticipate; I would expect very fast iterations going forward.”
- Sora “might also use Patch n’ Pack (NaViT) from Google, to make DiTs adaptable to variable resolutions/durations/aspect ratios” (see the sketch of the packing idea below).
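For intuition, here is a minimal sketch of the Patch n’ Pack idea: patch tokens from inputs of different sizes are packed into one fixed-length sequence, with a mask that keeps attention inside each example. This mirrors NaViT in spirit only; whether Sora actually uses it is, as noted, speculation, and all names and sizes here are illustrative.

```python
import torch

def pack_examples(patch_seqs, max_len):
    """Pack variable-length patch-token sequences into one fixed-length
    sequence, plus an example-id attention mask (a rough sketch of the
    Patch n' Pack idea, not NaViT's or Sora's actual implementation)."""
    dim = patch_seqs[0].shape[-1]
    packed = torch.zeros(max_len, dim)
    example_ids = torch.full((max_len,), -1, dtype=torch.long)  # -1 = padding
    cursor = 0
    for i, seq in enumerate(patch_seqs):
        n = seq.shape[0]
        if cursor + n > max_len:   # real systems use token dropping / bin packing
            break
        packed[cursor:cursor + n] = seq
        example_ids[cursor:cursor + n] = i
        cursor += n
    # Token j may attend to token k only within the same example
    # (padding positions would also need masking in a real implementation).
    attn_mask = example_ids[:, None] == example_ids[None, :]
    return packed, attn_mask

# Three clips of different lengths/resolutions share one 1024-token sequence.
seqs = [torch.randn(300, 768), torch.randn(512, 768), torch.randn(128, 768)]
packed, mask = pack_examples(seqs, max_len=1024)
```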
Video Compression
According to the technical report, raw videos are compressed into a compact latent representation in both time and space, and a separate decoder model maps the generated latents back to pixel space.
Saining comments that, “Looks like it’s just a VAE but trained on raw video data.”
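Here is a rough sketch of what such a spatio-temporal compressor could look like, assuming (as Saining suggests) a VAE-style autoencoder trained on raw video. The layer sizes are illustrative, and the reparameterization/KL terms of a real VAE are omitted for brevity.

```python
import torch
import torch.nn as nn

class VideoVAE(nn.Module):
    """Minimal sketch of a spatio-temporal video compressor; the architecture
    and sizes are assumptions, not Sora's actual encoder/decoder."""
    def __init__(self, channels=3, latent_dim=8):
        super().__init__()
        # 3D convolutions downsample in both time (x2) and space (x8).
        self.encoder = nn.Sequential(
            nn.Conv3d(channels, 64, 3, stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(64, 128, 3, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(128, latent_dim, 3, stride=(1, 2, 2), padding=1),
        )
        # The decoder mirrors the encoder to map latents back to pixel space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_dim, 128, (3, 4, 4), stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.ConvTranspose3d(128, 64, 4, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.ConvTranspose3d(64, channels, (3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    def forward(self, video):                # video: (B, C, T, H, W)
        z = self.encoder(video)              # compact latent in time and space
        return self.decoder(z), z

vae = VideoVAE()
recon, z = vae(torch.randn(1, 3, 8, 64, 64))
print(z.shape, recon.shape)                  # latent is far smaller than the input
```

The diffusion transformer then operates entirely on these compact latents, which is what keeps minute-long videos computationally tractable.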
Text understanding
- The team reveals that they trained a descriptive captioner model to generate captions for the training videos, a re-captioning technique first introduced in the DALL·E 3 paper. This improved Sora’s text understanding and overall video quality.
- They also leveraged GPT to transform short text prompts from users into detailed captions. Rewriting prompts is almost standard practice for AI products today, bridging the gap between user instructions and model behavior (a minimal sketch of this step follows below).
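A minimal sketch of this prompt-rewriting step, using the public OpenAI chat API. The model name and system instruction here are assumptions for illustration, not the actual pipeline behind Sora.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_prompt(user_prompt: str) -> str:
    """Rewrite a short user prompt into a detailed video caption
    (a hedged sketch of the general prompt-rewriting practice)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice for illustration
        messages=[
            {"role": "system",
             "content": "Expand the user's short prompt into a richly detailed "
                        "video caption: describe the subject, setting, lighting, "
                        "camera motion, and style. Do not add unrelated content."},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

print(expand_prompt("a corgi surfing at sunset"))
```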
Training data
- Sora was trained on videos at their native aspect ratios for better composition and framing.
- People also speculate that Sora’s training involves synthetic visual data rendered from game engines. If so, synthetic data likely played an important role in Sora’s training.
Takeaways
As we can all see, Sora moves the camera smoothly, keeps objects looking consistent even when they are far away, remembers where objects are, and makes the elements in a scene appear to interact with each other and with the background.
Diffusion model wins
Sora showed that scaling laws also hold for diffusion-based video generation. By combining diffusion models with the idea of tokenization (spacetime patches), OpenAI achieved remarkable generation results.
Long general videos
A major advancement in Sora is its ability to create very long videos; the methodologies for making a 5-second video and a minute-long one are fundamentally different.
Before Sora, researchers wondered if generating long consistent videos would require methods tailored to specific categories or even with complex physics simulators. Sora’s results tell us that this can be achieved with end-to-end, general-purpose model training if done properly.
More implications
Video generation will also benefit many other machine learning tasks, such as 3D generation, autonomous driving, and robotics, and may eventually be able to simulate the physical world.
Challenges
As video generation continues to advance, the next frontier lies in tackling the issue of error accumulation and ensuring sustained quality and consistency over time.
Let’s test out how Sora performs when it launches, and hopefully we can conclude more from there!