Can OpenAI Sora Solve 3D Modeling? I Tried This…

Cindy X. L.
5 min read · Apr 29, 2024


In this video, I’ll show you how to go from video to 3D using Gaussian Splatting and share my thoughts as a machine learning researcher.

Conversion Instructions

Download a video generated by Sora.

Then go to Polycam’s website, log in, and choose the Gaussian Splatting mode.

Upload the video. Wait for a few minutes.

Note: The results are Gaussian splats, which can be converted to meshes with separate tools.

Results 1

The first view looked kinda cool, but the whole model is made of many small pieces rather than continuous surfaces.

This Japanese street model is also fragmented; it captures some details but is still far from ideal.

I think it’s because the camera moves forward in these videos, aka tracking shots. Think about algorithms like NeRF and Gaussian Splatting: they require a series of image views AROUND the subject to learn a 3D representation.
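To make "views AROUND the subject" concrete, here’s a tiny sketch of the kind of camera path these methods prefer: viewpoints orbiting the subject on an arc. The helper name and parameters are my own illustration, not any tool’s actual API.

```python
import math

def arc_camera_positions(radius=2.0, height=0.5, n_views=8, arc_degrees=360.0):
    """Place camera positions on a circular arc around a subject at the origin,
    the capture pattern NeRF and Gaussian Splatting work best with.
    (Hypothetical helper for illustration only.)"""
    positions = []
    for i in range(n_views):
        theta = math.radians(arc_degrees) * i / n_views
        positions.append((radius * math.cos(theta), height, radius * math.sin(theta)))
    return positions

# Eight evenly spaced viewpoints orbiting a subject at the origin
views = arc_camera_positions()
```

A forward tracking shot gives the opposite of this: every frame sees roughly the same side of the scene, which is why the reconstruction comes out fragmented.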

Let’s try some arc shots, where the camera rotates around the subject. We can only pick from Sora’s existing demos because the model isn’t open to the public yet.

Results 2

This one looks much more reasonable. The original video didn’t cover a full 360 degrees, so the model is limited too.

I thought this one would generate good results, but the output doesn’t make sense at all.

To answer the question: yes, as we saw, some AI videos can be converted to 3D models. Polycam likely samples frames uniformly from the input video to create the multi-view set, treating those frames as if they were all taken at the same time. It then applies Gaussian Splatting to compute an explicit 3D representation, a cloud of 3D Gaussians, from those views.
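Polycam hasn’t published its pipeline, so this is only my guess, but uniform frame sampling can be sketched in a few lines (`sample_frame_indices` is a hypothetical helper, not Polycam’s API):

```python
def sample_frame_indices(total_frames, n_views):
    """Pick n_views frame indices spread uniformly over the video,
    mimicking how a video-to-3D tool might build its multi-view set.
    (Illustrative sketch, not Polycam's actual implementation.)"""
    if n_views <= 1:
        return [0]
    if n_views >= total_frames:
        return list(range(total_frames))
    step = (total_frames - 1) / (n_views - 1)
    return [round(i * step) for i in range(n_views)]

# e.g. a 300-frame clip (10 s at 30 fps) reduced to 20 views
indices = sample_frame_indices(300, 20)
```

The sampled frames would then be fed to the splatting optimizer as if they were simultaneous photos, which is exactly the assumption discussed below.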

Quality Factors

3D consistency

An object needs to maintain its shape and color across different viewing angles and time points. If an AI-generated video does not preserve this property, it becomes hard to fuse the inconsistent observations into a single 3D object.

Staying Stationary

Remember the assumption: the views are treated as if they were taken at the same time. It’s like stitching an iPhone panorama; if things move between shots, the result can come out poorly.

Algorithm Limitations

The conversion also inherits limitations from the underlying algorithms, such as Gaussian Splatting or NeRF. Transparent subjects, for example, are hard to reconstruct.

Art Styles

If we want stylized 3D models, then the video generation pipeline needs to be able to generate certain art styles. This shouldn’t be challenging as long as we have such training data.

Video vs. 3D Generation

Can text-to-VIDEO models like Sora solve the problem of text-to-3D? From a technical perspective, there is a possibility.

Right now, the scenes we got are not that great for 3D modeling. But remember, these AI videos were made for Sora’s demo purposes, not specifically for 3D reconstruction. Also, I only used a basic Gaussian Splatting pipeline via Polycam, and there are many other strong 3D reconstruction methods. In the future, the results will look much better with new algorithms and category-specific treatments.

Another difference I want to point out: all the videos we tried generate 3D SCENES, which are different from individual 3D objects. In my previous video, I mentioned that 3D modeling is widely used in the gaming industry, where clean 3D objects with clear topology are valued. Machine learning models like Sora can’t produce those yet, so they may not fit the industry’s needs that well.

Generating 3D scenes is still a big step forward, and it will be super helpful because right now there aren’t many algorithms that generate 3D scenes from scratch, i.e., without scanning any real objects.

In practice, game devs adopt a mainly procedural approach, placing existing 3D models at calculated locations.

The new video-to-3D approach will make it easier to create 3D scenes, which is great for movies and similar productions. I’m not a camera pro, but technically, directors could generate a virtual space with AI and film with a camera inside the 3D-rendered scene.

Startup Trends

Luma allows people to scan objects with their smartphones and remix the resulting 3D models. They recently hired a researcher from Google’s VideoPoet text-to-video team. Is Luma planning something in video generation too? We’ll see.

Genmo AI, another AI startup, started with AI 3D. They soon shifted focus to text-to-video generation. Their founders are top researchers in AI 3D. I wouldn’t be surprised if one day they come back to the 3D generation space again.

If you’re interested in AI especially video and 3D generation, you can follow me, and let’s explore together! See you next time.


Written by Cindy X. L.

Tech influencer (150k on Weibo), Columbia alum. This is my tiny corner to write about AI, China tech, and creator economy. Views are my own.