AI 3D: Why Do Cool Algorithms Fail to Profit?
3D modeling is widely used in the media, gaming, and education industries. The global 3D modeling and animation market has reached $6 billion (Mordor Intelligence).
In the past year, using AI to create 3D models has gained much attention, and many GenAI companies have emerged. I led a research team of several PhDs, finished a prototype, and talked to quite a few game studios and artists.
In this video, I’m gonna walk through the technical paths and share my views on the AI 3D space as a researcher and a founder.
3D Assets
What is a 3D asset composed of?
- Mesh: A collection of vertices, edges, and faces that define the shape of a 3D object.
- Texture: An image applied to the surface of a mesh to give it color, pattern, or surface detail.
- UV Mapping: The process of projecting a 2D texture onto a 3D mesh, which allows the 2D texture to align with the 3D surface correctly.
In addition to these components, 3D modeling encompasses materials, normals, rigging, and animation.
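To make these pieces concrete, here is a minimal sketch of how a textured 3D asset might be represented in code. The class and field names are illustrative, not tied to any real engine’s API.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TexturedMesh:
    """A minimal textured 3D asset (illustrative, not a real engine API)."""
    vertices: np.ndarray   # (V, 3) float32: positions of the mesh vertices
    faces: np.ndarray      # (F, 3) int32: vertex indices forming triangles
    uvs: np.ndarray        # (V, 2) float32: per-vertex UV coordinates in [0, 1]
    texture: np.ndarray    # (H, W, 3) uint8: the texture image

    def sample_texture(self, vertex_id: int) -> np.ndarray:
        """Look up the texture color of a vertex via its UV coordinates."""
        u, v = self.uvs[vertex_id]
        h, w, _ = self.texture.shape
        # UV origin is conventionally the bottom-left corner of the image
        x = int(u * (w - 1))
        y = int((1.0 - v) * (h - 1))
        return self.texture[y, x]
```

The UV map is exactly this lookup table: it decides which pixel of the 2D texture lands on which point of the 3D surface.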
A common 3D asset, such as a house or a tree, can take an experienced artist several days to make. That’s why many founders seek AI-based solutions.
Existing Algorithms
Inspired by AI work on 2D generation, many early attempts at 3D generation adopted GANs. 3D-GAN proposed generating 3D objects from a probabilistic latent space.
Similarly, when Contrastive Language-Image Pre-Training (CLIP) was proposed for joint text-image embedding in 2021, CLIP-based work on 2D generation naturally inspired explorations in 3D.
Then, DreamFields proposed text-guided 3D generation using a Neural Radiance Field (NeRF), an inverse rendering approach that creates 3D models from a set of 2D images.
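For context, NeRF renders each pixel by integrating color and density along a camera ray. In the notation of the original NeRF paper, the volume rendering equation is:

$$
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad T(t) = \exp\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
$$

Here $\sigma$ is the learned density field, $\mathbf{c}$ the view-dependent color, and $T(t)$ the accumulated transmittance along ray $\mathbf{r}$. Optimizing $\sigma$ and $\mathbf{c}$ so that rendered pixels match the training photos is what makes this “inverse rendering.”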
In 2022, DreamFusion first proposed using a pre-trained 2D text-to-image model for text-guided 3D generation through differentiable rendering. Their core method, Score Distillation Sampling (SDS), samples diffusion timesteps uniformly and uses the pre-trained diffusion model’s denoising predictions to obtain gradients that pull the rendered views toward a given text prompt.
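Up to notation, the SDS gradient from the DreamFusion paper is:

$$
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}\big(\phi, \mathbf{x} = g(\theta)\big)
= \mathbb{E}_{t,\boldsymbol{\epsilon}}\left[ w(t)\,\big(\hat{\boldsymbol{\epsilon}}_{\phi}(\mathbf{z}_t;\, y,\, t) - \boldsymbol{\epsilon}\big)\,\frac{\partial \mathbf{x}}{\partial \theta} \right]
$$

Here $\theta$ parameterizes the 3D scene, $g$ is a differentiable renderer, $t$ is a uniformly sampled timestep, $\mathbf{z}_t$ is the noised rendering, and $\hat{\boldsymbol{\epsilon}}_{\phi}$ is the noise predicted by the frozen 2D diffusion model given text prompt $y$. The residual between predicted and injected noise is pushed back into $\theta$.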
One of the most influential papers in AI 3D, DreamFusion inspired many SDS-based improvements. Magic3D further improved the quality and speed of 3D modeling with a two-stage approach. Fantasia3D disentangles geometry and appearance generation, both supervised by the SDS loss.
My research team proposed EucliDreamer: in the SDS training loop, depth information is taken into account in addition to color. Through experiments and a user study, we showed that adding Stable Diffusion depth conditioning to SDS training can greatly improve the quality and speed of 3D texturing. A simplified sketch of such a loop follows.
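The sketch below shows the general shape of a depth-conditioned SDS texturing loop. All callables passed in are hypothetical placeholders (a differentiable renderer and a depth-conditioned denoiser such as Stable Diffusion’s depth variant); this is my illustration, not EucliDreamer’s actual code.

```python
import torch

def texture_with_depth_sds(render_color_and_depth, depth_denoiser,
                           sample_camera, texture_params,
                           prompt_embedding, n_steps=2000, lr=1e-2):
    """Sketch of a depth-conditioned SDS texturing loop (illustrative only).

    render_color_and_depth: differentiably renders (image, depth) from a camera.
    depth_denoiser: diffusion model conditioned on text and a depth map.
    Both are hypothetical placeholders, not EucliDreamer's actual code.
    """
    optimizer = torch.optim.Adam([texture_params], lr=lr)
    for _ in range(n_steps):
        camera = sample_camera()  # random viewpoint each iteration
        image, depth = render_color_and_depth(texture_params, camera)

        # Standard diffusion forward process at a random timestep
        t = torch.randint(20, 980, (1,))
        noise = torch.randn_like(image)
        alpha_bar = torch.cos(t.float() / 1000.0 * torch.pi / 2) ** 2  # toy schedule
        noisy = alpha_bar.sqrt() * image + (1.0 - alpha_bar).sqrt() * noise

        # Key idea: condition denoising on the rendered depth map so the
        # predicted noise respects the mesh geometry.
        with torch.no_grad():
            pred_noise = depth_denoiser(noisy, t, prompt_embedding, depth)

        # SDS: treat the denoising residual as a gradient on the rendered
        # image and backpropagate it into the texture parameters.
        image.backward(gradient=pred_noise - noise)
        optimizer.step()
        optimizer.zero_grad()
```

Because the denoiser sees the true depth of the mesh at every step, it has far less room to hallucinate geometry that contradicts the surface being textured.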
Market Insights (Surprising!)
Addressable market and gaming
From a market perspective, the biggest market for AI 3D is gaming. In 2022, the global gaming industry’s revenue was approximately $200 billion. Since 3D modeling occupies a considerable share of game development budgets, the global market for producing 3D game assets can be estimated at over $10 billion.
In contrast, other industries are either too small, like film with only about $30B addressable, or rely on 2D formats, like architectural visualization.
Gaps between academia and industry
In the context of AI applications in 3D creation, there are technical gaps between academia and industry.
Today, the majority of research in 3D generation focuses on generating 3D models end-to-end: meshes plus textures. But AI still struggles to produce clean topology, and many downstream tasks require an understanding of the 3D object’s structure, so the end product is often unusable and hard for humans to edit further. AI texturing alone, without generating meshes and UV maps, seems a more achievable goal in terms of commercialization.
Moreover, instead of generating 3D models purely from text prompts, gaming companies prefer converting images to 3D models. This is because they can use concept art, the visual representations of characters, scenes, or objects that studios already produce, as direct input for creating detailed 3D models.
On top of that, art style matters too. Game studios want all the 3D models in a product to share the same art style, and they pay remarkably high amounts to commission custom 3D models despite the abundant choices on marketplaces like the Unity Asset Store and Sketchfab. From a technical perspective, we can use LoRA, fine-tuning, or prompting to enforce a consistent art style, as sketched below.
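For instance, with Hugging Face’s diffusers library, a LoRA fine-tuned on a studio’s house style can be loaded into a Stable Diffusion pipeline. The LoRA repository name and trigger phrase below are hypothetical placeholders.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a base Stable Diffusion checkpoint
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Apply a LoRA fine-tuned on the studio's art style
# ("your-studio/house-style-lora" is a hypothetical placeholder)
pipe.load_lora_weights("your-studio/house-style-lora")

# Reusing the same trigger phrase in every prompt keeps outputs on-style
image = pipe(
    "a medieval wooden house, house-style concept art",
    num_inference_steps=30,
).images[0]
image.save("house_concept.png")
```

Once the 2D model is locked to one style this way, every view and texture generated from it inherits that style, which is exactly what studios are asking for.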
Technical Analysis
Common issues with AI texturing
- Baked-in light reflections and shadows. Textures should generally not contain lighting information before rendering, since the engine adds its own lights; a texture with heavy baked-in shadows becomes unusable.
- Wrong semantics. If the 3D mesh represents a car, the corresponding texture should give the tires a dark rubber color and not cover the window with colorful patterns.
- View inconsistency. Because popular texturing methods derive multiple views from a single-view image, the resulting 3D object may show different content or color themes from different angles.
- Bad color tones. Sometimes a 3D texture does not fall into any of the prior categories, but the appearance is still subjectively bad. This usually happens when the texture violates basic color theory.
Technical challenges
Data. 3D modeling data is scarce by nature. Open-source datasets like Objaverse and ShapeNet contain a lot of low-quality data. In my opinion, scanned models are also bad for training because their structures and colors are not clean. Handmade stylized 3D models, on the other hand, are owned by private companies as core art assets. Even if those companies agreed to provide them as training data, a quantity in the tens of thousands is still too small.
Algorithms. Data scarcity puts pressure on algorithms to use existing datasets efficiently. Currently, startups fall into two camps by technical approach. Some leverage large 2D diffusion models like Stable Diffusion, adopting pipelines that lift 2D views into 3D models, while others base both training and inference directly on 3D data. The latter should yield better results, but only with a large amount of data, which I estimate to be more than 100k high-quality objects.
Inference time. Right now, an object can take 10–20 minutes to generate, depending on the mesh and the GPU. For commercial products, it’s crucial to reduce the wait time to under one minute for end users. My team estimates that more efficient algorithms will cut inference time to as little as 10 seconds within two years.
Market Landscape
Active startups
- Vast: believes in 3D foundation models and incorporates a lot of unique 3D data for training. Although its texture generation is still far from usable in my games, the Tripo model is one of the best I’ve seen after trying a bunch.
- Meshy: aims to empower game studios and has an active community of talented individuals. The team executes well and has many insights.
- Luma: lets people scan objects using their smartphones and remix them. The company has raised $43 million and runs a compute cluster of ~3,000 Nvidia A100 GPUs. It’s kinda cool that you can use the app to scan your surroundings and save moments in 3D. This super resourceful team has no plans to tailor products specifically to gaming.
- CSM: is said to be good at mesh generation, potentially for 3D printing and gaming. Founded in 2020, they have raised $10 million from VCs.
Major projects by leading tech companies
- OpenAI Shap-E: I think it was trained mostly on scanned data, and it is only used by a few.
- Nvidia Picasso: still super early, in the lab stage.
Startups that pivoted
Many startups have pivoted away from the AI 3D space in the past year or two.
- WithPoly: initially built AI tools for texture patterns, a slightly different and easier problem than making full textures. Backed by Y Combinator, they pivoted last year to smart cloud storage.
- Scenario: started with AI 3D and pivoted to generating 2D assets and other components for games.
- Genmo AI: started with AI 3D too, founded by top researchers in AI 3D, and later shifted focus to text-to-video generation.
Why Do AI 3D Startups Struggle?
It’s almost a consensus in the startup community that AI 3D technology is not yet ready to generate ready-to-use 3D assets.
If we back up a bit, would it be possible to build a copilot? That is, what if AI finishes 80% of the work and human labor completes the remaining 20%? Unfortunately, the answer is no. AI-generated 3D models are too rough, and it takes more time to fix them than to make a new one from scratch manually.
From my own experience, AI-generated 3D models can be useful at the brainstorming or marketing stages of a game. If game studios want fast prototypes of ideas, or assets in various art styles for quick A/B tests in their ads, AI can help them fast and cheap. However, companies won’t be willing to pay a lot for these use cases.
Are there hacks, or even crude methods, that can work? Sloyd.ai is one example: it retrieves relevant 3D models and alters their appearance through parameters. Kaedim3D is another example that, at least at some point, used pure human labor to generate 3D assets while claiming they were generated by AI. It now states that it uses both humans and AI to ship 3D models within 20 minutes at an affordable cost, and it just closed a $15 million round.
Besides B2B, how is AI 3D doing in the consumer market? It would be cool to create and interact with 3D scenes and objects through VR devices, and 2024 was expected to be a big year for VR and AR as Apple launched the Vision Pro in early February. However, within two weeks of the launch, Bloomberg reported that stores were seeing demo-to-purchase conversion rates of only 10% to 15%.
In my opinion, the ultimate killer is yet to come: the technology of text-to-video generation. A few weeks ago, OpenAI announced the video generation model Sora, which achieves amazing 3D consistency, meaning consistent geometry and color of an object across different camera angles. If we recall how 3D assets are generated from multi-view images, it follows that video generation models will eventually be able to create 3D assets, and that day is likely to come sooner rather than later.
Conclusion
To wrap up, I’d like to share two points regarding the evolving landscape of AI 3D.
Firstly, despite the impressive capabilities of AI algorithms, it’s crucial to recognize that AI will not replace human creativity. The depth and nuance of human artistic expression, derived from our experiences and emotions, remain unique to humans.
Secondly, for startups navigating this innovative domain, the emphasis should be on achieving product-market fit rather than solely focusing on technological novelty. Success in this field hinges on creating solutions that resonate with market demands.