Three Challenges of Training Large Models: A Tech Founder’s Perspective

Cindy X. L.
2 min read · Feb 29, 2024


Speaking from personal experience, when training large models it is tricky to collect data, find the right methodology, and benchmark the results.

Collect Data

First of all, training large models requires massive amounts of data, which are generally obtained in several ways:

  • Original data from your own business (usually not enough)
  • Partner data (are they willing to share?)
  • Purchased data (sky-high prices, and often simply no market)
  • Data collected from the community (poor quality)
  • Data generated in batches (not realistic enough)

All of this data requires manual cleaning and additional annotation, and at this volume you need specialized annotation tools. How you understand and balance the distribution across the different data sources also affects the final results.
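
To make the balancing point concrete, here is a minimal Python sketch of what source-level mixing and basic cleaning might look like. The source names, weights, and the `basic_clean` helper are hypothetical placeholders for illustration, not a recipe I'm endorsing:

```python
import random

# Hypothetical sampling weights per data source; the numbers are purely
# illustrative, not a recommended mix.
SOURCE_WEIGHTS = {
    "own_business": 0.35,   # original data from your own business
    "partner": 0.25,        # data shared by partners
    "purchased": 0.10,      # purchased datasets
    "community": 0.20,      # community-collected data (after heavy cleaning)
    "synthetic": 0.10,      # data generated in batches
}

def sample_source(weights=SOURCE_WEIGHTS):
    """Pick the data source for the next training example, proportional to its weight."""
    sources, probs = zip(*weights.items())
    return random.choices(sources, weights=probs, k=1)[0]

def basic_clean(example):
    """Drop obviously unusable records; a real pipeline also needs dedup, PII filtering, etc."""
    text = (example.get("text") or "").strip()
    if len(text) < 20:  # too short to carry any signal
        return None
    return {**example, "text": text}

# Usage: draw a source, fetch a record from it, clean it, feed it to training.
print(sample_source())
print(basic_clean({"text": "  A sufficiently long example document...  "}))
```

Even a toy mix like this makes the trade-off visible: bump one source's weight up and you are implicitly betting on its quality and coverage over everything else.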

Find the Methodology

Many non-tech people think that training large models is like cooking: if you have the raw materials and know the recipe, you only need to stir the pot a few times and dinner is ready.

In reality, the training process is more like a treasure hunt in the jungle: top talent is the compass, and computing power is the shovel. You have to keep trying paths and correcting course at every fork to find the optimal combination of algorithms and data.

History has shown that even a company as powerful as Google sometimes bets on a technology path that fails to win. For example, after OpenAI’s Sora came out, other text-to-video teams took notice and quickly switched to the DiT architecture.

Because large models only exhibit emergent capabilities once they reach a certain scale, you cannot cheaply validate ideas with experiments on small models. Each training run costs a ton of money.

My personal estimate: knowing the general technical direction and then training on the same data can save 60–90% of training costs. There is still plenty of room for error in the details, though, such as data sources, model hyperparameter choices, and pre-processing caveats.

Evaluate/Benchmark

I’ll skip the part about how today’s LLMs game the leaderboards to improve their rankings. Even internally, there is no particularly good way to quantitatively evaluate generative models. Many research papers still rely solely on user studies, judging results with the naked eye (yes, I’m calling myself out here).
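
If you do run user studies, you can at least quantify them. Below is a minimal sketch, with made-up judgment data, of turning pairwise human preferences into a win rate plus a rough bootstrap confidence interval; the variable names and numbers are assumptions for illustration only:

```python
import random
from collections import Counter

# Hypothetical pairwise judgments from a user study: for each prompt, a rater
# preferred model "A", model "B", or called it a "tie".
judgments = ["A", "A", "B", "tie", "A", "B", "A", "tie", "A", "B"]

def win_rate(votes, model="A"):
    """Fraction of non-tie comparisons won by `model`."""
    counts = Counter(votes)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        return 0.5  # treat an all-tie sample as a coin flip
    return counts[model] / decided

def bootstrap_ci(votes, model="A", n_resamples=1000, alpha=0.05):
    """Rough bootstrap confidence interval for the win rate."""
    stats = sorted(
        win_rate(random.choices(votes, k=len(votes)), model)
        for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci(judgments)
print(f"win rate for A: {win_rate(judgments):.2f}, 95% CI ~ ({lo:.2f}, {hi:.2f})")
```

It is not a replacement for proper benchmarks, but it beats eyeballing a handful of samples and declaring victory.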

I remember that last year a YC-backed startup specialized in model evaluation solutions. I don’t know how that’s going now.

To sum up: yup, it’s hard even for big tech. Let’s see who wins the war.

Written by Cindy X. L.

Tech influencer (150k on Weibo), Columbia alum. This is my tiny corner to write about AI, China tech, and creator economy. Views are my own.
