Nvidia’s new AI can build LLMs cheaper! Upcycling MoE explained
In this technical interview on YouTube, Ethan, a research engineer at NVIDIA, breaks down the fascinating world of Mixture of Experts (MoE) models and discusses his team's paper on model upcycling. Learn how MoE is changing the way we build and scale large language models!
Self-Intro
Cindy: Can you give some intro about what you do at Nvidia?
Ethan: Hi, I’m Ethan. I’m a deep learning engineer at Nvidia, working on large language model training and scaling transformers to large scale, including Mixture of Experts.
What is MoE
Cindy: Some of us might not be familiar with the term Mixture of Experts, so do you mind giving us a quick overview?
Ethan: You can think of it as a single model, but it combines several models together. Depending on what your question is, it selectively activates some of the experts to answer your question.
You may have heard of the LLaMA models or GPT models already; most of these are dense models. Nearly all of the neurons are activated during inference, which is why they're called dense. Mixture of Experts models, on the other hand, are sparsely activated: whenever you give them a query, only part of the model is activated.
Cindy: What’s the algorithm usually used to decide which expert to activate?
Ethan: The component that selects the experts is called a router. The router is simply a very small neural network that decides which experts to activate during inference. Based on your input, it generates a probability distribution over the experts and then selects the top-K experts to answer the question.
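To make the routing step concrete, here is a minimal sketch of a top-K router in PyTorch. It assumes a simple linear gate over each token's representation; the names and shapes are illustrative, not taken from Megatron-Core.

```python
import torch
import torch.nn.functional as F

class TopKRouter(torch.nn.Module):
    """Minimal top-K router sketch: scores each token and keeps the K best experts."""

    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # The router itself is just a tiny linear layer over the token representation.
        self.gate = torch.nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, tokens: torch.Tensor):
        # tokens: [num_tokens, hidden_size]
        logits = self.gate(tokens)                 # [num_tokens, num_experts]
        probs = F.softmax(logits, dim=-1)          # probability distribution over experts
        weights, expert_ids = probs.topk(self.top_k, dim=-1)
        # Renormalize so the selected experts' weights sum to 1 for each token.
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return weights, expert_ids                 # which experts process each token, and how much they count
```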
Cindy: Each expert will take care of a different field or a different aspect of the model?
Ethan: Not exactly. Some people have tried to interpret whether each expert learns something different, but neural networks are known to store knowledge in an entangled way, so it's very hard to tell which field a given expert is learning. It's not like one expert studies math and another studies art. There are some basic patterns, though: some experts focus on certain kinds of token generalizations, and some focus on cross-token relationships.
Cindy: Cross tokens?
Ethan: Yeah, like the relationship between tokens.
Advantages
Cindy: So what’s the main advantage compared to, I guess, traditional methods to build LLMs?
Ethan: Mixture of Experts allows scaling models to the next level. Take the LLaMA models: the biggest dense model they've scaled to is 405B. If you want to go beyond that, you either need a lot more GPUs, or you can use a Mixture of Experts. You can scale that same LLaMA 405B up to something like 2 trillion parameters without significantly increasing the compute.
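As a rough back-of-the-envelope illustration of that claim (all numbers below are assumptions for the example, not figures from the paper or from LLaMA): replicating the FFN layers into experts grows the stored parameter count with the number of experts, while the per-token compute only grows with top-K.

```python
# Illustrative parameter accounting for upcycling a dense model into an MoE.
# All numbers are rough assumptions for the sake of the example.
dense_params = 405e9          # dense model size (e.g. a 405B model)
ffn_fraction = 0.65           # assume ~65% of the parameters sit in the FFN layers
num_experts, top_k = 8, 2     # duplicate each FFN into 8 experts, route each token to 2

ffn = dense_params * ffn_fraction
other = dense_params - ffn

total_params  = other + ffn * num_experts   # what you have to store
active_params = other + ffn * top_k         # what each token actually uses

print(f"total:  {total_params/1e12:.2f}T parameters")
print(f"active: {active_params/1e12:.2f}T parameters per token")
# In this toy setup the stored size grows ~5.5x while per-token compute grows only ~1.65x.
```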
Cindy: Ah, I see. So who in the industry needs this kind of upcycling, or building larger LLMs with MoE?
Ethan: Hard to say… almost everyone will need it. It's becoming an essential technique, whether you're a small company or a big one. Bigger companies like OpenAI are already serving GPT-4, which is rumored to be an MoE model; MoE is much more efficient at large scale because the model is sparsely activated, so you save money. And smaller companies save money by upcycling rather than training a dense model from scratch. I'd say this will become a standard for building large language models.
The paper
Cindy: We’re gonna dive deeper into the paper you and your team have just published. What kind of problem is your paper trying to solve?
Ethan: Yeah, we are trying to solve upcycling at a large scale. Upcycling means that instead of training an MoE from scratch, you convert an existing dense model into an MoE and then train it for a few thousand to a hundred thousand more iterations to get a better model.
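Conceptually, upcycling a dense FFN into an MoE layer looks something like the sketch below: copy the trained FFN weights into every expert, add a freshly initialized router, and continue training. This is a hedged illustration of the idea in PyTorch, not the actual Megatron-Core code.

```python
import copy
import torch

def upcycle_ffn(dense_ffn: torch.nn.Module, hidden_size: int, num_experts: int = 8):
    """Turn one dense FFN block into an MoE layer by copying its weights into every expert."""
    # Each expert starts as an exact copy of the trained dense FFN,
    # so the upcycled model begins from roughly the dense model's quality.
    experts = torch.nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])
    # The router is new and randomly initialized; it is learned during continued training.
    router = torch.nn.Linear(hidden_size, num_experts, bias=False)
    return router, experts
```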
Cindy: Is your paper going to be productionized?
Ethan: It’s already open source in Megatron-Core, so we have the functionality to upcycle dense language models into bigger MoE models. You can find it on GitHub.
Cindy: So if someone wants to train an MoE with upcycling today, they can go to the GitHub repo, right?
Ethan: Exactly.
Technical details
Cindy: It’s awesome. Now I think it’s a good time to dive into more technical details of the paper.
You experimented with different hyperparameters such as learning rate and batch size, and you found, interestingly, that MoEs need a lower learning rate and a larger batch size?
Ethan: This is because the experts are sparsely activated. Each expert only receives a fraction of the tokens, so it effectively sees a smaller batch during training. That’s why you need a larger global batch size.
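A toy calculation makes the effect visible (the numbers are made up for illustration):

```python
# Toy illustration of why MoEs want a larger global batch (numbers are invented).
global_batch_tokens = 4_000_000   # tokens per optimization step
num_experts, top_k = 8, 2

# With a well-balanced router, each token is seen by top_k of the num_experts experts,
# so each expert only trains on a fraction of the batch.
tokens_per_expert = global_batch_tokens * top_k / num_experts
print(f"each expert sees ~{tokens_per_expert:,.0f} of {global_batch_tokens:,} tokens")
# -> each expert sees ~1,000,000 of 4,000,000 tokens, i.e. a 4x smaller effective batch
```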
Cindy: How about load balancing? How does it factor into the experiments?
Ethan: Load balancing ensures that the MoE distributes tokens evenly among its experts. We found that less load balancing is better: you want the model to learn how to distribute the tokens by itself. You only need enough load balancing to make sure there are no dead experts that receive no tokens.
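For reference, a common way to implement load balancing is an auxiliary loss in the style of the Switch Transformer, added to the language-model loss with a small coefficient. The sketch below shows that general pattern; it is not necessarily the exact formulation used in the paper.

```python
import torch

def load_balancing_loss(router_probs: torch.Tensor, expert_ids: torch.Tensor, num_experts: int):
    """Switch-Transformer-style auxiliary loss: encourages tokens to spread across experts.

    router_probs: [num_tokens, num_experts] softmax output of the router
    expert_ids:   [num_tokens, top_k] experts actually selected for each token
    """
    # f_i: fraction of tokens dispatched to expert i
    counts = torch.bincount(expert_ids.flatten(), minlength=num_experts)
    f = counts.to(router_probs.dtype) / expert_ids.numel()
    # P_i: mean router probability assigned to expert i
    p = router_probs.mean(dim=0)
    # Minimized when both are uniform; scaled by num_experts so the optimum is ~1.
    return num_experts * torch.sum(f * p)
```

In line with the finding above, the coefficient on this term would be kept small: just enough to avoid dead experts while letting the router learn its own token distribution.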
Cindy: In your paper, you also mention the word granularity. Can you explain what it is and how it affects the training results?
Ethan: Granularity is basically segmenting the experts into smaller ones, which creates more experts than a typical 8-expert scheme would have. It provides more flexibility and better accuracy.
Cindy: I see. So if we have more experts, does that mean the model will perform better?
Ethan: Yes, but only up to a limit. We have tested up to 200 experts. It still helps, but beyond that it becomes difficult for the model to learn which experts to route to.
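One way to picture granularity (with made-up sizes, not the paper's configuration): split each expert into finer pieces and route each token to proportionally more of them, so total expert capacity and per-token compute stay roughly constant while the router gains many more ways to combine experts.

```python
# Toy illustration of expert granularity (sizes are invented for the example).
hidden_size = 8192
ffn_size = 28672            # intermediate size of one "standard" expert

# Baseline: 8 coarse experts, top-2 routing.
coarse = dict(num_experts=8, expert_ffn=ffn_size, top_k=2)

# Finer granularity: split every expert in half -> 16 smaller experts, top-4 routing.
# Total expert parameters and per-token compute stay roughly the same,
# but the router now has far more possible expert combinations per token.
fine = dict(num_experts=16, expert_ffn=ffn_size // 2, top_k=4)

def expert_params(cfg):
    # Two weight matrices per FFN expert (up and down projection), biases ignored.
    return cfg["num_experts"] * 2 * hidden_size * cfg["expert_ffn"]

assert expert_params(coarse) == expert_params(fine)   # same total expert capacity
```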
Cindy: In your paper, you also mention the concept of shared experts. What does it mean?
Ethan: Yeah, a shared expert is an expert that is always activated. Normally, experts are chosen based on the router probability: if an expert's probability is among the highest, it gets chosen; if it's lower, it doesn't. But shared experts are always active. They're meant to learn what's common across different domains.
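A hedged sketch of how shared experts fit into the forward pass: every token goes through the shared experts unconditionally, plus its top-K routed experts. This follows the general pattern of shared-expert MoE designs rather than any specific codebase, and it reuses the hypothetical router sketched earlier.

```python
import torch

def moe_layer_forward(tokens, router, routed_experts, shared_experts):
    """Sketch: every token goes through the shared experts plus its top-K routed experts."""
    weights, expert_ids = router(tokens)          # top-K routing as sketched earlier
    out = torch.zeros_like(tokens)

    # Shared experts: always active for every token, no routing decision involved.
    for expert in shared_experts:
        out = out + expert(tokens)

    # Routed experts: each token only passes through the few experts chosen for it.
    for k in range(expert_ids.shape[1]):
        for e, expert in enumerate(routed_experts):
            mask = expert_ids[:, k] == e
            if mask.any():
                out[mask] = out[mask] + weights[mask, k:k + 1] * expert(tokens[mask])
    return out
```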
The experience
Cindy: Is there something that you find surprising during experiments?
Ethan: The interesting thing we found is that MoEs are pretty hard to train efficiently, especially at a large scale. You need to do a lot of engineering optimization to train these models efficiently beyond, say, 7B or 16B parameters.
Cindy: By the way, how many GPUs do you need to conduct those experiments?
Ethan: In this paper, we used three petaflops. It translates to roughly 500 GPUs for a week.
Limitations
Cindy: We've talked about a lot of the good things about upcycling MoE. So are there some limitations or drawbacks?
Ethan: The drawback is the flip side of the benefit, I think. Upcycling applies when you already have a good dense model that you want to adapt into an MoE. If you don't have a good dense model, the technique is of no use. For example, there's no strong existing Mamba model right now, so it's hard to apply upcycling to Mamba.
The future
Cindy: What do you look forward to in MoE research?
Ethan: I look forward to larger and larger models. The biggest model rumored so far is the 1.8-trillion-parameter GPT-4 MoE. I think there can be larger ones. Let's see what comes in the future.
📚 Paper discussed: “Upcycling Large Language Models into Mixture of Experts”, available at https://arxiv.org/abs/2410.07524.
💻 Available in NVIDIA’s Megatron-Core repository: https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/transformer/moe