AnimationGPT, an AI to Generate Actions for Game Characters
I recently came across AnimationGPT, an AI that generates combat-style character animations from text input.
For example, we can type a text instruction like this: “The character leans to the right, twists their waist… and firmly grips the knives with both hands.” We click generate, wait about 30 seconds, and it gives us this animation.
Motivation
Why do we want to generate such game-specific animations?
While capturing human motions is an important research topic, the most popular dataset today, HumanML3D, only contains 14k actions, and most of them are daily actions like jogging or waving hands.
Hey, but we love GAMES!
If our game characters have a wide variety of actions, the game becomes much more immersive and fun to play. Machine learning can make this happen: it can simply read a text input and then generate many different animations.
For those who aren't familiar, the BVH format stores motion data: it describes how a character should move, from arms to legs, every part of the body. We can import such files into 3D modeling software like Blender, and the format is widely used in the game industry to animate characters.
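To make the format a bit more concrete, here is a minimal Python sketch (standard library only; the file name is a placeholder, not a real file from the dataset) that reads a BVH file and reports its joint names and frame count.

```python
# Minimal BVH inspector; "attack_01.bvh" is a placeholder path.
def inspect_bvh(path):
    joints, num_frames, frame_time = [], 0, 0.0
    with open(path) as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue
            if tokens[0] in ("ROOT", "JOINT"):        # skeleton definition (HIERARCHY)
                joints.append(tokens[1])
            elif tokens[0] == "Frames:":              # start of the MOTION section
                num_frames = int(tokens[1])
            elif line.strip().startswith("Frame Time:"):
                frame_time = float(tokens[2])
    return joints, num_frames, frame_time

joints, num_frames, frame_time = inspect_bvh("attack_01.bvh")
fps = round(1.0 / frame_time) if frame_time else 0
print(f"{len(joints)} joints, {num_frames} frames at {fps} fps")
```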
Training Data
In order to train such an AI to generate these animations, we need training data. We need a lot of examples of character animations and the corresponding text descriptions.
To prepare the training data, the AnimationGPT team extracts high-quality actions from places like Mixamo and labels them with 7 attributes.
- Motion Type: describes the specific nature or style of the motion, such as pose, guard, ride, swimming, and falling.
- Weapon Type: indicates the type of weapon or object used in the action, such as bare hand, fist, claw, spear, dagger, and thrusting.
- Attack Type: provides more details about the attack, such as left-handed, right-handed, two-handed, heavy attack, and run attack.
- Direction: describes the direction or trajectory of the action, such as in place, towards left, towards right, forward, and left front.
- Power Descriptor: characterizes the amount of force or power involved in the motion, such as light-weight, steady, powerless, and powerful.
- Speed Descriptor: indicates the velocity or speed of the motion, such as swift, relatively fast, uniform speed, slow, or first slow then fast.
- Fuzzy Descriptor: captures the general nature or type of the motion or attack, such as piercing, slash, blunt, and unarmed attack.
Note that a single action can have multiple labels for the same attribute.
For a clip like this, here is the labeling process. Human labelers first give one or two labels for each of the 7 attributes we just introduced. For example, this clip gets "weapon attack" for the motion type and "relatively fast" for the speed descriptor. This makes sense: we see the character swinging a sword-like weapon, and the action is indeed fast.
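Putting it together, one labeled clip might end up looking something like this rough illustration (the field names are mine, not the team's actual schema), with lists allowing multiple labels per attribute:

```python
# Hypothetical record for one labeled clip; the field names are illustrative,
# not the AnimationGPT team's actual schema.
labeled_clip = {
    "file": "sword_attack_017.bvh",                 # placeholder file name
    "motion_type": ["weapon attack"],
    "weapon_type": ["dagger"],
    "attack_type": ["right-handed", "run attack"],  # multiple labels for one attribute
    "direction": ["forward"],
    "power_descriptor": ["powerful"],
    "speed_descriptor": ["relatively fast"],
    "fuzzy_descriptor": ["slash"],
}
```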
Then the team uses large language models like GPT to write a few different text descriptions and picks their favorite as the caption for the clip.
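Here is a small, hypothetical sketch of that captioning step; the prompt wording and the helper function are my assumptions, not the team's actual pipeline.

```python
# Hypothetical captioning helper: turn attribute labels into an LLM prompt.
def build_caption_prompt(labels):
    attributes = "; ".join(
        f"{name.replace('_', ' ')}: {', '.join(values)}"
        for name, values in labels.items()
    )
    return (
        "Write three short, natural descriptions of a game character animation "
        f"with these attributes: {attributes}. "
        "Describe the body movement itself, not the labels."
    )

prompt = build_caption_prompt({
    "motion_type": ["weapon attack"],
    "speed_descriptor": ["relatively fast"],
    "fuzzy_descriptor": ["slash"],
})
# The prompt is sent to an LLM such as GPT, and a human picks the best of the
# returned candidate descriptions as the clip's caption.
print(prompt)
```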
After lots of labeling work, the team obtained a dataset of 15k actions and their corresponding text descriptions, which is pretty nice. [GitHub]
Model Selection
The next step is to pick a machine-learning model. The three papers below all tackle text-to-motion generation and can serve as good references for us.
The papers proposed the following methods:
- Motion Diffusion Model;
- Motion Latent Diffusion Model; and
- Motion GPT, respectively.
Moving from the Motion Diffusion Model to the Motion Latent Diffusion Model, the latter adds a VAE, a type of machine-learning structure that compresses actions into a compact latent space, making them easier to understand and generate. Moving from Motion Latent Diffusion to MotionGPT, we keep the VAE idea and replace the rest with a stronger language model.
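To give a rough feel for what that VAE does, here is a minimal PyTorch sketch (my own simplification with made-up dimensions, not the architecture from either paper): it squeezes a motion clip into a small latent vector and reconstructs the clip from it, and the generative model then works in that latent space instead of on raw poses.

```python
import torch
import torch.nn as nn

# Minimal motion-VAE sketch; dimensions are made up for illustration.
class MotionVAE(nn.Module):
    def __init__(self, seq_len=60, pose_dim=66, latent_dim=64):
        super().__init__()
        in_dim = seq_len * pose_dim
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 512), nn.ReLU())
        self.to_mu = nn.Linear(512, latent_dim)
        self.to_logvar = nn.Linear(512, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                     nn.Linear(512, in_dim))
        self.seq_len, self.pose_dim = seq_len, pose_dim

    def forward(self, motion):                   # motion: (batch, seq_len, pose_dim)
        h = self.encoder(motion)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        recon = self.decoder(z).view(-1, self.seq_len, self.pose_dim)
        return recon, mu, logvar

vae = MotionVAE()
clip = torch.randn(4, 60, 66)                    # a batch of made-up motion clips
recon, mu, logvar = vae(clip)                    # generation then happens in z-space
```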
Why do we get rid of diffusion models here? Diffusion models are a type of machine-learning structure that generates new samples by gradually corrupting the training data into pure noise and then learning to reverse that process to recreate the original data.
Diffusion models are great for image generation because they excel at randomized diversity and feature extraction. But not for MOTION generation: actions are constrained by the physical limitations of the human body, so they do not have that much diversity, and there are not many features to extract.
MotionGPT adopts T5, an encoder-decoder language model. T5 is good at understanding text prompts while avoiding diffusion's excessive randomness.
After experimenting with different structures, the AnimationGPT team picked MotionGPT.
It has two main components: a motion tokenizer and a motion-aware language model. The motion tokenizer first encodes a continuous sequence of actions into discrete motion tokens, and the motion-aware language model then learns the semantic relations between these motion tokens and text tokens.
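To make the two-stage idea concrete, here is a heavily simplified Python sketch (my own, not the actual MotionGPT code): a vector-quantization-style tokenizer maps motion frames to discrete token ids, which a T5-style language model can then treat just like words.

```python
import torch
import torch.nn as nn

# Simplified motion tokenizer: continuous frames in, discrete token ids out.
class MotionTokenizer(nn.Module):
    def __init__(self, pose_dim=66, codebook_size=512, code_dim=128):
        super().__init__()
        self.encode_frame = nn.Linear(pose_dim, code_dim)
        self.codebook = nn.Embedding(codebook_size, code_dim)
        self.decode_frame = nn.Linear(code_dim, pose_dim)

    def encode(self, motion):                     # motion: (seq_len, pose_dim)
        z = self.encode_frame(motion)             # (seq_len, code_dim)
        # Vector quantization: snap each frame embedding to its nearest code.
        dists = (z.unsqueeze(1) - self.codebook.weight.unsqueeze(0)).pow(2).sum(-1)
        return dists.argmin(dim=-1)               # discrete token ids, (seq_len,)

    def decode(self, token_ids):                  # token ids back to motion frames
        return self.decode_frame(self.codebook(token_ids))

tokenizer = MotionTokenizer()
motion = torch.randn(60, 66)                      # 60 frames of made-up features
tokens = tokenizer.encode(motion)                 # e.g. tensor([417, 23, 23, ...])
recon = tokenizer.decode(tokens)                  # back to a (60, 66) sequence
# Stage 2 (not shown): a T5-style language model is trained on mixed sequences
# of text tokens and these motion tokens, so "text in, motion tokens out"
# becomes an ordinary sequence-to-sequence problem.
```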
Evaluation
An ideal action generation model should have the following characteristics:
- matches the description of the input text;
- is close to the distribution of the training data;
- has diversity; and
- follows the laws of physics.
We have some quantitative metrics to measure how well the model generates actions.
FID, or Fréchet Inception Distance, measures the similarity between the distribution of generated actions and the distribution of real actions; the smaller the value, the closer the generated results are to the real ones. The Matching Score is the average Euclidean distance between each generated action's feature and the feature of its input text; a smaller Matching Score is better because it means the animation matches the input text more closely.
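For the curious, here is a small sketch of how these two metrics are typically computed once you have feature vectors for the motions and texts (these follow the standard definitions, not code from the AnimationGPT repo; the feature extractor is assumed, and the random features are placeholders).

```python
import numpy as np
from scipy import linalg

def fid(real_feats, gen_feats):
    """Fréchet distance between Gaussian fits of two feature sets."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    cov_mean = linalg.sqrtm(cov_r @ cov_g).real
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2 * cov_mean))

def matching_score(motion_feats, text_feats):
    """Average Euclidean distance between each motion and its own text feature."""
    return float(np.linalg.norm(motion_feats - text_feats, axis=1).mean())

# Toy usage with random 512-dimensional features (placeholders, not real data).
rng = np.random.default_rng(0)
real, gen, text = (rng.normal(size=(100, 512)) for _ in range(3))
print(fid(real, gen), matching_score(gen, text))
```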
As we see, AnimationGPT performs very well on these two metrics.
Industrial Insights
In general, building AI applications requires computing power, algorithms, and data. Datasets for images, text, and music/sound effects have unified standards and are relatively easy to collect. Unfortunately, 3D-related data is often very scarce.
Even the data representation is fragmented:
- In the industry, animators store animation data in FBX or other files, record the rotation and displacement of bones under differing naming conventions, and share the files for collaboration;
- In the field of computer graphics, people tend to use the BVH format and focus on learning from and processing this data;
- In computer vision, people tend to use SMPL, a family of 3D human body models, to improve the performance of existing methods.
Just as pointed out by MoFusion:
“…Motion datasets are notorious for not following the same convention, due to the specific needs of their creators. Each dataset can use a skeleton with a different number of joints, different joint indices, or a different default pose…”
Whether in academia or industry, motion data remains somewhat primitive and chaotic.
Conclusion
In 2019, the Max Planck Society used optical motion capture methods to produce the AMASS dataset; it has 11k daily human activities. In 2022, HumanML3D expanded this dataset to 14.6k actions.
On top of these, the dataset by the AnimationGPT team is a great complement to this everyday kind of motion data. AnimationGPT can generate anime-style, EXPRESSIVE gestures specific to games, which are otherwise difficult to obtain via motion capture. It will be helpful to the 3D and gaming communities.
There are still some limitations. For example, the model can't condition on frame rate or offer the fine-grained control a human animator has. Of course, AI is not as good as humans yet, but I think it can help human developers make games faster. I look forward to more progress from the team and beyond.
AnimationGPT is now available on their website and GitHub. You don't need programming skills to try it out on the website; if you'd like to modify the code or data, GitHub is the better option.
What are the AI features you look forward to? Share your thoughts with us!