This repository contains the official implementation of "The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion".
Language of Motion (LoM) is a framework that models human motion generation as a sequence modeling problem using language models. It decomposes the human body into separate regions (face, hands, upper body, and lower body) to effectively capture and generate natural human movement conditioned on modalities such as text and audio.
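To make the pipeline concrete, the following minimal sketch (illustrative only; names such as quantize, BODY_PARTS, and the stand-in codebooks are hypothetical and not this repository's API) shows how per-part motion streams can be discretized and interleaved with text tokens into a single sequence for a language model:

import numpy as np

BODY_PARTS = ["face", "hands", "upper", "lower"]

def quantize(features, codebook):
    # Nearest-codebook lookup: map each frame's feature vector (T, D)
    # to the index of its closest codebook entry (K, D) -> token ids (T,)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
# Stand-in "learned" codebooks and a fake 8-frame motion clip, one stream per part
codebooks = {part: rng.normal(size=(512, 64)) for part in BODY_PARTS}
motion = {part: rng.normal(size=(8, 64)) for part in BODY_PARTS}

# 1) Compositional tokenization: one discrete token stream per body part
part_tokens = {part: quantize(motion[part], codebooks[part]) for part in BODY_PARTS}

# 2) Sequence modeling: interleave text (or audio) tokens with per-part motion
#    tokens so a language model can be trained or prompted on one flat sequence
sequence = ["<text>", "a", "person", "waves", "</text>"]
for part in BODY_PARTS:
    sequence += [f"<{part}>"] + [f"<{part}_{t}>" for t in part_tokens[part]] + [f"</{part}>"]
print(sequence[:12])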
- Initial code release
- Inference code for text-to-motion
- Inference code for co-speech gesture generation
- Tokenizer training code
- AMASS and LibriSpeech preprocessing code
- Evaluation benchmark results
- Text-to-motion results in the rotation format
- Language model training code
We use Conda for environment management. Follow these steps to set up the development environment:
# Create and activate the conda environment
conda create --name lom -y python=3.10
conda activate lom
# Install PyTorch with CUDA support
conda install pytorch==2.4.0 torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# Alternative for RTX 5090 users: install PyTorch from the nightly cu128 wheels instead
# pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
# Pin the pip version and install dependencies
python -m pip install pip==21.3
pip install -r requirements.txt
# Install additional packages
pip install turbot5 -U
# Alternative for RTX 5090 users: upgrade triton to support the new architecture
# pip install --upgrade "git+https://github.com/openai/triton.git@main#egg=triton&subdirectory=python"
# Then set Triton's target compute capability from your GPU:
# export TRITON_JIT_CUDA_ARCHITECTURES=$(python -c "import torch; p = torch.cuda.get_device_properties(0); print(f'{p.major}{p.minor}')")
# Install NLP tools
python -m spacy download en_core_web_sm
# Set up fairseq (required for some components)
mkdir -p third_party
cd third_party
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
cd ../..
We use TEMOS for rendering, which requires Blender. Install it with our provided script:
# Execute the setup script to install Blender and its dependencies
chmod +x setup_blender.sh
./setup_blender.sh
This script will:
- Download and extract Blender 2.93.18
- Verify the Blender Python path
- Install all necessary Python packages for rendering
Please register an account on the Max Planck Institute for Intelligent Systems (MPI-IS) website to access the SMPLX models. Then download the SMPLX models and the Hubert, T5, and T2M metric checkpoints by running the following script:
chmod +x build_resources.sh
./build_resources.sh
After running the script, you will have the following directory structure:
model_files/
├── hubert_models/ # Hubert audio tokenizer models
├── smplx_models/ # SMPLX body models
├── t2m_evaluators/ # Text-to-Motion evaluation metrics
└── t5_models/ # T5 language models
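To confirm the download completed, a quick sanity check like the one below (a convenience snippet, not part of the repository) can be run from the project root:

from pathlib import Path

expected = ["hubert_models", "smplx_models", "t2m_evaluators", "t5_models"]
missing = [d for d in expected if not (Path("model_files") / d).is_dir()]
print("All resource folders found." if not missing else f"Missing folders: {missing}")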
Pretrained models are being uploaded gradually! Visit the Hugging Face repository to download them.
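One way to fetch a checkpoint is with the huggingface_hub client; the repo id and local directory below are placeholders, so substitute the repository linked above and whatever local path your configs expect:

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/<lom-checkpoints>",  # placeholder: use the Hugging Face repo linked above
    local_dir="checkpoints/lom",        # placeholder: any local path works
)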
python demo.py --cfg configs/demo_text2motion.yaml --text examples/text2motion.txt --task text2motion --render
python demo.py --cfg configs/demo_cospeech.yaml --audio examples/2_scott_0_111_111.wav --task cospeech --render
To train the model, you will need to download the following datasets:
- AMASS: Human motion dataset from the AMASS website, with text annotations from HumanML3D.
- BEAT2: Co-speech gesture dataset containing synchronized speech, emotion labels, and motion data, available from the BEAT website.
- LibriSpeech: Large-scale (1000+ hours) corpus of read English speech, downloadable from the LibriSpeech website.
After downloading, organize the datasets according to the following structure (detailed preprocessing instructions will be provided soon):
datasets/
├── AMASS/
├── BEAT2/
├── beat_chinese_v2.0.0/
├── beat_english_v2.0.0/
├── beat_japanese_v2.0.0/
├── beat_spanish_v2.0.0/
└── LibriSpeech/
Our comprehensive training documentation is coming soon! We'll provide detailed instructions for all three stages:
- Compositional Motion Tokenization (VQ-VAE Training)
- Language Model Pretraining
- Task-Specific Fine-tuning
Stay tuned for updates on our training procedures and best practices.
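Until the official training code lands, the snippet below gives a generic, illustrative view of the vector-quantization step behind stage 1; it uses a standard straight-through estimator and is not this repository's implementation (the 0.25 commitment weight is a placeholder):

import torch
import torch.nn.functional as F

def vq_straight_through(z_e, codebook):
    # z_e: (B, T, D) encoder features; codebook: (K, D) learned embeddings
    dists = torch.cdist(z_e, codebook.unsqueeze(0))   # (B, T, K) pairwise distances
    idx = dists.argmin(-1)                            # (B, T) discrete motion tokens
    z_q = codebook[idx]                               # (B, T, D) quantized vectors
    commit = F.mse_loss(z_e, z_q.detach())            # pull encoder toward the codebook
    embed = F.mse_loss(z_e.detach(), z_q)             # pull codebook toward the encoder
    z_q = z_e + (z_q - z_e).detach()                  # straight-through gradient to z_e
    return z_q, idx, embed + 0.25 * commit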
Evaluation metrics and benchmarking results are currently being prepared. Soon, we'll provide:
- Standardized evaluation scripts for all supported tasks
- Benchmark results on public datasets
- Comparison with SOTA methods
Check back for updates or follow our GitHub repository for notifications.
If you find our work useful for your research, please consider citing:
@inproceedings{chen2024language,
  title={The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion},
  author={Chen, Changan and Zhang, Juze and Lakshmikanth, Shrinidhi K and Fang, Yusu and Shao, Ruizhi and Wetzstein, Gordon and Fei-Fei, Li and Adeli, Ehsan},
  booktitle={CVPR},
  year={2025}
}
This project was partially funded by NIH grant R01AG089169 and UST. The authors would also like to thank Georgios Pavlakos for valuable discussions, and Chaitanya Patel, Jingyan Zhang, and Bin Li for their feedback on the paper.