Paper (coming soon!) | Website | Hugging Face
Cosmos-Predict2 is a key branch of the Cosmos World Foundation Models (WFMs) ecosystem for Physical AI, specializing in future state prediction through advanced world modeling. It offers two powerful capabilities: text-to-image generation for creating high-quality images from text descriptions, and video-to-world generation for producing visual simulations from video inputs.
We visualize the architecture of Cosmos-Predict2 in the following figure.
- 2025-07-10: We released Predict2 + NATTEN, bringing up to 2.6X end-to-end inference speedup with sparse attention (Video).
- 2025-06-11: We released post-training and inference code, along with model weights. For a code walkthrough, please see this video.
- Cosmos-Predict2-2B-Text2Image: Text-to-image generation
- Cosmos-Predict2-14B-Text2Image: Text-to-image generation
- Cosmos-Predict2-2B-Video2World: Video + Text based future visual world generation
- Cosmos-Predict2-14B-Video2World: Video + Text based future visual world generation
- Cosmos-Predict2-14B-Sample-GR00T-Dreams-GR1: Video + Text based future visual world generation, post-trained on the GR00T Dreams GR1 dataset
- Cosmos-Predict2-14B-Sample-GR00T-Dreams-DROID: Video + Text based future visual world generation, post-trained on the GR00T Dreams DROID dataset
- Cosmos-Predict2-2B-Sample-Action-Conditioned: Video + Action based future visual world generation, post-trained on the Bridge dataset
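The model names above follow a visible pattern: a parameter size (2B or 14B) followed by a task suffix. As a minimal sketch of how a script might pick a model by size or task, the snippet below keeps the names in a plain list and filters them. The `split_name` and `find_models` helpers are hypothetical and are not part of the Cosmos-Predict2 codebase.

```python
# Illustrative only: a plain-Python catalog of the model names listed above.
MODELS = [
    "Cosmos-Predict2-2B-Text2Image",
    "Cosmos-Predict2-14B-Text2Image",
    "Cosmos-Predict2-2B-Video2World",
    "Cosmos-Predict2-14B-Video2World",
    "Cosmos-Predict2-14B-Sample-GR00T-Dreams-GR1",
    "Cosmos-Predict2-14B-Sample-GR00T-Dreams-DROID",
    "Cosmos-Predict2-2B-Sample-Action-Conditioned",
]

PREFIX = "Cosmos-Predict2-"


def split_name(name):
    """Split a model name into (size, task), e.g. ("2B", "Video2World")."""
    size, task = name[len(PREFIX):].split("-", 1)
    return size, task


def find_models(size=None, task=None):
    """Hypothetical helper: return names matching the given size and/or task."""
    matches = []
    for name in MODELS:
        model_size, model_task = split_name(name)
        if size is not None and model_size != size:
            continue
        if task is not None and model_task != task:
            continue
        matches.append(name)
    return matches


print(find_models(size="2B"))
print(find_models(task="Video2World"))
```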
Here is a quick example demonstrating how to use Cosmos-Predict2-2B-Video2World for video generation:
import torch
from imaginaire.utils.io import save_image_or_video
from cosmos_predict2.configs.base.config_video2world import PREDICT2_VIDEO2WORLD_PIPELINE_2B
from cosmos_predict2.pipelines.video2world import Video2WorldPipeline
# Create the video generation pipeline.
pipe = Video2WorldPipeline.from_config(
    config=PREDICT2_VIDEO2WORLD_PIPELINE_2B,
    dit_path="checkpoints/nvidia/Cosmos-Predict2-2B-Video2World/model-720p-16fps.pt",
    text_encoder_path="checkpoints/google-t5/t5-11b",
)
# Specify the input image path and text prompt.
image_path = "assets/video2world/example_input.jpg"
prompt = "A high-definition video captures the precision of robotic welding in an industrial setting. The first frame showcases a robotic arm, equipped with a welding torch, positioned over a large metal structure. The welding process is in full swing, with bright sparks and intense light illuminating the scene, creating a vivid display of blue and white hues. A significant amount of smoke billows around the welding area, partially obscuring the view but emphasizing the heat and activity. The background reveals parts of the workshop environment, including a ventilation system and various pieces of machinery, indicating a busy and functional industrial workspace. As the video progresses, the robotic arm maintains its steady position, continuing the welding process and moving to its left. The welding torch consistently emits sparks and light, and the smoke continues to rise, diffusing slightly as it moves upward. The metal surface beneath the torch shows ongoing signs of heating and melting. The scene retains its industrial ambiance, with the welding sparks and smoke dominating the visual field, underscoring the ongoing nature of the welding operation."
# Run the video generation pipeline.
video = pipe(input_path=image_path, prompt=prompt)
# Save the resulting output video.
save_image_or_video(video, "output/test.mp4", fps=16)
Input prompt:
A high-definition video captures the precision of robotic welding in an industrial setting. The first frame showcases a robotic arm, equipped with a welding torch, positioned over a large metal structure. The welding process is in full swing, with bright sparks and intense light illuminating the scene, creating a vivid display of blue and white hues. A significant amount of smoke billows around the welding area, partially obscuring the view but emphasizing the heat and activity. The background reveals parts of the workshop environment, including a ventilation system and various pieces of machinery, indicating a busy and functional industrial workspace. As the video progresses, the robotic arm maintains its steady position, continuing the welding process and moving to its left. The welding torch consistently emits sparks and light, and the smoke continues to rise, diffusing slightly as it moves upward. The metal surface beneath the torch shows ongoing signs of heating and melting. The scene retains its industrial ambiance, with the welding sparks and smoke dominating the visual field, underscoring the ongoing nature of the welding operation.
Input image | Output video |
---|---|
![]() | video2world_2b_example.mp4 |
Our setup guide provides complete information on:
- System requirements: Detailed hardware and software prerequisites
- Installation: Step-by-step setup with both Conda and Docker options
- Downloading checkpoints: Instructions for obtaining model weights
- Troubleshooting: Solutions for common installation and CUDA compatibility issues
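Before constructing a pipeline, it can help to confirm that the downloaded checkpoints are where the code expects them. The following is a minimal sketch, assuming the two paths used in the quick example above; `missing_checkpoints` is a hypothetical helper and not part of the repository, and the layout for other models may differ.

```python
from pathlib import Path

# Paths taken from the quick example; adjust them to match your local layout.
REQUIRED_CHECKPOINTS = [
    "checkpoints/nvidia/Cosmos-Predict2-2B-Video2World/model-720p-16fps.pt",
    "checkpoints/google-t5/t5-11b",
]


def missing_checkpoints(paths, root="."):
    """Hypothetical helper: return the entries of `paths` that do not exist under `root`."""
    root = Path(root)
    return [p for p in paths if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints(REQUIRED_CHECKPOINTS)
    if missing:
        print("Missing checkpoints:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All checkpoints found.")
```

Running this from the repository root before the quick example gives an early, readable failure instead of an error partway through model loading.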
For inference examples and usage, see:
- Text2Image Inference: Guide for generating high-quality images from text prompts
- Video2World Inference: Guide for generating videos from images/videos with text prompts, including:
  - Single and batch processing
  - Multi-frame conditioning
  - Multi-GPU inference for faster generation
  - Using the prompt refiner
  - Rejection sampling for quality improvement
- Text2World Inference: Guide for generating videos directly from text prompts, including:
  - Single and batch processing
  - Multi-GPU inference for faster generation
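Rejection sampling, mentioned in the list above, can be thought of as generating several candidates and keeping the one a quality scorer prefers. The sketch below shows that best-of-N idea in generic Python. Both `generate` and `score` are hypothetical callables supplied by the caller; the actual Cosmos-Predict2 implementation, its scoring model, and its options are covered in the Video2World inference guide and may differ.

```python
import random


def best_of_n(generate, score, num_candidates=4, base_seed=0):
    """Generate `num_candidates` samples with different seeds and keep the top-scoring one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    best_sample, best_score = None, float("-inf")
    for i in range(num_candidates):
        sample = generate(seed=base_seed + i)
        sample_score = score(sample)
        if sample_score > best_score:
            best_sample, best_score = sample, sample_score
    return best_sample, best_score


# Stand-ins for a real generator and quality scorer (assumptions for illustration).
def toy_generate(seed):
    return random.Random(seed).random()


def toy_score(sample):
    return -abs(sample - 0.5)  # closer to 0.5 counts as "better"


sample, quality = best_of_n(toy_generate, toy_score, num_candidates=8)
print(sample, quality)
```

The cost grows linearly with `num_candidates`, so the number of candidates is a direct trade-off between generation time and expected quality.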
For post-training customization, see:
- Video2World Post-training guide: General guide to the video2world training system in the codebase
- Video2World Post-training on Cosmos-NeMo-Assets: Case study for post-training on Cosmos-NeMo-Assets data
- Video2World Post-training on fisheye-view AgiBotWorld-Alpha dataset: Case study for post-training on fisheye-view robot videos from the AgiBotWorld-Alpha dataset
- Video2World Post-training on GR00T Dreams GR1 and DROID datasets: Case study for post-training on the GR00T Dreams GR1 and DROID datasets
- Video2World Action-conditioned Post-training on Bridge dataset: Case study for action-conditioned post-training on the Bridge dataset
- Text2Image Post-training guide: General guide to the text2image training system in the codebase
- Text2Image Post-training on Cosmos-NeMo-Assets: Case study for post-training on Cosmos-NeMo-Assets image data
Our performance guide includes:
- Hardware requirements: Recommended GPU configurations and memory requirements
- Performance benchmarks: Detailed speed and quality comparisons across different GPU architectures
- Model selection guide: Practical advice for choosing between 2B and 14B variants based on your needs
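One input to the choice between the 2B and 14B variants is how much memory the weights alone occupy. As a rough back-of-the-envelope sketch, the snippet below multiplies a nominal parameter count by the bytes stored per parameter. It assumes the counts are exactly 2 and 14 billion and that weights are held in a 16-bit format; neither is stated here, and the result ignores activations, the text encoder, and other overheads, so the figures in the performance guide take precedence.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts inferred from the model names (an assumption).
NOMINAL_PARAMS = {"2B": 2e9, "14B": 14e9}

for name, count in NOMINAL_PARAMS.items():
    print(f"{name}: ~{weight_memory_gb(count):.0f} GB of weights at 16-bit precision")
```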
We thrive on community collaboration! NVIDIA-Cosmos wouldn't be where it is without contributions from developers like you. Check out our Contributing Guide to get started, and share your feedback through issues.
Big thanks 🙏 to everyone helping us push the boundaries of open-source physical AI!
This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
This model includes safety and content moderation features powered by Llama Guard 3. Llama Guard 3 is used solely as a content input filter and is subject to its own license.
NVIDIA Cosmos source code is released under the Apache 2 License.
NVIDIA Cosmos models are released under the NVIDIA Open Model License. For a custom license, please contact [email protected].