GitHub - nvidia-cosmos/cosmos-predict2: Cosmos-Predict2 is a collection of general-purpose world foundation models for Physical AI that can be fine-tuned into customized world models for downstream applications.

Paper (coming soon!) | Website | Hugging Face

Cosmos-Predict2 is a key branch of the Cosmos World Foundation Models (WFMs) ecosystem for Physical AI, specializing in future state prediction through world modeling. It offers two capabilities: text-to-image generation, which creates images from text descriptions, and video-to-world generation, which produces visual simulations from video inputs.

We visualize the architecture of Cosmos-Predict2 in the following figure.

News

2025-07-10: We released Predict2 + NATTEN, bringing up to 2.6X end-to-end inference speedup with sparse attention (Video).
2025-06-11: We released post-training and inference code, along with model weights. For a code walkthrough, please see this video.

Models

Cosmos-Predict2-2B-Text2Image: Text-to-image generation
Cosmos-Predict2-14B-Text2Image: Text-to-image generation
Cosmos-Predict2-2B-Video2World: Video + Text based future visual world generation
Cosmos-Predict2-14B-Video2World: Video + Text based future visual world generation
Cosmos-Predict2-14B-Sample-GR00T-Dreams-GR1: Video + Text based future visual world generation, post-trained on GR00T Dreams GR1 dataset
Cosmos-Predict2-14B-Sample-GR00T-Dreams-DROID: Video + Text based future visual world generation, post-trained on GR00T Dreams DROID dataset
Cosmos-Predict2-2B-Sample-Action-Conditioned: Video + Action based future visual world generation, post-trained on Bridge dataset
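
For scripting, the list above can be captured as a small lookup table. The following is a minimal sketch and not part of the repository: the `ModelEntry` class, the `find_models` helper, and the `conditioning` labels are hypothetical names introduced here. Only the model names and dataset names come from the list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelEntry:
    """One row of the model list (hypothetical structure, for illustration)."""
    name: str
    size: str                      # "2B" or "14B", taken from the model name
    conditioning: str              # "text", "video+text", or "video+action"
    post_trained_on: Optional[str] = None


MODELS = [
    ModelEntry("Cosmos-Predict2-2B-Text2Image", "2B", "text"),
    ModelEntry("Cosmos-Predict2-14B-Text2Image", "14B", "text"),
    ModelEntry("Cosmos-Predict2-2B-Video2World", "2B", "video+text"),
    ModelEntry("Cosmos-Predict2-14B-Video2World", "14B", "video+text"),
    ModelEntry("Cosmos-Predict2-14B-Sample-GR00T-Dreams-GR1", "14B",
               "video+text", "GR00T Dreams GR1"),
    ModelEntry("Cosmos-Predict2-14B-Sample-GR00T-Dreams-DROID", "14B",
               "video+text", "GR00T Dreams DROID"),
    ModelEntry("Cosmos-Predict2-2B-Sample-Action-Conditioned", "2B",
               "video+action", "Bridge"),
]


def find_models(size: Optional[str] = None,
                conditioning: Optional[str] = None) -> List[str]:
    """Return model names matching the given size and/or conditioning."""
    return [
        m.name
        for m in MODELS
        if (size is None or m.size == size)
        and (conditioning is None or m.conditioning == conditioning)
    ]


print(find_models(size="2B", conditioning="video+text"))
```

Calling `find_models(size="14B")` would list the four 14B entries, including the two GR00T Dreams post-trained samples.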

Quick Start

Here is a quick example demonstrating how to use Cosmos-Predict2-2B-Video2World for video generation:

import torch
from imaginaire.utils.io import save_image_or_video
from cosmos_predict2.configs.base.config_video2world import PREDICT2_VIDEO2WORLD_PIPELINE_2B
from cosmos_predict2.pipelines.video2world import Video2WorldPipeline

# Create the video generation pipeline.
pipe = Video2WorldPipeline.from_config(
    config=PREDICT2_VIDEO2WORLD_PIPELINE_2B,
    dit_path="checkpoints/nvidia/Cosmos-Predict2-2B-Video2World/model-720p-16fps.pt",
    text_encoder_path="checkpoints/google-t5/t5-11b",
)

# Specify the input image path and text prompt.
image_path = "assets/video2world/example_input.jpg"
prompt = "A high-definition video captures the precision of robotic welding in an industrial setting. The first frame showcases a robotic arm, equipped with a welding torch, positioned over a large metal structure. The welding process is in full swing, with bright sparks and intense light illuminating the scene, creating a vivid display of blue and white hues. A significant amount of smoke billows around the welding area, partially obscuring the view but emphasizing the heat and activity. The background reveals parts of the workshop environment, including a ventilation system and various pieces of machinery, indicating a busy and functional industrial workspace. As the video progresses, the robotic arm maintains its steady position, continuing the welding process and moving to its left. The welding torch consistently emits sparks and light, and the smoke continues to rise, diffusing slightly as it moves upward. The metal surface beneath the torch shows ongoing signs of heating and melting. The scene retains its industrial ambiance, with the welding sparks and smoke dominating the visual field, underscoring the ongoing nature of the welding operation."

# Run the video generation pipeline.
video = pipe(input_path=image_path, prompt=prompt)

# Save the resulting output video.
save_image_or_video(video, "output/test.mp4", fps=16)

Input prompt:

A high-definition video captures the precision of robotic welding in an industrial setting. The first frame showcases a robotic arm, equipped with a welding torch, positioned over a large metal structure. The welding process is in full swing, with bright sparks and intense light illuminating the scene, creating a vivid display of blue and white hues. A significant amount of smoke billows around the welding area, partially obscuring the view but emphasizing the heat and activity. The background reveals parts of the workshop environment, including a ventilation system and various pieces of machinery, indicating a busy and functional industrial workspace. As the video progresses, the robotic arm maintains its steady position, continuing the welding process and moving to its left. The welding torch consistently emits sparks and light, and the smoke continues to rise, diffusing slightly as it moves upward. The metal surface beneath the torch shows ongoing signs of heating and melting. The scene retains its industrial ambiance, with the welding sparks and smoke dominating the visual field, underscoring the ongoing nature of the welding operation.

Input image	Output video
	video2world_2b_example.mp4
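
The quick-start script appears to expect the two checkpoint paths passed to `Video2WorldPipeline.from_config` to be present locally. As a hedged convenience, the sketch below checks for them before the pipeline is built. `missing_checkpoints` is a hypothetical helper, not a repository function, and the paths are simply the ones used in the example above.

```python
from pathlib import Path
from typing import Iterable, List


def missing_checkpoints(paths: Iterable[str]) -> List[str]:
    """Return the subset of `paths` that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


# Paths copied from the quick-start example; adjust to your checkpoint layout.
required = [
    "checkpoints/nvidia/Cosmos-Predict2-2B-Video2World/model-720p-16fps.pt",
    "checkpoints/google-t5/t5-11b",
]

missing = missing_checkpoints(required)
if missing:
    # Report every missing path at once instead of failing on the first one.
    print("Missing checkpoint paths:", *missing, sep="\n  ")
```

If anything is reported as missing, the "Downloading checkpoints" section of the setup guide is the place to look.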

User Guide

Our setup guide provides complete information on:

System requirements: Detailed hardware and software prerequisites
Installation: Step-by-step setup with both Conda and Docker options
Downloading checkpoints: Instructions for obtaining model weights
Troubleshooting: Solutions for common installation and CUDA compatibility issues

For inference examples and usage:

Text2Image Inference: Guide for generating high-quality images from text prompts
Video2World Inference: Guide for generating videos from images/videos with text prompts, including:
- Single and batch processing
- Multi-frame conditioning
- Multi-GPU inference for faster generation
- Using the prompt refiner
- Rejection sampling for quality improvement
Text2World Inference: Guide for generating videos directly from text prompts, including:
- Single and batch processing
- Multi-GPU inference for faster generation
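
The inference guides describe the supported batch workflow. Purely as an illustration of the idea, a batch can be thought of as a loop over (input, prompt) pairs that calls a pipeline object the same way the Quick Start does. In this sketch `run_batch` and the job dictionary keys are hypothetical, and `pipe` is any callable accepting `input_path` and `prompt` keyword arguments. It is not the repository's batch interface.

```python
from typing import Any, Callable, Dict, Iterable


def run_batch(pipe: Callable[..., Any],
              jobs: Iterable[Dict[str, str]]) -> Dict[str, Any]:
    """Call `pipe` once per job and collect results keyed by output path."""
    results: Dict[str, Any] = {}
    for job in jobs:
        results[job["output"]] = pipe(input_path=job["input"],
                                      prompt=job["prompt"])
    return results


# Stand-in pipeline so the sketch runs without model weights.
def fake_pipe(input_path: str, prompt: str) -> str:
    return f"{input_path}|{prompt}"


jobs = [
    {"input": "a.jpg", "prompt": "first", "output": "output/a.mp4"},
    {"input": "b.jpg", "prompt": "second", "output": "output/b.mp4"},
]
outputs = run_batch(fake_pipe, jobs)
```

With a real pipeline in place of `fake_pipe`, each value in `outputs` would be whatever the pipeline returns, which could then be written out with `save_image_or_video` as in the Quick Start.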

For post-training customization:

Video2World Post-training guide: General guide to the video2world training system in the codebase.
Video2World Post-training on Cosmos-NeMo-Assets: Case study for post-training on Cosmos-NeMo-Assets data
Video2World Post-training on fisheye-view AgiBotWorld-Alpha dataset: Case study for post-training on fisheye-view robot videos from AgiBotWorld-Alpha dataset.
Video2World Post-training on GR00T Dreams GR1 and DROID datasets: Case study for post-training on GR00T Dreams GR1 and DROID datasets.
Video2World Action-conditioned Post-training on Bridge dataset: Case study for action-conditioned post-training on Bridge dataset.
Text2Image Post-training guide: General guide to the text2image training system in the codebase.
Text2Image Post-training on Cosmos-NeMo-Assets: Case study for post-training on Cosmos-NeMo-Assets image data.

Our performance guide includes:

Hardware requirements: Recommended GPU configurations and memory requirements
Performance benchmarks: Detailed speed and quality comparisons across different GPU architectures
Model selection guide: Practical advice for choosing between 2B and 14B variants based on your needs

Contributing

We thrive on community collaboration! NVIDIA-Cosmos wouldn't be where it is without contributions from developers like you. Check out our Contributing Guide to get started, and share your feedback through issues.

Big thanks 🙏 to everyone helping us push the boundaries of open-source physical AI!

License and Contact

This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.

This model includes safety and content moderation features powered by Llama Guard 3. Llama Guard 3 is used solely as a content input filter and is subject to its own license.

NVIDIA Cosmos source code is released under the Apache 2 License.

NVIDIA Cosmos models are released under the NVIDIA Open Model License. For a custom license, please contact [email protected].
