NVIDIA Cosmos is a developer-first world foundation model platform designed to help Physical AI developers build their Physical AI systems better and faster. Cosmos contains:
- Pre-trained models (available via Hugging Face) under the NVIDIA Open Model License, which allows commercial use of the models free of charge.
- Training scripts under the Apache 2 License for post-training the models for various downstream Physical AI applications.
Cosmos-Predict1 includes the following features:
- Diffusion-based world foundation models for Text2World and Video2World generation, where a user can generate visual simulations from text prompts and video prompts.
- Autoregressive-based world foundation models for Video2World generation, where a user can generate visual simulations from video prompts and optional text prompts.
- Image and video tokenizers for tokenizing images and videos into continuous tokens (latent vectors) and discrete tokens (integers) efficiently and effectively.
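The distinction between continuous tokens (latent vectors) and discrete tokens (integers) can be illustrated with a toy example. The sketch below is not the Cosmos tokenizer and does not use its API. It assumes a generic nearest-neighbor codebook lookup, and the latents, the codebook, and the `quantize` helper are all hypothetical. The quantization scheme Cosmos actually uses is not described here.

```python
# Toy illustration only: this is NOT the Cosmos tokenizer.
# A "continuous" token is a latent vector of floats; a "discrete" token is
# the integer index of the closest entry in a fixed codebook.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Map each continuous latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(vec, codebook[i]))
        for vec in latents
    ]


# Hypothetical 2-D latents and a hypothetical 3-entry codebook.
continuous_tokens = [[0.1, 0.9], [0.8, 0.2], [0.45, 0.55]]
codebook = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

discrete_tokens = quantize(continuous_tokens, codebook)
print(discrete_tokens)  # [0, 1, 2]
```

Continuous tokens keep the full latent vector, while discrete tokens reduce each position to a single integer.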
Inference with pre-trained models:
- Inference with diffusion-based Text2World models [with multi-GPU support]
- Inference with diffusion-based Video2World models [with multi-GPU support]
- Inference with autoregressive-based base models [with multi-GPU support]
- Inference with autoregressive-based Video2World models [with multi-GPU support]
- Inference with tokenizer models
Post-training models:
- Post-training diffusion-based Text2World models [with multi-GPU support]
- Post-training diffusion-based Video2World models [with multi-GPU support]
- Post-training diffusion-based Text2World models (with multi-view data) [with multi-GPU support]
- Post-training diffusion-based Video2World models (with multi-view data) [with multi-GPU support]
- Post-training autoregressive-based base models [with multi-GPU support]
- Post-training tokenizer models [with multi-GPU support]
Inference with post-trained models:
- Inference with diffusion-based Text2World models (with multi-view data) [with multi-GPU support]
- Inference with diffusion-based Video2World models (with multi-view data) [with multi-GPU support]
The code snippet below gives the gist of running inference.
PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Text2World \
--prompt "${PROMPT}" \
--offload_prompt_upsampler \
--video_save_name diffusion-text2world-7btext2world-example.mp4
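When the same command needs to be launched from Python (for example, to loop over several prompts), it can be assembled programmatically. The following is a minimal sketch that uses only the script path, flags, and environment variables that appear in the snippet above. The helper names `build_text2world_command` and `build_env` are hypothetical and are not part of the Cosmos codebase.

```python
import os
import shlex


def build_text2world_command(
    prompt,
    checkpoint_dir="checkpoints",
    model_dir="Cosmos-Predict1-7B-Text2World",
    video_save_name="diffusion-text2world-7btext2world-example.mp4",
    offload_prompt_upsampler=True,
):
    """Return the argv list for the Text2World command shown above."""
    argv = [
        "python",
        "cosmos_predict1/diffusion/inference/text2world.py",
        "--checkpoint_dir", checkpoint_dir,
        "--diffusion_transformer_dir", model_dir,
        "--prompt", prompt,
    ]
    if offload_prompt_upsampler:
        argv.append("--offload_prompt_upsampler")
    argv += ["--video_save_name", video_save_name]
    return argv


def build_env(base_env=None):
    """Mirror CUDA_HOME=$CONDA_PREFIX and PYTHONPATH=$(pwd) from the shell snippet."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_HOME"] = env.get("CONDA_PREFIX", "")
    env["PYTHONPATH"] = os.getcwd()
    return env


argv = build_text2world_command("A robot stands in a warehouse.")
print(shlex.join(argv))  # shell-quoted form, handy for logging
# To actually launch it from the repository root:
#   import subprocess
#   subprocess.run(argv, env=build_env(), check=True)
```

Passing the prompt as a list element avoids shell-quoting problems with long prompts that contain apostrophes or commas.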
We provide a series of pre-trained models of different families, available for download on Hugging Face.
Diffusion models
- Cosmos-Predict1-7B-Text2World: Text to visual world generation
- Cosmos-Predict1-14B-Text2World: Text to visual world generation
- Cosmos-Predict1-7B-Video2World: Video + Text based future visual world generation
- Cosmos-Predict1-14B-Video2World: Video + Text based future visual world generation
Autoregressive models
- Cosmos-Predict1-4B: Future visual world generation
- Cosmos-Predict1-12B: Future visual world generation
- Cosmos-Predict1-5B-Video2World: Video + Text based future visual world generation
- Cosmos-Predict1-13B-Video2World: Video + Text based future visual world generation
Tokenizers
- Cosmos-Tokenize1-CV8×8×8-720p: Continuous Video Tokenizer with 8x8x8 spatio-temporal compression and a 121-frame context
- Cosmos-Tokenize1-DV8×16×16-720p: Discrete Video Tokenizer with 8x16x16 spatio-temporal compression and a 49-frame context
- Cosmos-Tokenize1-CI8×8-360p: Continuous Image Tokenizer with 8x8 spatial compression and low-resolution support
- Cosmos-Tokenize1-CI16x16-360p: Continuous Image Tokenizer with 16x16 spatial compression and low-resolution support
- Cosmos-Tokenize1-CV4×8×8-360p: Continuous Video Tokenizer with 4x8x8 spatio-temporal compression and low-resolution support
- Cosmos-Tokenize1-DI8×8-360p: Discrete Image Tokenizer with 8x8 spatial compression and low-resolution support
- Cosmos-Tokenize1-DI16x16-360p: Discrete Image Tokenizer with 16x16 spatial compression and low-resolution support
- Cosmos-Tokenize1-DV4×8×8-360p: Discrete Video Tokenizer with 4x8x8 spatio-temporal compression and low-resolution support
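The tokenizer names above follow a visible pattern: `C` or `D` for continuous or discrete, `I` or `V` for image or video, the compression factors, then a resolution tag. The sketch below decodes that pattern and does some simple compression arithmetic. The pattern is inferred from the list only. It is an assumption that the last two factors are the spatial ones and that spatial dimensions divide evenly. The helper names and the 720 x 1280 frame size are hypothetical. How the temporal axis is handled is not stated here, so the sketch does not compute it.

```python
import re
from math import prod

# Name pattern inferred from the list above: <C|D><I|V><factors>-<resolution>.
# Both "×" and "x" appear as separators in the names, so accept either.
_NAME = re.compile(r"^Cosmos-Tokenize1-([CD])([IV])(\d+(?:[x×]\d+)+)-(\d+p)$")


def parse_tokenizer_name(name):
    """Split a tokenizer name into token type, modality, factors, and resolution."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized tokenizer name: {name!r}")
    token_type, modality, factors, resolution = match.groups()
    return {
        "token_type": "continuous" if token_type == "C" else "discrete",
        "modality": "image" if modality == "I" else "video",
        "factors": tuple(int(f) for f in re.split(r"[x×]", factors)),
        "resolution": resolution,
    }


def positions_per_token(factors):
    """Number of input pixel positions summarized by one latent position."""
    return prod(factors)


def latent_spatial_size(height, width, factors):
    """Latent height and width, assuming the last two factors are spatial."""
    factor_h, factor_w = factors[-2], factors[-1]
    return height // factor_h, width // factor_w


info = parse_tokenizer_name("Cosmos-Tokenize1-CV8×8×8-720p")
print(info["token_type"], info["modality"], info["factors"])
print(positions_per_token(info["factors"]))              # 8 * 8 * 8 = 512
print(latent_spatial_size(720, 1280, info["factors"]))   # (90, 160)
```

Under these assumptions, the 8x16x16 discrete video tokenizer summarizes 2048 positions per token, four times as many as the 8x8x8 continuous one.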
This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
NVIDIA Cosmos source code is released under the Apache 2 License.
NVIDIA Cosmos models are released under the NVIDIA Open Model License. For a custom license, please contact [email protected].
