GitHub - Aurora-edu/cosmos-predict1: Cosmos-Predict1 is a collection of general-purpose world foundation models for Physical AI that can be fine-tuned into customized world models for downstream applications.

Website | Hugging Face | Paper | Paper Website

NVIDIA Cosmos is a developer-first world foundation model platform designed to help Physical AI developers build their systems better and faster. Cosmos contains:

Pre-trained models (available via Hugging Face) under the NVIDIA Open Model License, which allows commercial use of the models free of charge.
Training scripts under the Apache 2 License for post-training the models for various downstream Physical AI applications.

Key Features

Cosmos-Predict1 includes the following features:

Diffusion-based world foundation models for Text2World and Video2World generation, where a user can generate visual simulations from text prompts and video prompts.
Autoregressive-based world foundation models for Video2World generation, where a user can generate visual simulations from video prompts and optional text prompts.
Image and video tokenizers that convert images and videos into continuous tokens (latent vectors) or discrete tokens (integers).

Examples

Inference with pre-trained models:

Inference with diffusion-based Text2World models [with multi-GPU support]
Inference with diffusion-based Video2World models [with multi-GPU support]
Inference with autoregressive-based base models [with multi-GPU support]
Inference with autoregressive-based Video2World models [with multi-GPU support]
Inference with tokenizer models

Post-training models:

Post-training diffusion-based Text2World models [with multi-GPU support]
Post-training diffusion-based Video2World models [with multi-GPU support]
Post-training diffusion-based Text2World models (with multi-view data) [with multi-GPU support]
Post-training diffusion-based Video2World models (with multi-view data) [with multi-GPU support]
Post-training autoregressive-based base models [with multi-GPU support]
Post-training tokenizer models [with multi-GPU support]

Inference with post-trained models:

Inference with diffusion-based Text2World models (with multi-view data) [with multi-GPU support]
Inference with diffusion-based Video2World models (with multi-view data) [with multi-GPU support]

The code snippet below shows the basic inference usage.

PROMPT="A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."

CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/text2world.py \
    --checkpoint_dir checkpoints \
    --diffusion_transformer_dir Cosmos-Predict1-7B-Text2World \
    --prompt "${PROMPT}" \
    --offload_prompt_upsampler \
    --video_save_name diffusion-text2world-7b

text2world-example.mp4
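
To try the same prompt against both Text2World model sizes, the command above can be wrapped in a small shell function. The following is a minimal sketch, not part of the repository: the function name `run_text2world` and the `DRY_RUN` variable are hypothetical, only the flags shown in the snippet above are used, and it assumes the 14B checkpoint is driven by the same script and flags as the 7B one, which this README does not confirm. With `DRY_RUN` set, each command is printed rather than launched.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the command shown above; not part of the repository.
# Set DRY_RUN=1 to print each command instead of launching it.
run_text2world() {
    local size="$1"    # model size in billions of parameters: 7 or 14
    local prompt="$2"
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty.
    CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) ${DRY_RUN:+echo} \
        python cosmos_predict1/diffusion/inference/text2world.py \
        --checkpoint_dir checkpoints \
        --diffusion_transformer_dir "Cosmos-Predict1-${size}B-Text2World" \
        --prompt "${prompt}" \
        --offload_prompt_upsampler \
        --video_save_name "diffusion-text2world-${size}b"
}

PROMPT="A humanoid robot stands in a warehouse."
DRY_RUN=1    # remove this line to actually run inference
for size in 7 14; do
    run_text2world "$size" "$PROMPT"
done
```

Printing the commands first makes it easy to check the checkpoint directory and output names before committing GPU time to a run.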

Model Family

We provide a series of pre-trained models of different families, available for download on Hugging Face.

Diffusion models

Cosmos-Predict1-7B-Text2World: Text to visual world generation
Cosmos-Predict1-14B-Text2World: Text to visual world generation
Cosmos-Predict1-7B-Video2World: Video + Text based future visual world generation
Cosmos-Predict1-14B-Video2World: Video + Text based future visual world generation

Autoregressive models

Cosmos-Predict1-4B: Future visual world generation
Cosmos-Predict1-12B: Future visual world generation
Cosmos-Predict1-5B-Video2World: Video + Text based future visual world generation
Cosmos-Predict1-13B-Video2World: Video + Text based future visual world generation

Tokenizers

Cosmos-Tokenize1-CV8×8×8-720p: Continuous Video Tokenizer with 8x8x8 spatio-temporal compression and a 121-frame context
Cosmos-Tokenize1-DV8×16×16-720p: Discrete Video Tokenizer with 8x16x16 spatio-temporal compression and a 49-frame context
Cosmos-Tokenize1-CI8×8-360p: Continuous Image Tokenizer with 8x8 spatial compression and low-resolution support
Cosmos-Tokenize1-CI16x16-360p: Continuous Image Tokenizer with 16x16 spatial compression and low-resolution support
Cosmos-Tokenize1-CV4×8×8-360p: Continuous Video Tokenizer with 4x8x8 spatio-temporal compression and low-resolution support
Cosmos-Tokenize1-DI8×8-360p: Discrete Image Tokenizer with 8x8 spatial compression and low-resolution support
Cosmos-Tokenize1-DI16x16-360p: Discrete Image Tokenizer with 16x16 spatial compression and low-resolution support
Cosmos-Tokenize1-DV4×8×8-360p: Discrete Video Tokenizer with 4x8x8 spatio-temporal compression and low-resolution support
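
The compression factors in the tokenizer names translate into latent grid sizes. The sketch below works through that arithmetic for the two 720p video tokenizers listed above. It rests on two assumptions that this README does not state: a 1280x720 input, and a causal temporal scheme in which the first frame is encoded on its own, so that T frames map to 1 + (T - 1) / factor latent frames. The function name `latent_shape` is hypothetical.

```shell
#!/usr/bin/env bash
# latent_shape FRAMES HEIGHT WIDTH T_FACTOR S_FACTOR
# Prints "latent_frames latent_height latent_width".
# Assumption: causal temporal compression, 1 + (FRAMES - 1) / T_FACTOR.
latent_shape() {
    local frames="$1" height="$2" width="$3" t="$4" s="$5"
    echo "$(( 1 + (frames - 1) / t )) $(( height / s )) $(( width / s ))"
}

# Cosmos-Tokenize1-CV8×8×8-720p with its 121-frame context (1280x720 assumed)
latent_shape 121 720 1280 8 8     # prints: 16 90 160
# Cosmos-Tokenize1-DV8×16×16-720p with its 49-frame context (1280x720 assumed)
latent_shape 49 720 1280 8 16     # prints: 7 45 80
```

Under these assumptions, both context lengths (121 and 49) have the form 8k + 1, so the temporal division is exact.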

License and Contact

This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.

NVIDIA Cosmos source code is released under the Apache 2 License.

NVIDIA Cosmos models are released under the NVIDIA Open Model License. For a custom license, please contact [email protected].

