This repository is no longer maintained. You can still use it for Cosmos-Transfer1. New repo: https://github.com/nv-tlabs/Cosmos-Drive-Dreams


Cosmos-AV-Sample Toolkits

This repo provides the following toolkits:

  • A rendering script that converts RDS-HQ datasets into input videos (LiDAR and HDMAP) compatible with Cosmos-Transfer1-7B-Sample-AV.

  • A conversion script that converts open-source datasets (e.g., Waymo Open Dataset) into RDS-HQ format, so you can reuse the rendering script to render the input videos.

  • An interactive visualization tool to visualize the RDS-HQ dataset and generate novel ego trajectories.

  • 10 examples (collected by NVIDIA) of input prompts in raw format to help understand how to interface with the model.

[Paper] [Model Code] [Website]

Quick Start

We provide pre-rendered examples here (rendered HDMAP / rendered LiDAR / text prompts). You can download and use these examples to test Cosmos-Transfer1-7B-Sample-AV!

Installation

git clone https://github.com/nv-tlabs/cosmos-av-sample-toolkits.git
cd cosmos-av-sample-toolkits
conda env create -f environment.yaml
conda activate cosmos-av-toolkits
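
As an optional sanity check (not part of the documented setup), you can verify that dependencies used by the workflows below, such as viser and ray, resolve inside the new environment; the authoritative package list is environment.yaml.

# sanity_check.py -- hypothetical helper, not part of this repo
from importlib.metadata import version, PackageNotFoundError

# viser (visualization) and ray (parallel rendering) are referenced below;
# numpy is assumed as a common dependency
for pkg in ("viser", "ray", "numpy"):
    try:
        print(f"{pkg} {version(pkg)} is installed")
    except PackageNotFoundError:
        print(f"{pkg} is missing -- re-check environment.yaml")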

Download Examples

We provide 10 examples of input prompts with HD map and LiDAR to help you test the model.

  1. Add your SSH public key to your user settings on Hugging Face.
  2. Download the examples from Hugging Face (about 8GB):
git lfs install
git clone git@hf.co:datasets/nvidia/Cosmos-Transfer1-7B-Sample-AV-Data-Example
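
If you prefer Python over git for the download, a minimal sketch using the huggingface_hub API is shown below; this is an alternative to the documented git-based workflow, not part of this repo, and the local_dir value is an arbitrary choice.

# download_examples.py -- hedged alternative to the git clone above
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="nvidia/Cosmos-Transfer1-7B-Sample-AV-Data-Example",  # repo_id taken from the clone URL above
    repo_type="dataset",
    local_dir="Cosmos-Transfer1-7B-Sample-AV-Data-Example",
)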

Visualize Dataset

You can use visualize_rds_hq.py to visualize the RDS-HQ dataset.

python visualize_rds_hq.py -i <RDS_HQ_FOLDER> -c <CLIP_ID>

This Python script will launch a viser server to visualize the 3D HD map world with dynamic bounding boxes. You can use:

  • w a s d to control the camera's position
  • q e to control the camera's z-axis coordinate


Note

You can run this script on a remote server with VS Code and the Remote-SSH plugin; VS Code will automatically forward the viser port to your local machine.

Rendering Cosmos Input Control Video

You can use render_from_rds_hq.py to render the HD map + bounding box / LiDAR condition videos from the RDS-HQ dataset. A GPU is required for LiDAR rendering.

python render_from_rds_hq.py -i <RDS_HQ_FOLDER> -o <OUTPUT_FOLDER> [--skip hdmap] [--skip lidar]

This will automatically launch multiple jobs based on Ray. If you want to use a single process (e.g., for debugging), set USE_RAY = False in render_from_rds_hq.py. You can add --skip hdmap or --skip lidar to skip the rendering of HD map or LiDAR, respectively.
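
At a high level, the Ray-based multi-job launch follows the usual remote-task fan-out pattern. The sketch below is illustrative only and is not the repo's actual job logic, which lives in render_from_rds_hq.py.

import ray

USE_RAY = True  # mirrors the flag described above; set False to run serially for debugging

def render_clip(clip_id: str) -> str:
    # placeholder for the per-clip HD map / LiDAR rendering work
    return f"rendered {clip_id}"

clips = ["clip_000", "clip_001", "clip_002"]
if USE_RAY:
    ray.init()
    remote_render = ray.remote(render_clip)
    futures = [remote_render.remote(c) for c in clips]  # one Ray task per clip
    print(ray.get(futures))
else:
    print([render_clip(c) for c in clips])  # single-process path for debugging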

RDS-HQ Rendering Results

Note

If you're interested, we offer documentation that explains the NVIDIA f-theta camera in detail.
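
For quick intuition (the documentation above is the authoritative reference): an ideal pinhole camera maps a ray at angle θ from the optical axis to image radius r = f·tan θ, while an ideal f-theta camera maps it to r = f·θ; the NVIDIA model generalizes the latter with a polynomial in θ. The sketch below is a rough numerical comparison with an illustrative focal length, not the NVIDIA model itself.

import numpy as np

f = 1000.0  # focal length in pixels (illustrative value)
theta = np.deg2rad(np.array([10.0, 30.0, 60.0]))  # ray angles from the optical axis

r_pinhole = f * np.tan(theta)  # pinhole: radius diverges as the angle approaches 90 degrees
r_ftheta = f * theta           # ideal f-theta: radius grows linearly with angle

for t, rp, rf in zip(np.rad2deg(theta), r_pinhole, r_ftheta):
    print(f"theta={t:5.1f} deg  pinhole r={rp:8.1f} px  f-theta r={rf:8.1f} px")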

The output folder structure will look like this. Note that the videos folder is only generated when --post_training true is set.

<OUTPUT_FOLDER>
├── hdmap
│   └── {camera_type}_{camera_name}
│       ├── <CLIP_ID>_0.mp4
│       ├── <CLIP_ID>_1.mp4
│       └── ...
│
├── lidar
│   └── {camera_type}_{camera_name}
│       ├── <CLIP_ID>_0.mp4
│       ├── <CLIP_ID>_1.mp4
│       └── ...
│
└── videos
    └── {camera_type}_{camera_name}
        ├── <CLIP_ID>_0.mp4
        ├── <CLIP_ID>_1.mp4
        └── ...
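
A small sketch for checking the rendered outputs against this layout (a hypothetical helper, assuming the folder structure above):

from pathlib import Path

output_folder = Path("<OUTPUT_FOLDER>")  # replace with your actual output path
for modality in ("hdmap", "lidar", "videos"):
    # matches {camera_type}_{camera_name}/<CLIP_ID>_<index>.mp4
    clips = sorted((output_folder / modality).glob("*/*.mp4"))
    print(f"{modality}: {len(clips)} mp4 files")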

Generate Novel Ego Trajectory

You can also use visualize_rds_hq.py to generate novel trajectories.

python visualize_rds_hq.py -i <RDS_HQ_FOLDER> [-np <NOVEL_POSE_FOLDER_NAME>]

Here <NOVEL_POSE_FOLDER_NAME> is the folder name for novel pose data. By default, it will be novel_pose.

Using the panel on the right, record keyframe poses (make sure to include the first and last frames); the script will interpolate all intermediate poses and save them as a .tar file in the novel pose folder under <RDS_HQ_FOLDER>. Then you can pass --novel_pose_folder <NOVEL_POSE_FOLDER_NAME> to the rendering script render_from_rds_hq.py to use the novel ego trajectory.

python render_from_rds_hq.py -i <RDS_HQ_FOLDER> -o <OUTPUT_FOLDER> -np <NOVEL_POSE_FOLDER_NAME>
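
The keyframe-to-trajectory step described above amounts to interpolating poses between the recorded keyframes. The sketch below illustrates that idea with linear interpolation for positions and spherical interpolation for orientations; the script's actual interpolation scheme and .tar format may differ.

import numpy as np
from scipy.spatial.transform import Rotation, Slerp

# two illustrative keyframes: timestamps, positions (x, y, z), orientations (x, y, z, w quaternions)
key_times = np.array([0.0, 1.0])
key_positions = np.array([[0.0, 0.0, 1.5], [10.0, 2.0, 1.5]])
key_rotations = Rotation.from_quat([[0, 0, 0, 1], [0, 0, 0.2588, 0.9659]])  # identity -> ~30 deg yaw

query_times = np.linspace(0.0, 1.0, 5)
positions = np.stack(
    [np.interp(query_times, key_times, key_positions[:, i]) for i in range(3)], axis=1
)
quats = Slerp(key_times, key_rotations)(query_times).as_quat()

for t, p, q in zip(query_times, positions, quats):
    print(f"t={t:.2f}  position={p}  quaternion={q}")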

Convert Public Datasets

We provide a conversion and rendering script for the Waymo Open Dataset as an example of how information from another AV source can interface with the model.

Waymo Rendering Results (using f-theta intrinsics from RDS-HQ)

Waymo Rendering Results (using pinhole intrinsics from the Waymo Open Dataset)

Our Model with Waymo Inputs

Prompting During Inference

We provide a captioning modification example to help users reproduce our results. To modify the weather in a given prompt, we use an LLM. Below is an example transformation request:

Given the prompt:
"The video is captured from a camera mounted on a car. The camera is facing forward. The video depicts a driving scene in an urban environment. The car hood is white. The camera is positioned inside a vehicle, providing a first-person perspective of the road ahead. The street is lined with modern buildings, including a tall skyscraper on the right and a historic-looking building on the left. The road is clear of traffic, with only a few distant vehicles visible in the distance. The weather appears to be clear and sunny, with a blue sky and some clouds. The time of day seems to be daytime, as indicated by the bright sunlight and shadows. The scene is quiet and devoid of pedestrians or other obstacles, suggesting a smooth driving experience."
Modify the environment to:
1. Morning with fog
2. Golden hour with sunlight
3. Heavy snowy day
4. Heavy rainy day
5. Heavy fog in the evening
...

You can use the modified text prompts as input to our model.
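
A minimal sketch of automating that transformation with an LLM is shown below. The repo does not prescribe a specific model or API; this assumes an OpenAI-compatible chat endpoint, and the model name is an illustrative choice.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any capable LLM service works here

base_prompt = "The video is captured from a camera mounted on a car. ..."  # full prompt from above
weather = "Heavy snowy day"

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[
        {
            "role": "user",
            "content": f'Given the prompt:\n"{base_prompt}"\n'
                       f"Modify the environment to: {weather}. Return only the rewritten prompt.",
        },
    ],
)
print(response.choices[0].message.content)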

Citation

@misc{nvidia2025cosmostransfer1,
  title     = {Cosmos Transfer1: World Generation with Adaptive Multimodal Control},
  author    = {NVIDIA}, 
  year      = {2025},
  url       = {https://arxiv.org/abs/2503.14492}
}
