This is the official implementation of our RSS 2025 paper:
Learning to Act Anywhere with Task-centric Latent Actions
We introduce UniVLA, a unified vision-language-action framework that enables policy learning across diverse environments. By deriving task-centric latent actions in an unsupervised manner, UniVLA can leverage data from arbitrary embodiments and perspectives without action labels. After large-scale pretraining on videos, UniVLA yields a cross-embodiment generalist policy that can be readily deployed on various robots by learning a lightweight action decoder at minimal cost. Compared to OpenVLA, UniVLA shows consistent improvements across multiple manipulation and navigation tasks.
📄 Paper | 🚀 Demo Page (Coming Soon)
✒️ Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li
📧 Primary Contact: Qingwen Bu ([email protected])
- A recipe towards generalist policy by planning in a unified, embodiment-agnostic action space.
- A novel approach for extracting task-centric latent actions from cross-embodiment videos.
- A VLA that achieves state-of-the-art results on multiple benchmarks with compute-efficient training.
Real-world robot experiments.
| Store the screwdriver (1x speed) | Clean the cutting board (1x speed) | Fold towel twice (1x speed) |
|---|---|---|
| Task1.mp4 | Task2.mp4 | Task3.mp4 |

| Stack the tower of hanoi (1x speed) | | |
|---|---|---|
| Task4_ours_success_case_1.mp4 | Task4_ours_success_case_2.mp4 | Task4_ours_success_case_3.mp4 |
- [2025/05] The code of UniVLA v1.0 is released. Please check it out!
- 1) Latent action model
- 2) Pre-trained Models
- Full (Manip. + Navi. + Human)
- BridgeV2-Only
- Human-Only
- 3) Downstream Fine-tuned Models
- LIBERO
- Room2Room
- CALVIN
- SimplerEnv
- 1) LIBERO
- 2) Room2Room
- 3) CALVIN
- 4) SimplerEnv
- Code and Docs
- Code for converting Ego4D into RLDS format
| Model Name | Backbone | HF Path | Note |
|---|---|---|---|
| lam-stage-1 | - | univla-latent-action-model | The stage-1 latent action model trained on OpenX and Ego4D. |
| lam-stage-2 | - | univla-latent-action-model | The stage-2 latent action model trained on OpenX and Ego4D (generates task-centric latent actions). |
| univla-7b | TRI-ML/prismatic-vlms/prism-dinosiglip-224px+7b | univla-7b | UniVLA pretrained on our full data collection (Manip. + Navi. + Human). |
| univla-7b-224-sft-libero | univla-7b | univla-7b-224-sft-libero | Finetuned on the LIBERO dataset. |
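The checkpoints can be pulled from the Hugging Face Hub with `huggingface_hub`, for example (a minimal sketch; the `repo_id` below is a placeholder, substitute the HF Path of the model you need from the table above):

```python
# Minimal sketch: download a checkpoint from the Hugging Face Hub.
# The repo_id is a placeholder -- replace it with the "HF Path" of the model you need.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<hf-org>/univla-7b",          # placeholder, see the table above
    local_dir="./checkpoints/univla-7b",   # where to store the files locally
)
print(f"Checkpoint files downloaded to {local_dir}")
```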
- (Optional) We use conda to manage the environment.
```bash
conda create -n univla python=3.10 -y
conda activate univla
```
- Install dependencies.
```bash
# Install PyTorch
# See https://pytorch.org/get-started/previous-versions/ for the command matching your CUDA version
# Our experiments are conducted with 'torch 2.2.0 + cuda 12.1'
pip install torch torchvision

# Clone our repo and pip install to download dependencies
git clone git@github.com:OpenDriveLab/UniVLA.git
cd UniVLA
pip install -e .

# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
pip install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation
```
We highly recommend directly using our pre-trained latent action model checkpoints to save your time and compute.
Note
Our latent action model is trained on a comprehensive data collection, encompassing multiple robotic manipulation and navigation datasets from Open X-Embodiment, along with a curated subset of the Ego4D dataset (detailed data construction procedures are provided in the appendix of our paper).
To adapt the model to additional datasets or custom data sources, users may refer to `./prismatic/vla/datasets/rlds/oxe/mixtures.py` to either utilize a predefined data mixture or define a new one, and then update the `data_mix` parameter in the configuration file accordingly.
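For reference, a new mixture can be declared roughly as follows (a sketch assuming the OpenVLA-style `OXE_NAMED_MIXTURES` registry; the dataset names and sampling weights are purely illustrative):

```python
# Sketch of a named data mixture as typically declared in
# ./prismatic/vla/datasets/rlds/oxe/mixtures.py (OpenVLA-style registry).
# Dataset names and sampling weights below are illustrative, not part of the repo.
from typing import Dict, List, Tuple

OXE_NAMED_MIXTURES: Dict[str, List[Tuple[str, float]]] = {
    "my_custom_mix": [
        ("bridge_orig", 1.0),      # (RLDS dataset name, sampling weight)
        ("my_new_dataset", 2.0),   # hypothetical custom dataset converted to RLDS
    ],
}
# Then set `data_mix` in the latent action model config to "my_custom_mix".
```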
The latent action model is implemented based on a VQ-VAE. We train it on a data collection comprising robot manipulation, navigation, and human videos. In stage-1 training, we use an overall batch size of 512 and 100k optimization steps to construct the task-irrelevant latent actions:
```bash
torchrun --standalone --nnodes 1 --nproc-per-node 8 main.py fit \
    --config config/lam-stage-1.yaml \
    2>&1 | tee lam-stage-1.log
```
The following stage-2 training then learns task-centric latent actions on the basis of the stage-1 results. Please modify `stage_one_ckpt` in `latent_action_model/config/lam-stage-2.yaml` to point to your local stage-1 checkpoint, then run training with:
```bash
torchrun --standalone --nnodes 1 --nproc-per-node 8 main.py fit \
    --config config/lam-stage-2.yaml \
    2>&1 | tee lam-stage-2.log
```
- **Latent Action Pseudo-Labeling for Policy Optimization**: The trained latent action model is employed to generate pseudo-labels for policy optimization via a next-token prediction objective. Specifically, the indices of inferred latent actions in the VQ-VAE codebook are mapped to dedicated tokens in the LLaMA tokenizer, denoted as `{ACT_0, ACT_1, ..., ACT_C}` (see the sketch after the training script below).
- **Cost-effective Pre-Training**: Full-scale pre-training (combining the OpenX and Ego4D datasets) was conducted on a 32-GPU A100 cluster for 20,000 optimization steps. In contrast, experiments on the 'Bridge' and 'Human' subsets required only 8 A100 GPUs, totaling 200 GPU-hours, significantly fewer computational resources than prior vision-language-action models.
- To initiate pre-training, please refer to the following script or simply run `bash ./vla-scripts/train.sh`:
Note
For pretraining UniVLA only on BridgeV2 or Human (Ego4D) data, please modify `vla.type` to `prism-dinosiglip-224px+mx-bridge(human)` accordingly. Detailed setups can be found in `./prismatic/conf/vla.py`.
```bash
GPUS_PER_NODE=8
NNODES=4
MASTER_PORT=${MASTER_PORT:-28596}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
RANK=${RANK:-0}

# Run your training script with torchrun
torchrun --nproc_per_node ${GPUS_PER_NODE} --nnodes ${NNODES} --node_rank ${RANK} --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} train.py \
    --vla.type prism-dinosiglip-224px+mx-oxe-magic-soup-plus \
    --run_root_dir "vla_log" \
```
- With the generalist policy pretrained to plan over an embodiment-agnostic action space, we then add embodiment-specific action decoder heads for downstream deployment.
- Our action decoder is extremely lightweight, with only around 12M parameters (a minimal sketch of such a decoder head is shown below). Using parameter-efficient fine-tuning with LoRA rank 32, the total number of trainable parameters is around 123M.
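As a rough picture of what such a head can look like, the sketch below implements a small MLP decoder that maps pooled latent-action embeddings from the VLA backbone to a chunk of continuous actions (all dimensions are assumptions for illustration; the actual decoder in this repo may differ in architecture and inputs):

```python
# Minimal sketch of a lightweight, embodiment-specific action decoder head.
# All dimensions (embedding size, hidden size, action dim, chunk length) are
# assumptions for illustration; the actual decoder in this repo may differ.
import torch
import torch.nn as nn

class ActionDecoderSketch(nn.Module):
    def __init__(self, embed_dim: int = 4096, hidden_dim: int = 2048,
                 action_dim: int = 7, chunk_len: int = 12):
        super().__init__()
        self.action_dim, self.chunk_len = action_dim, chunk_len
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, action_dim * chunk_len),
        )

    def forward(self, latent_action_embedding: torch.Tensor) -> torch.Tensor:
        # latent_action_embedding: (batch, embed_dim), pooled from the latent-action tokens
        out = self.mlp(latent_action_embedding)
        return out.view(-1, self.chunk_len, self.action_dim)  # (batch, chunk, action_dim)

decoder = ActionDecoderSketch()
print(f"{sum(p.numel() for p in decoder.parameters()) / 1e6:.1f}M parameters")
```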
Please first download the LIBERO datasets that we used in our experiments.
Start training with `torchrun`:
- You should first set the pretrained UniVLA and latent action model paths in `vla_path` and `lam_path` of the training config.
- Set your local LIBERO dataset path in `data_root_dir`.
- You can choose `dataset_name` from `libero_spatial_no_noops`, `libero_object_no_noops`, `libero_goal_no_noops`, and `libero_10_no_noops`.
- We trained on 'Spatial', 'Object' and 'Goal' for 30k steps and on 'Long' for 40k steps. Please first modify `max_steps` in the training config accordingly for reproduction.
```bash
# Start training on LIBERO-10 (Long) with 8 GPUs
torchrun --standalone --nnodes 1 --nproc-per-node 8 finetune_libero.py \
    --dataset_name "libero_10_no_noops" \
    --run_root_dir "libero_log" \
```
Once you have finished training and obtained the action decoder and VLA backbone, you can simply start evaluation with:
```bash
# Start evaluation on LIBERO-10
# By default, we test 50 rollouts for every task, totalling 500 independent trials.
# --task_suite_name: choose from [libero_spatial, libero_object, libero_goal, libero_10]
# --save_video: whether to save rollout videos
python experiments/robot/libero/run_libero_eval_decoder.py \
    --task_suite_name libero_10 \
    --action_decoder_path /path/to/your/action_decoder_path.pt \
    --pretrained_checkpoint /path/to/your/libero_10_finetuned_univla \
    --save_video False \
    --seed 7
```
To be updated.
Note
LIBERO Simulation Benchmark Results.
| Model | LIBERO-Spatial SR (↑) | Rank (↓) | LIBERO-Object SR (↑) | Rank (↓) | LIBERO-Goal SR (↑) | Rank (↓) | LIBERO-Long SR (↑) | Rank (↓) | Average SR (↑) | Rank (↓) |
|---|---|---|---|---|---|---|---|---|---|---|
| Diffusion Policy | 78.3 ± 1.1% | 5 | 92.5 ± 0.7% | 2 | 68.3 ± 1.2% | 5 | 50.5 ± 1.3% | 5 | 72.4 ± 0.7% | 5 |
| Octo | 78.9 ± 1.0% | 4 | 85.7 ± 0.9% | 4 | 84.6 ± 0.9% | 2 | 51.1 ± 1.3% | 4 | 75.1 ± 0.6% | 3 |
| OpenVLA | 84.7 ± 0.9% | 2 | 88.4 ± 0.8% | 3 | 79.2 ± 1.0% | 3 | 53.7 ± 1.3% | 3 | 76.5 ± 0.6% | 2 |
| TraceVLA | 84.6 ± 0.2% | 3 | 85.2 ± 0.4% | 5 | 75.1 ± 0.3% | 4 | 54.1 ± 1.0% | 2 | 74.8 ± 0.5% | 4 |
| UniVLA (Ours) | 96.5 ± 0.5% | 1 | 96.8 ± 0.5% | 1 | 95.6 ± 0.4% | 1 | 92.0 ± 1.0% | 1 | 95.2 ± 0.3% | 1 |
Note
LIBERO Results with Limited Data. (Models are trained with 10%, 20%, 50%, and the full dataset)
| Model | LIBERO-Goal 10% | 20% | 50% | 100% | LIBERO-Long 10% | 20% | 50% | 100% |
|---|---|---|---|---|---|---|---|---|
| ATM | 64.3% | 77.1% | - | - | 36.5% | 39.1% | - | - |
| OpenVLA | 61.4% | 66.0% | 77.0% | 79.2% | 11.6% | 22.4% | 36.6% | 53.7% |
| OpenVLA-OFT | 76.8% | 88.2% | 91.1% | 96.2% | 43.0% | 62.2% | 77.8% | 90.7% |
| UniVLA (Ours) | 86.3% | 90.4% | 93.1% | 95.6% | 62.4% | 71.4% | 87.0% | 92.0% |
Note
Real-world Experiments.
If you find our code or models useful in your work, please cite our paper:
@article{bu2025univla,
title={UniVLA: Learning to Act Anywhere with Task-centric Latent Actions},
author={Qingwen Bu and Yanting Yang and Jisong Cai and Shenyuan Gao and Guanghui Ren and Maoqing Yao and Ping Luo and Hongyang Li},
journal={arXiv preprint arXiv:2505.06111},
year={2025}
}
We thank OpenVLA for their open-source work!