This is the official implementation of our RSS 2025 paper:
Learning to Act Anywhere with Task-centric Latent Actions
We introduce UniVLA, a unified vision-language-action framework that enables policy learning across diverse environments. By deriving task-centric latent actions in an unsupervised manner, UniVLA can leverage data from arbitrary embodiments and perspectives without action labels. After large-scale pretraining on videos, UniVLA yields a cross-embodiment generalist policy that can be readily deployed on various robots by learning a lightweight action decoder at minimal cost. Compared to OpenVLA, UniVLA shows consistent improvements across multiple manipulation and navigation tasks.
📄 Paper | 🚀 Demo Page (Coming Soon)
✒️ Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li
📧 Primary Contact: Qingwen Bu ([email protected])
- A recipe towards generalist policy by planning in a unified, embodiment-agnostic action space.
- A novel approach for extracting task-centric latent actions from cross-embodiment videos.
- A VLA that achieves state-of-the-art results on multiple benchmarks with compute-efficient training.
Real-world robot experiments.
| Store the screwdriver (1x speed) | Clean the cutting board (1x speed) | Fold towel twice (1x speed) |
|---|---|---|
| Task1.mp4 | Task2.mp4 | Task3.mp4 |

| Stack the tower of hanoi (1x speed) | | |
|---|---|---|
| Task4_ours_success_case_1.mp4 | Task4_ours_success_case_2.mp4 | Task4_ours_success_case_3.mp4 |
- [2025/05] The code of UniVLA v1.0 is released. Please check it out!
- 1) Latent action model
- 2) Pre-trained Models
- Full (Manip. + Navi. + Human)
- BridgeV2-Only
- Human-Only
- 3) Downstream Fine-tuned Models
- LIBERO
- Room2Room
- CALVIN
- SimplerEnv
- 1) LIBERO
- 2) Room2Room
- 3) CALVIN
- 4) SimplerEnv
- Code and Docs
- Code for converting Ego4D into RLDS format
| Model Name | Backbone | HF Path | Note |
|---|---|---|---|
| lam-stage-1 | - | univla-latent-action-model | The stage-1 latent action model trained on OpenX and Ego4D. |
| lam-stage-2 | - | univla-latent-action-model | The stage-2 latent action model trained on OpenX and Ego4D (generates task-centric latent actions). |
| univla-7b | TRI-ML/prismatic-vlms/prism-dinosiglip-224px+7b | univla-7b | UniVLA pretrained on our full data collection (Manip. + Navi. + Human). |
| univla-7b-224-sft-libero | univla-7b | univla-7b-224-sft-libero | Finetuned on the LIBERO dataset. |
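The checkpoints can be pulled from the Hugging Face Hub with `huggingface_hub`, for example (a minimal sketch; the `repo_id` below is a placeholder, substitute the HF Path of the model you need from the table above):

```python
# Minimal sketch: download a checkpoint from the Hugging Face Hub.
# The repo_id is a placeholder -- replace it with the "HF Path" of the model you need.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<hf-org>/univla-7b",          # placeholder, see the table above
    local_dir="./checkpoints/univla-7b",   # where to store the files locally
)
print(f"Checkpoint files downloaded to {local_dir}")
```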
- (Optional) We use conda to manage the environment.
```bash
conda create -n univla python=3.10 -y
conda activate univla
```
- Install dependencies.
```bash
# Install PyTorch
# See https://pytorch.org/get-started/previous-versions/ for the command matching your CUDA version
# Our experiments are conducted with 'torch 2.2.0 + cuda 12.1'
pip install torch torchvision

# Clone our repo and pip install to download dependencies
git clone git@github.com:OpenDriveLab/UniVLA.git
cd UniVLA
pip install -e .

# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
pip install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation
```
We highly recommend directly using our pre-trained latent action model checkpoints to save your time and compute.
Note
Our latent action model is trained on a comprehensive data collection, encompassing multiple robotic manipulation and navigation datasets from Open X-Embodiment, along with a curated subset of the Ego4D dataset (detailed data construction procedures are provided in the appendix of our paper).
To adapt the model to additional datasets or custom data sources, users may refer to `./prismatic/vla/datasets/rlds/oxe/mixtures.py` to either utilize a predefined data mixture or define a new one, and then update the `data_mix` parameter in the configuration file accordingly.
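For reference, a new mixture can be declared roughly as follows (a sketch assuming the OpenVLA-style `OXE_NAMED_MIXTURES` registry; the dataset names and sampling weights are purely illustrative):

```python
# Sketch of a named data mixture as typically declared in
# ./prismatic/vla/datasets/rlds/oxe/mixtures.py (OpenVLA-style registry).
# Dataset names and sampling weights below are illustrative, not part of the repo.
from typing import Dict, List, Tuple

OXE_NAMED_MIXTURES: Dict[str, List[Tuple[str, float]]] = {
    "my_custom_mix": [
        ("bridge_orig", 1.0),      # (RLDS dataset name, sampling weight)
        ("my_new_dataset", 2.0),   # hypothetical custom dataset converted to RLDS
    ],
}
# Then set `data_mix` in the latent action model config to "my_custom_mix".
```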
The latent action model is implemented based on a VQ-VAE. We train it on a data collection comprising robot manipulation, navigation, and human videos. In stage-1 training, we use an overall batch size of 512 and 100k optimization steps to construct the task-irrelevant latent actions:
```bash
torchrun --standalone --nnodes 1 --nproc-per-node 8 main.py fit \
    --config config/lam-stage-1.yaml \
    2>&1 | tee lam-stage-1.log
```
The following stage-2 training then learns task-centric latent actions on the basis of the stage-1 results. Please modify `stage_one_ckpt` in `latent_action_model/config/lam-stage-2.yaml` to point to your local stage-1 checkpoint, then run training with:
```bash
torchrun --standalone --nnodes 1 --nproc-per-node 8 main.py fit \
    --config config/lam-stage-2.yaml \
    2>&1 | tee lam-stage-2.log
```
- **Latent Action Pseudo-Labeling for Policy Optimization**: The trained latent action model is employed to generate pseudo-labels for policy optimization via a next-token prediction objective. Specifically, the indices of inferred latent actions in the VQ-VAE codebook are mapped to dedicated tokens in the LLaMA tokenizer, denoted as `{ACT_0, ACT_1, ..., ACT_C}` (see the sketch after the training script below).
- **Cost-effective Pre-Training**: Full-scale pre-training (combining the OpenX and Ego4D datasets) was conducted on a 32-GPU A100 cluster for 20,000 optimization steps. In contrast, experiments on the 'Bridge' and 'Human' subsets required only 8 A100 GPUs, totaling 200 GPU-hours, significantly fewer computational resources than prior vision-language-action models.
- To initiate pre-training, please refer to the following script or simply run `bash ./vla-scripts/train.sh`:
Note
For pretraining UniVLA only on BridgeV2 or Human (Ego4D) data, please modify `vla.type` to `prism-dinosiglip-224px+mx-bridge(human)` accordingly. Detailed setups can be found in `./prismatic/conf/vla.py`.
```bash
GPUS_PER_NODE=8
NNODES=4
MASTER_PORT=${MASTER_PORT:-28596}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
RANK=${RANK:-0}

# Run your training script with torchrun
torchrun --nproc_per_node ${GPUS_PER_NODE} --nnodes ${NNODES} --node_rank ${RANK} --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} train.py \
    --vla.type prism-dinosiglip-224px+mx-oxe-magic-soup-plus \
    --run_root_dir "vla_log" \
```
- With the generalist policy pretrained to plan over an embodiment-agnostic action space, we then add embodiment-specific action decoder heads for downstream deployment.
- Our action decoder is extremely lightweight, with only around 12M parameters (a minimal sketch of such a decoder head is shown below). Using parameter-efficient fine-tuning with LoRA rank 32, the total number of trainable parameters is around 123M.
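As a rough picture of what such a head can look like, the sketch below implements a small MLP decoder that maps pooled latent-action embeddings from the VLA backbone to a chunk of continuous actions (all dimensions are assumptions for illustration; the actual decoder in this repo may differ in architecture and inputs):

```python
# Minimal sketch of a lightweight, embodiment-specific action decoder head.
# All dimensions (embedding size, hidden size, action dim, chunk length) are
# assumptions for illustration; the actual decoder in this repo may differ.
import torch
import torch.nn as nn

class ActionDecoderSketch(nn.Module):
    def __init__(self, embed_dim: int = 4096, hidden_dim: int = 2048,
                 action_dim: int = 7, chunk_len: int = 12):
        super().__init__()
        self.action_dim, self.chunk_len = action_dim, chunk_len
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, action_dim * chunk_len),
        )

    def forward(self, latent_action_embedding: torch.Tensor) -> torch.Tensor:
        # latent_action_embedding: (batch, embed_dim), pooled from the latent-action tokens
        out = self.mlp(latent_action_embedding)
        return out.view(-1, self.chunk_len, self.action_dim)  # (batch, chunk, action_dim)

decoder = ActionDecoderSketch()
print(f"{sum(p.numel() for p in decoder.parameters()) / 1e6:.1f}M parameters")
```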
Please first download the LIBERO datasets that we used in our experiments.
Start training with `torchrun`:
- You should first set the pretrained UniVLA and latent action model paths in `vla_path` and `lam_path` of the training config.
- Set your local LIBERO dataset path in `data_root_dir`.
- You can choose `dataset_name` from `libero_spatial_no_noops`, `libero_object_no_noops`, `libero_goal_no_noops`, and `libero_10_no_noops`.
- We trained on 'Spatial', 'Object' and 'Goal' for 30k steps and on 'Long' for 40k steps. Please first modify `max_steps` in the training config accordingly for reproduction.
```bash
# Start training on LIBERO-10 (Long) with 8 GPUs
torchrun --standalone --nnodes 1 --nproc-per-node 8 finetune_libero.py \
    --dataset_name "libero_10_no_noops" \
    --run_root_dir "libero_log" \
```
Once you have finished training and obtained the action decoder and VLA backbone, you can simply start evaluation with:
```bash
# Start evaluation on LIBERO-10
# By default, we test 50 rollouts for every task, totalling 500 independent trials.
# --task_suite_name: choose from [libero_spatial, libero_object, libero_goal, libero_10]
# --save_video: whether to save rollout videos
python experiments/robot/libero/run_libero_eval_decoder.py \
    --task_suite_name libero_10 \
    --action_decoder_path /path/to/your/action_decoder_path.pt \
    --pretrained_checkpoint /path/to/your/libero_10_finetuned_univla \
    --save_video False \
    --seed 7
```
To be updated.
Note
LIBERO Simulation Benchmark Results.
| Model | LIBERO-Spatial SR (↑) | Rank (↓) | LIBERO-Object SR (↑) | Rank (↓) | LIBERO-Goal SR (↑) | Rank (↓) | LIBERO-Long SR (↑) | Rank (↓) | Average SR (↑) | Rank (↓) |
|---|---|---|---|---|---|---|---|---|---|---|
| Diffusion Policy | 78.3 ± 1.1% | 5 | 92.5 ± 0.7% | 2 | 68.3 ± 1.2% | 5 | 50.5 ± 1.3% | 5 | 72.4 ± 0.7% | 5 |
| Octo | 78.9 ± 1.0% | 4 | 85.7 ± 0.9% | 4 | 84.6 ± 0.9% | 2 | 51.1 ± 1.3% | 4 | 75.1 ± 0.6% | 3 |
| OpenVLA | 84.7 ± 0.9% | 2 | 88.4 ± 0.8% | 3 | 79.2 ± 1.0% | 3 | 53.7 ± 1.3% | 3 | 76.5 ± 0.6% | 2 |
| TraceVLA | 84.6 ± 0.2% | 3 | 85.2 ± 0.4% | 5 | 75.1 ± 0.3% | 4 | 54.1 ± 1.0% | 2 | 74.8 ± 0.5% | 4 |
| UniVLA (Ours) | 96.5 ± 0.5% | 1 | 96.8 ± 0.5% | 1 | 95.6 ± 0.4% | 1 | 92.0 ± 1.0% | 1 | 95.2 ± 0.3% | 1 |
Note
LIBERO Results with Limited Data. (Models are trained with 10%, 20%, 50%, and the full dataset)
| Model | LIBERO-Goal 10% | 20% | 50% | 100% | LIBERO-Long 10% | 20% | 50% | 100% |
|---|---|---|---|---|---|---|---|---|
| ATM | 64.3% | 77.1% | - | - | 36.5% | 39.1% | - | - |
| OpenVLA | 61.4% | 66.0% | 77.0% | 79.2% | 11.6% | 22.4% | 36.6% | 53.7% |
| OpenVLA-OFT | 76.8% | 88.2% | 91.1% | 96.2% | 43.0% | 62.2% | 77.8% | 90.7% |
| UniVLA (Ours) | 86.3% | 90.4% | 93.1% | 95.6% | 62.4% | 71.4% | 87.0% | 92.0% |
Note
Real-world Experiments.
If you find our code or models useful in your work, please cite our paper:
@article{bu2025univla,
title={UniVLA: Learning to Act Anywhere with Task-centric Latent Actions},
author={Qingwen Bu and Yanting Yang and Jisong Cai and Shenyuan Gao and Guanghui Ren and Maoqing Yao and Ping Luo and Hongyang Li},
journal={arXiv preprint arXiv:2505.06111},
year={2025}
}
We thank OpenVLA for their open-source work!