
🌏 UniVLA

This is the official implementation of our RSS 2025 paper:
Learning to Act Anywhere with Task-centric Latent Actions

Overview of UniVLA:

We introduce UniVLA, a unified vision-language-action framework that enables policy learning across different environments. By deriving task-centric latent actions in an unsupervised manner, UniVLA can leverage data from arbitrary embodiments and perspectives without action labels. After large-scale pretraining on videos, UniVLA yields a cross-embodiment generalist policy that can be readily deployed across various robots by learning an action decoder at minimal cost. Compared to OpenVLA, UniVLA shows consistent improvements on multiple manipulation and navigation tasks.

✒️ Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li
📧 Primary Contact: Qingwen Bu ([email protected])

🔥 Highlights

  • A recipe towards generalist policy by planning in a unified, embodiment-agnostic action space.
  • A novel approach for extracting task-centric latent actions from cross-embodiment videos.
  • A VLA that achieves state-of-the-art results on multiple benchmarks with compute-efficient training.

🎥 Demo

Real-world robot experiments.

Store the screwdriver (1x speed) Clean the cutting board (1x speed) Fold towel twice (1x speed)
Task1.mp4
Task2.mp4
Task3.mp4
Stack the Tower of Hanoi (1x speed)
Task4_ours_success_case_1.mp4
Task4_ours_success_case_2.mp4
Task4_ours_success_case_3.mp4

📢 News

  • [2025/05] The code of UniVLA v1.0 is released. Please check it out!

📌 TODO list

1. 🤗 Checkpoints Release

  • 1) Latent action model
  • 2) Pre-trained Models
    • Full (Manip. + Navi. + Human)
    • BridgeV2-Only
    • Human-Only
  • 3) Downstream Fine-tuned Models
    • LIBERO
    • Room2Room
    • CALVIN
    • SimplerEnv

2. 💪 Training and Evaluation Code on Simulation Benchmarks

  • 1) LIBERO
  • 2) Room2Room
  • 3) CALVIN
  • 4) SimplerEnv

3. 💫 Code and Guidelines for Real-world Deployment

  • Code and Docs

4. 💁 Scripts for Pre-processing Human Dataset

  • Code for converting Ego4D into RLDS format

🤗 Model Zoo

| Model Name | Backbone | HF Path | Note |
|---|---|---|---|
| lam-stage-1 | - | univla-latent-action-model | The stage-1 latent action model trained on OpenX and Ego4D. |
| lam-stage-2 | - | univla-latent-action-model | The stage-2 latent action model trained on OpenX and Ego4D (generates task-centric latent actions). |
| univla-7b | TRI-ML/prismatic-vlms/prism-dinosiglip-224px+7b | univla-7b | UniVLA pretrained on our full data collection (Manip. + Navi. + Human). |
| univla-7b-224-sft-libero | univla-7b | univla-7b-224-sft-libero | Fine-tuned on the LIBERO dataset. |
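
The checkpoints above are referenced by their Hugging Face paths. Below is a minimal sketch for fetching them with huggingface_hub; the organization prefix is a placeholder, not the actual repo ID, so substitute the real path from the table.

from huggingface_hub import snapshot_download

# Placeholder repo IDs: replace "<hf-org>" with the actual organization/name
# listed in the HF Path column above.
lam_dir = snapshot_download(repo_id="<hf-org>/univla-latent-action-model")
vla_dir = snapshot_download(repo_id="<hf-org>/univla-7b")
print("Latent action model at:", lam_dir)
print("UniVLA-7B at:", vla_dir)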

🎮 Getting Started

  1. (Optional) We use conda to manage the environment.
conda create -n univla python=3.10 -y
conda activate univla
  2. Install dependencies.
# Install pytorch
# Look up https://pytorch.org/get-started/previous-versions/ with your cuda version for a correct command
# Our experiments are conducted with 'torch 2.2.0 + cuda 12.1'
pip install torch torchvision

# Clone our repo and pip install to download dependencies
git clone git@github.com:OpenDriveLab/UniVLA.git
cd UniVLA
pip install -e .

# Install Flash Attention 2 for training (https://github.com/Dao-AILab/flash-attention)
pip install packaging ninja
ninja --version; echo $?  # Verify Ninja --> should return exit code "0"
pip install "flash-attn==2.5.5" --no-build-isolation

🔥 Training Recipe

1️⃣ Task-centric Latent Action Learning

We highly recommend directly using our pre-trained latent action model checkpoints to save time and compute.

Note

Our latent action model is trained on a comprehensive data collection, encompassing multiple robotic manipulation and navigation datasets from Open X-Embodiment, along with a curated subset of the Ego4D dataset (detailed data construction procedures are provided in the appendix of our paper).

To adapt the model to additional datasets or custom data sources, users may refer to ./prismatic/vla/datasets/rlds/oxe/mixtures.py to either utilize predefined data mixtures or define new ones. Subsequently, the data_mix parameter in the configuration file should be updated accordingly.
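
For illustration only, a data mixture in OpenVLA-style codebases is essentially a named list of (RLDS dataset name, sampling weight) pairs. The registry and dataset names below are assumptions; mirror the structure that already exists in mixtures.py.

# Hypothetical sketch of a new data mixture; follow the actual structure in
# ./prismatic/vla/datasets/rlds/oxe/mixtures.py rather than these exact names.
MY_CUSTOM_MIX = [
    ("bridge_orig", 1.0),      # an existing Open X-Embodiment dataset
    ("my_custom_rlds", 2.0),   # your own RLDS-converted dataset, sampled 2x as often
]
# After registering the mixture under a name, point the `data_mix` field of the
# training configuration to that name.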

The latent action model is implemented based on a VQ-VAE. We train it on a data collection comprising robot manipulation, navigation, and human videos. In stage-1 training, we use an overall batch size of 512 and 100k optimization steps to construct task-irrelevant latent actions:

torchrun --standalone --nnodes 1 --nproc-per-node 8 main.py fit \
    --config config/lam-stage-1.yaml \
    2>&1 | tee lam-stage-1.log
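
For intuition, the discrete latent actions come from a standard VQ-style nearest-codebook lookup. The sketch below is plain PyTorch with toy shapes, not the repo's actual module:

import torch

# Toy VQ lookup: C codebook entries of dimension D (shapes are illustrative).
C, D = 16, 128
codebook = torch.randn(C, D)

def quantize(z):
    """Map continuous latents z of shape (B, D) to nearest codebook indices."""
    dists = torch.cdist(z, codebook)   # (B, C) pairwise L2 distances
    indices = dists.argmin(dim=-1)     # nearest code per sample
    return indices, codebook[indices]  # discrete index + quantized latent

z = torch.randn(4, D)                  # e.g. latents inferred from frame pairs
idx, z_q = quantize(z)
print(idx)                             # these indices serve as latent-action labels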

Stage-2 then focuses on learning task-centric latent actions on top of the stage-1 results. Please set stage_one_ckpt in latent_action_model/config/lam-stage-2.yaml to the local path of your stage-1 checkpoint, then run training with:

torchrun --standalone --nnodes 1 --nproc-per-node 8 main.py fit \
    --config config/lam-stage-2.yaml \
    2>&1 | tee lam-stage-2.log

2️⃣ Pretraining of Generalist Policy

  • Latent Action Pseudo-Labeling for Policy Optimization: The trained latent action model is employed to generate pseudo-labels for policy optimization via a next-token prediction objective. Specifically, the indices of inferred latent actions in the VQ-VAE codebook are mapped to dedicated tokens in the LLaMA tokenizer, denoted as {ACT_0, ACT_1, ..., ACT_C} (a minimal sketch of this mapping is given after the training command below).

  • Cost-effective Pre-Training: Full-scale pre-training (combining the OpenX and Ego4D datasets) was conducted on a 32-GPU A100 cluster for 20,000 optimization steps. In contrast, experiments on the 'Bridge' and 'Human' subsets required only 8 A100 GPUs, totaling 200 GPU-hours, which is significantly less compute than prior vision-language-action models.

  • To initiate pre-training, please refer to the following script or simply run bash ./vla-scripts/train.sh:

Note

For pretraining UniVLA only on BridgeV2 or Human (Ego4D) data, please modify vla.type to prism-dinosiglip-224px+mx-bridge(human) correspondingly. Detailed setups can be found in ./prismatic/conf/vla.py.

GPUS_PER_NODE=8  
NNODES=4
MASTER_PORT=${MASTER_PORT:-28596}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
RANK=${RANK:-0}

# Run your training script with torchrun
torchrun --nproc_per_node ${GPUS_PER_NODE} --nnodes ${NNODES} --node_rank ${RANK} --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} train.py \
                                 --vla.type prism-dinosiglip-224px+mx-oxe-magic-soup-plus \
                                 --run_root_dir "vla_log"
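
As a minimal sketch of the latent-action token mapping mentioned above (the token naming and tokenizer path are assumptions; the repo's implementation may differ):

from transformers import AutoTokenizer

# Placeholder tokenizer path; use the tokenizer of the actual policy backbone.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

C = 16  # codebook size, i.e. number of latent actions
action_tokens = [f"ACT_{i}" for i in range(C)]
tokenizer.add_tokens(action_tokens, special_tokens=True)

# A pseudo-label from the latent action model, e.g. codebook index 3, is then
# supervised as the token ACT_3 under the next-token prediction objective.
print(tokenizer.convert_tokens_to_ids(["ACT_3"]))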

3️⃣ Post-training for Deployment & Evaluations

  • With the pretrained generalist policy trained to plan over an embodiment-agnostic action space, we then add embodiment-specific action decoder heads for downstream deployment.
  • Our action decoder is extremely lightweight, with only around 12M parameters. Using parameter-efficient fine-tuning with LoRA rank 32, the total number of trainable parameters is around 123M (see the sketch below).
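
For reference, the sketch below shows how a LoRA rank-32 configuration and a small MLP decoder head might look with the peft library and plain PyTorch. The target modules, hidden sizes, and class names are assumptions for illustration, not the repo's actual implementation.

import torch.nn as nn
from peft import LoraConfig

# LoRA with rank 32; target modules are illustrative (typical LLaMA attention projections).
lora_cfg = LoraConfig(r=32, lora_alpha=16, lora_dropout=0.0,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
# peft.get_peft_model(vla_backbone, lora_cfg) would wrap an existing HF model.

class ActionDecoder(nn.Module):
    """Toy embodiment-specific head mapping latent-action embeddings to robot actions."""
    def __init__(self, hidden_dim=4096, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 512), nn.GELU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, latent_action_embedding):
        return self.net(latent_action_embedding)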

1) LIBERO

Please first download the LIBERO datasets used in our experiments.

Start training with torchrun:

  1. You should first set the pretrained UniVLA and latent action model path in vla_path and lam_path of the training config.
  2. Set your local LIBERO dataset path in data_root_dir.
  3. You can choose dataset_name from libero_spatial_no_noops, libero_object_no_noops, libero_goal_no_noops, and libero_10_no_noops.

We trained on 'Spatial', 'Object', and 'Goal' for 30k steps and on 'Long' for 40k steps. For reproduction, please first set max_steps in the training config accordingly.

# Start training on LIBERO-10 (Long) with 8 GPUs
torchrun --standalone --nnodes 1 --nproc-per-node 8 finetune_libero.py \
                                 --dataset_name "libero_10_no_noops" \
                                 --run_root_dir "libero_log"

Once you have finished training and obtained the action decoder and VLA backbone, you can start evaluation with:

# Start evaluation on LIBERO-10
# By default, we test 50 rollouts per task, totaling 500 independent trials.
# --task_suite_name: choose from [libero_spatial, libero_object, libero_goal, libero_10]
# --save_video: whether to save rollout videos
python experiments/robot/libero/run_libero_eval_decoder.py \
    --task_suite_name libero_10 \
    --action_decoder_path /path/to/your/action_decoder_path.pt \
    --pretrained_checkpoint /path/to/your/libero_10_finetuned_univla \
    --save_video False \
    --seed 7

To be updated.

🚀 UniVLA's Performance

Note

LIBERO Simulation Benchmark Results.

| Model | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long | Average |
|---|---|---|---|---|---|
| Diffusion Policy | 78.3 ± 1.1% (5) | 92.5 ± 0.7% (2) | 68.3 ± 1.2% (5) | 50.5 ± 1.3% (5) | 72.4 ± 0.7% (5) |
| Octo | 78.9 ± 1.0% (4) | 85.7 ± 0.9% (4) | 84.6 ± 0.9% (2) | 51.1 ± 1.3% (4) | 75.1 ± 0.6% (3) |
| OpenVLA | 84.7 ± 0.9% (2) | 88.4 ± 0.8% (3) | 79.2 ± 1.0% (3) | 53.7 ± 1.3% (3) | 76.5 ± 0.6% (2) |
| TraceVLA | 84.6 ± 0.2% (3) | 85.2 ± 0.4% (5) | 75.1 ± 0.3% (4) | 54.1 ± 1.0% (2) | 74.8 ± 0.5% (4) |
| UniVLA (Ours) | 96.5 ± 0.5% (1) | 96.8 ± 0.5% (1) | 95.6 ± 0.4% (1) | 92.0 ± 1.0% (1) | 95.2 ± 0.3% (1) |

Each cell reports the success rate, SR (↑), with the method's rank (↓) in parentheses.

Note

LIBERO Results with Limited Data. (Models are trained with 10%, 20%, 50%, and the full dataset)

| Model | LIBERO-Goal 10% | 20% | 50% | 100% | LIBERO-Long 10% | 20% | 50% | 100% |
|---|---|---|---|---|---|---|---|---|
| ATM | 64.3% | 77.1% | - | - | 36.5% | 39.1% | - | - |
| OpenVLA | 61.4% | 66.0% | 77.0% | 79.2% | 11.6% | 22.4% | 36.6% | 53.7% |
| OpenVLA-OFT | 76.8% | 88.2% | 91.1% | 96.2% | 43.0% | 62.2% | 77.8% | 90.7% |
| UniVLA (Ours) | 86.3% | 90.4% | 93.1% | 95.6% | 62.4% | 71.4% | 87.0% | 92.0% |

Note

Real-world Experiments.

📝 Citation

If you find our code or models useful in your work, please cite our paper:

@article{bu2025univla,
  title={UniVLA: Learning to Act Anywhere with Task-centric Latent Actions}, 
  author={Qingwen Bu and Yanting Yang and Jisong Cai and Shenyuan Gao and Guanghui Ren and Maoqing Yao and Ping Luo and Hongyang Li},
  journal={arXiv preprint arXiv:2505.06111},
  year={2025}
}

Acknowledgements

We thank OpenVLA for their open-sourced work!
