LoveJu1y/LaRA-VLA

[ICML 2026] LaRA-VLA

Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models

Shuanghao Bai*, Jing Lyu*, Wanqi Zhou, Zhe Li, Dakai Wang, Lei Xing, Xiaoguang Zhao, Pengwei Wang, Zhongyuan Wang, Cheng Chi, Badong Chen, Shanghang Zhang

[Figure: LaRA-VLA overview]

LaRA-VLA performs iterative latent reasoning by feeding hidden states back into reasoning slots before action prediction, rather than relying on long explicit chain-of-thought generation.
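
As a rough intuition for the feedback loop described above, the toy sketch below fills a few reserved reasoning slots by writing each new hidden state back into the input sequence, and only then produces a final readout. It is not the LaRA-VLA implementation: `toy_backbone`, the slot layout, and the 2-d vectors are illustrative assumptions standing in for the real VLM backbone and action head.

```python
from typing import List

Vector = List[float]


def toy_backbone(prefix: List[Vector]) -> Vector:
    """Hypothetical stand-in for a causal model.

    Returns a 'hidden state' for the last position, computed here as the
    element-wise mean of all input embeddings seen so far.
    """
    dim = len(prefix[0])
    return [sum(vec[i] for vec in prefix) / len(prefix) for i in range(dim)]


def latent_reasoning(context: List[Vector], num_slots: int) -> List[Vector]:
    """Fill `num_slots` reasoning slots one at a time.

    Each slot's input embedding is the hidden state produced at the previous
    position, so the intermediate 'thought' stays continuous and no text
    tokens are decoded along the way.
    """
    sequence = list(context)  # copy so the caller's context is untouched
    for _ in range(num_slots):
        hidden = toy_backbone(sequence)  # hidden state at the last position
        sequence.append(hidden)          # feed it back as the next slot's input
    return sequence


if __name__ == "__main__":
    # Two made-up embeddings standing in for observation and instruction.
    context = [[1.0, 0.0], [0.0, 1.0]]
    sequence = latent_reasoning(context, num_slots=3)
    slots = sequence[len(context):]
    # A hypothetical action head would read the sequence only after all
    # reasoning slots have been filled.
    action_input = toy_backbone(sequence)
    print(len(slots), action_input)
```

The point of the sketch is the control flow: the number of reasoning steps is fixed by the number of slots, rather than by the length of a generated chain-of-thought string.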

NEWS

  • 🎉 LaRA-VLA has been accepted to ICML 2026.
  • ✅ Training code is released.
  • ✅ Evaluation code is released.
  • ✅ Pretrained model weights are released.
  • ✅ Training datasets are released.

Installation

git clone https://github.com/LoveJu1y/LaRA-VLA
cd LaRA-VLA

conda create -n lara-vla python=3.10 -y
conda activate lara-vla

pip install -r requirements.txt
pip install -e .

Quick Start

1) Basic check

python -c "from laravla.training.train import main; print('OK')"

2) Multi-stage training for VLM

Before launching training, set the dataset roots and model cache path:

export BRIDGE_LEROBOT_ROOT=/path/to/bridge_datasets_parent
export LIBERO_LEROBOT_ROOT=/path/to/libero_lerobot
export HF_HOME=/path/to/qwen_cache

Dataset repos:

Model repos:

Bridge training expects:

${BRIDGE_LEROBOT_ROOT}/bridge_orig_lerobot/
  annotations/
  meta/
  data/
  videos/

The current public Bridge dataset release contains the core annotations and metadata, but does not include the raw videos/ directory. Bridge training will not run unless videos/ is available locally under the structure above.

LIBERO training expects:

${LIBERO_LEROBOT_ROOT}/
  libero_goal_no_noops_1.0.0_lerobot/
  libero_object_no_noops_1.0.0_lerobot/
  libero_spatial_no_noops_1.0.0_lerobot/
  libero_10_no_noops_1.0.0_lerobot/
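
Because a missing directory (for example the Bridge `videos/` folder) only surfaces once training starts, it can help to verify both layouts up front. The helper below is a hypothetical pre-flight check, not part of the repository; it assumes only the environment variables and directory names listed above and tests for directory presence, not for the contents.

```python
import os
from pathlib import Path
from typing import Dict, List

# Sub-directories taken from the layouts listed above.
BRIDGE_SUBDIRS = ["annotations", "meta", "data", "videos"]
LIBERO_SUITES = [
    "libero_goal_no_noops_1.0.0_lerobot",
    "libero_object_no_noops_1.0.0_lerobot",
    "libero_spatial_no_noops_1.0.0_lerobot",
    "libero_10_no_noops_1.0.0_lerobot",
]


def missing_dirs(root: Path, expected: List[str]) -> List[str]:
    """Return the expected sub-directories that are absent under `root`."""
    return [name for name in expected if not (root / name).is_dir()]


def check_layout(env: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each configured dataset variable to its missing sub-directories.

    Variables that are unset or empty are skipped.
    """
    report: Dict[str, List[str]] = {}
    bridge = env.get("BRIDGE_LEROBOT_ROOT")
    if bridge:
        root = Path(bridge) / "bridge_orig_lerobot"
        report["BRIDGE_LEROBOT_ROOT"] = missing_dirs(root, BRIDGE_SUBDIRS)
    libero = env.get("LIBERO_LEROBOT_ROOT")
    if libero:
        report["LIBERO_LEROBOT_ROOT"] = missing_dirs(Path(libero), LIBERO_SUITES)
    return report


if __name__ == "__main__":
    for var, missing in check_layout(dict(os.environ)).items():
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(f"{var}: {status}")
```

Run it in the same shell where the variables were exported; any entry reported as missing needs to be downloaded or linked before launching the scripts below.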

Bridge:

bash scripts/run_bridge_multistage.sh

LIBERO:

bash scripts/run_libero_multistage.sh

3) Single-stage training for VLA

Bridge:

bash scripts/run_laravla_bridge.sh

LIBERO:

bash scripts/run_laravla_libero.sh

Evaluation

LIBERO

The LIBERO results below follow the evaluation workflow documented in examples/LIBERO/README.md.

Results

| CoT Type | Method | Spatial | Goal | Object | Long | Avg |
|---|---|---|---|---|---|---|
| No CoT | OpenVLA (Kim et al., 2025b) | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| No CoT | π₀ (Black et al., 2024) | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| No CoT | OpenVLA-OFT (Kim et al., 2025a) | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| Textual CoT | ThinkAct (Huang et al., 2025) | 88.3 | 91.4 | 87.1 | 70.9 | 84.4 |
| Textual CoT | MolmoAct (Lee et al., 2025) | 87.0 | 95.4 | 87.6 | 77.2 | 86.6 |
| Textual CoT | π₀.₅ (Intelligence et al., 2025) | 98.8 | 98.2 | 98.0 | 92.4 | 96.8 |
| Textual CoT | DeepThinkVLA (Yin et al., 2025) | 99.0 | 96.6 | 96.4 | 96.2 | 97.0 |
| Visual CoT | CoT-VLA (Zhao et al., 2025) | 87.5 | 91.6 | 87.6 | 69.0 | 81.1 |
| Visual CoT | DreamVLA (Zhang et al., 2025b) | 97.5 | 94.0 | 89.5 | 89.5 | 92.6 |
| Visual CoT | F1 (Lv et al., 2025) | 98.2 | 97.8 | 95.4 | 91.3 | 95.7 |
| Visual CoT | UD-VLA (Chen et al., 2025b) | 94.1 | 95.7 | 91.2 | 89.6 | 92.7 |
| Latent CoT | Fast-ThinkAct (Huang et al., 2026) | 92.0 | 97.2 | 90.2 | 79.4 | 89.7 |
| Latent CoT | LaRA-VLA (Ours) | 96.4 | 98.6 | 99.8 | 96.6 | 97.9 |
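
To make the Avg column concrete, the short check below recomputes it for three rows of the table above. It assumes that Avg is the unweighted mean of the four suite success rates, which is consistent with these rows but is not stated explicitly in this README; the tolerance allows for rounding to one decimal place.

```python
from statistics import mean

# Per-suite success rates (%) copied from the table above, in column order
# (Spatial, Goal, Object, Long), followed by the reported Avg.
rows = {
    "OpenVLA-OFT": ((97.6, 98.4, 97.9, 94.5), 97.1),
    "DeepThinkVLA": ((99.0, 96.6, 96.4, 96.2), 97.0),
    "LaRA-VLA": ((96.4, 98.6, 99.8, 96.6), 97.9),
}


def matches_reported(scores, reported, tol=0.06):
    """True if the unweighted mean agrees with the reported Avg up to rounding."""
    return abs(mean(scores) - reported) <= tol


if __name__ == "__main__":
    for name, (scores, reported) in rows.items():
        print(
            f"{name}: mean={mean(scores):.2f} reported={reported} "
            f"consistent={matches_reported(scores, reported)}"
        )
```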

SimplerEnv

The Bridge results below are evaluated through the SimplerEnv-based pipeline documented in examples/SimplerEnv/README.md.

Results

| CoT Type | Method | Put Spoon | Put Carrot | Stack Block | Put Eggplant | Avg |
|---|---|---|---|---|---|---|
| No CoT | OpenVLA (Kim et al., 2025b) | 0.0 | 0.0 | 0.0 | 4.1 | 1.0 |
| No CoT | Octo (Ghosh et al., 2024) | 47.2 | 9.7 | 4.2 | 56.9 | 29.5 |
| No CoT | OpenVLA-OFT (Kim et al., 2025a) | 12.5 | 4.2 | 8.3 | 37.5 | 39.6 |
| No CoT | π₀ (Black et al., 2024) | 29.1 | 0.0 | 16.7 | 62.5 | 40.1 |
| No CoT | CogACT (Li et al., 2024) | 71.7 | 50.8 | 15.0 | 67.5 | 51.3 |
| Textual CoT | ThinkAct (Huang et al., 2025) | 58.3 | 37.5 | 8.7 | 70.8 | 43.8 |
| Visual CoT | F1 (Lv et al., 2025) | 50.0 | 70.8 | 50.0 | 66.7 | 59.4 |
| Visual CoT | UD-VLA (Chen et al., 2025b) | 58.3 | 62.5 | 54.1 | 75.0 | 62.5 |
| Latent CoT | LaRA-VLA (Ours) | 95.8 | 62.5 | 25.0 | 91.7 | 68.8 |

Acknowledgments

Our code builds on the open-source StarVLA codebase, and incorporates ideas and components from Coconut and ECOT (Embodied Chain-of-Thought).

Citation

@article{bai2026latentreasoningvla,
  title={Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models},
  author={Bai, Shuanghao and Lyu, Jing and Zhou, Wanqi and Li, Zhe and Wang, Dakai and Xing, Lei and Zhao, Xiaoguang and Wang, Pengwei and Wang, Zhongyuan and Chi, Cheng and Chen, Badong and Zhang, Shanghang},
  journal={arXiv preprint arXiv:2602.01166},
  year={2026}
}

License

Released under the MIT License. See LICENSE.
