
🤖 VLA-Arena: A Comprehensive Benchmark for Vision-Language-Action Models


VLA-Arena is an open-source benchmark for the systematic evaluation of Vision-Language-Action (VLA) models. It provides a full toolchain covering scene modeling, demonstration collection, and model training and evaluation. It features 150+ tasks across 11 specialized suites, hierarchical difficulty levels (L0–L2), and comprehensive metrics for assessing safety, generalization, and efficiency.

VLA-Arena focuses on four key domains:

  • Safety: Operate reliably and safely in the physical world.
  • Distractors: Maintain stable performance when facing environmental unpredictability.
  • Extrapolation: Generalize learned knowledge to novel situations.
  • Long Horizon: Combine long sequences of actions to achieve a complex goal.

📰 News

2025.09.29: VLA-Arena is officially released!

🔥 Highlights

  • 🚀 End-to-End & Out-of-the-Box: We provide a complete and unified toolchain covering everything from scene modeling and behavior collection to model training and evaluation. Paired with comprehensive docs and tutorials, you can get started in minutes.
  • 🔌 Plug-and-Play Evaluation: Seamlessly integrate and benchmark your own VLA models. Our framework is designed with a unified API, making the evaluation of new architectures straightforward with minimal code changes.
  • 🛠️ Effortless Task Customization: Leverage the Constrained Behavior Domain Definition Language (CBDDL) to rapidly define entirely new tasks and safety constraints. Its declarative nature allows you to achieve comprehensive scenario coverage with minimal effort.
  • 📊 Systematic Difficulty Scaling: Systematically assess model capabilities across three distinct difficulty levels (L0→L1→L2). Isolate specific skills and pinpoint failure points, from basic object manipulation to complex, long-horizon tasks.

If you find VLA-Arena useful, please cite it in your publications.

@misc{vla-arena2025,
  title={VLA-Arena},
  author={Jiahao Li and Borong Zhang and Jiachen Shen and Jiaming Ji and Yaodong Yang},
  howpublished={\url{https://github.com/PKU-Alignment/VLA-Arena}},
  year={2025}
}


Quick Start

1. Installation

Install from PyPI (Recommended)

# 1. Install VLA-Arena
pip install vla-arena

# 2. Download task suites (required)
vla-arena.download-tasks install-all --repo vla-arena/tasks

📦 Important: To reduce the PyPI package size, task suites and asset files (~850 MB) must be downloaded separately after installation.

Install from Source

# Clone repository (includes all tasks and assets)
git clone https://github.com/PKU-Alignment/VLA-Arena.git
cd VLA-Arena

# Create environment
conda create -n vla-arena python=3.11
conda activate vla-arena

# Install VLA-Arena
pip install -e .

Notes

  • On Windows, the mujoco.dll file may be missing from the robosuite/utils directory; it can be copied from the mujoco package (mujoco/mujoco.dll).
  • On Windows, you also need to change the MuJoCo rendering backend in robosuite\utils\binding_utils.py:
    if _SYSTEM == "Darwin":
      os.environ["MUJOCO_GL"] = "cgl"
    else:
      os.environ["MUJOCO_GL"] = "wgl"    # Change "egl" to "wgl"

2. Data Collection

# Collect demonstration data
python scripts/collect_demonstration.py --bddl-file tasks/your_task.bddl

This opens an interactive simulation environment where you can control the robotic arm with the keyboard to complete the task specified in the BDDL file.

3. Model Fine-tuning and Evaluation

โš ๏ธ Important: We recommend creating separate conda environments for different models to avoid dependency conflicts. Each model may have different requirements.

# Create a dedicated environment for the model
conda create -n [model_name]_vla_arena python=3.11 -y
conda activate [model_name]_vla_arena

# Install VLA-Arena and model-specific dependencies
pip install -e .
pip install "vla-arena[model_name]"    # Quote the extras to avoid shell globbing (e.g. in zsh)

# Fine-tune a model (e.g., OpenVLA)
vla-arena train --model openvla --config vla_arena/configs/train/openvla.yaml

# Evaluate a model
vla-arena eval --model openvla --config vla_arena/configs/evaluation/openvla.yaml

Note: OpenPi requires a different setup process using uv for environment management. Please refer to the Model Fine-tuning and Evaluation Guide for detailed OpenPi installation and training instructions.

Task Suites Overview

VLA-Arena provides 11 specialized task suites with 150+ tasks total, organized into four domains:

๐Ÿ›ก๏ธ Safety (5 suites, 75 tasks)

Suite Description L0 L1 L2 Total
static_obstacles Static collision avoidance 5 5 5 15
cautious_grasp Safe grasping strategies 5 5 5 15
hazard_avoidance Hazard area avoidance 5 5 5 15
state_preservation Object state preservation 5 5 5 15
dynamic_obstacles Dynamic collision avoidance 5 5 5 15

🔄 Distractor (2 suites, 30 tasks)

| Suite | Description | L0 | L1 | L2 | Total |
|-------|-------------|----|----|----|-------|
| static_distractors | Cluttered scene manipulation | 5 | 5 | 5 | 15 |
| dynamic_distractors | Dynamic scene manipulation | 5 | 5 | 5 | 15 |

🎯 Extrapolation (3 suites, 45 tasks)

| Suite | Description | L0 | L1 | L2 | Total |
|-------|-------------|----|----|----|-------|
| preposition_combinations | Spatial relationship understanding | 5 | 5 | 5 | 15 |
| task_workflows | Multi-step task planning | 5 | 5 | 5 | 15 |
| unseen_objects | Unseen object recognition | 5 | 5 | 5 | 15 |

📈 Long Horizon (1 suite, 20 tasks)

| Suite | Description | L0 | L1 | L2 | Total |
|-------|-------------|----|----|----|-------|
| long_horizon | Long-horizon task planning | 10 | 5 | 5 | 20 |

Difficulty Levels:

  • L0: Basic tasks with clear objectives
  • L1: Intermediate tasks with increased complexity
  • L2: Advanced tasks with challenging scenarios
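As a sanity check on the counts above, the suite and task totals can be tallied directly; the per-level counts below are copied from the tables in this README:

```python
# Task counts per suite, copied from the tables above: (L0, L1, L2).
SUITES = {
    "safety": {
        "static_obstacles": (5, 5, 5),
        "cautious_grasp": (5, 5, 5),
        "hazard_avoidance": (5, 5, 5),
        "state_preservation": (5, 5, 5),
        "dynamic_obstacles": (5, 5, 5),
    },
    "distractor": {
        "static_distractors": (5, 5, 5),
        "dynamic_distractors": (5, 5, 5),
    },
    "extrapolation": {
        "preposition_combinations": (5, 5, 5),
        "task_workflows": (5, 5, 5),
        "unseen_objects": (5, 5, 5),
    },
    "long_horizon": {
        "long_horizon": (10, 5, 5),
    },
}

# Total number of suites and tasks across all domains.
n_suites = sum(len(suites) for suites in SUITES.values())
n_tasks = sum(sum(levels) for suites in SUITES.values() for levels in suites.values())
print(n_suites, n_tasks)  # → 11 170
```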

๐Ÿ›ก๏ธ Safety Suites Visualization

Suite Name L0 L1 L2
Static Obstacles
Cautious Grasp
Hazard Avoidance
State Preservation
Dynamic Obstacles

๐Ÿ”„ Distractor Suites Visualization

Suite Name L0 L1 L2
Static Distractors
Dynamic Distractors

๐ŸŽฏ Extrapolation Suites Visualization

Suite Name L0 L1 L2
Preposition Combinations
Task Workflows
Unseen Objects

๐Ÿ“ˆ Long Horizon Suite Visualization

Suite Name L0 L1 L2
Long Horizon

Installation

System Requirements

  • OS: Ubuntu 20.04+ or macOS 12+
  • Python: 3.11 or higher
  • CUDA: 11.8+ (for GPU acceleration)

Installation Steps

# Clone repository
git clone https://github.com/PKU-Alignment/VLA-Arena.git
cd VLA-Arena

# Create environment
conda create -n vla-arena python=3.11
conda activate vla-arena

# Install dependencies
pip install --upgrade pip
pip install -e .

Documentation

VLA-Arena provides comprehensive documentation for all aspects of the framework. Choose the guide that best fits your needs:

📖 Core Guides

Build custom task scenarios using CBDDL (Constrained Behavior Domain Definition Language).

  • CBDDL file structure and syntax
  • Region, fixture, and object definitions
  • Moving objects with various motion types (linear, circular, waypoint, parabolic)
  • Initial and goal state specifications
  • Cost constraints and safety predicates
  • Image effect settings
  • Asset management and registration
  • Scene visualization tools

Collect demonstrations in custom scenes and convert data formats.

  • Interactive simulation environment with keyboard controls
  • Demonstration data collection workflow
  • Data format conversion (HDF5 to training dataset)
  • Dataset regeneration (filtering noops and optimizing trajectories)
  • Convert dataset to RLDS format (for X-embodiment frameworks)
  • Convert RLDS dataset to LeRobot format (for Hugging Face LeRobot)
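For the dataset-regeneration step, the idea of filtering no-op actions can be sketched as follows. This is an illustrative, self-contained example, not VLA-Arena's actual implementation; the delta-action convention and the threshold value are assumptions:

```python
def filter_noops(actions, threshold=1e-3):
    """Drop steps whose action is effectively a no-op.

    `actions` is a list of action vectors (lists of floats). A step is
    treated as a no-op when every component is smaller in magnitude than
    `threshold`. Both the format and the threshold are illustrative.
    """
    kept = []
    for action in actions:
        # Keep the step if at least one component is meaningfully nonzero.
        if any(abs(x) >= threshold for x in action):
            kept.append(action)
    return kept

# A toy trajectory: steps 1 and 3 are near-zero and get filtered out.
traj = [[0.0, 0.0], [0.02, -0.01], [0.0, 0.0005], [0.1, 0.0]]
print(len(filter_noops(traj)))  # → 2
```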

Fine-tune and evaluate VLA models using VLA-Arena generated datasets.

  • General models (OpenVLA, OpenVLA-OFT, UniVLA, SmolVLA): Simple installation and training workflow
  • OpenPi: Special setup using uv for environment management
  • Model-specific installation instructions (pip install vla-arena[model_name])
  • Training configuration and hyperparameter settings
  • Evaluation scripts and metrics
  • Policy server setup for inference (OpenPi)

🔜 Quick Reference

Fine-tuning Scripts

Documentation Index

  • English: README_EN.md - Complete English documentation index
  • Chinese: README_ZH.md - Complete Chinese documentation index

📦 Download Task Suites

Method 1: Using CLI Tool (Recommended)

After installation, you can use the following commands to view and download task suites:

# View installed tasks
vla-arena.download-tasks installed

# List available task suites
vla-arena.download-tasks list --repo vla-arena/tasks

# Install a single task suite
vla-arena.download-tasks install robustness_dynamic_distractors --repo vla-arena/tasks

# Install all task suites (recommended)
vla-arena.download-tasks install-all --repo vla-arena/tasks

Method 2: Using Python Script

# View installed tasks
python -m scripts.download_tasks installed

# Install all tasks
python -m scripts.download_tasks install-all --repo vla-arena/tasks

🔧 Custom Task Repository

If you want to use your own task repository:

# Use custom HuggingFace repository
vla-arena.download-tasks install-all --repo your-username/your-task-repo

๐Ÿ“ Create and Share Custom Tasks

You can create and share your own task suites:

# Package a single task
vla-arena.manage-tasks pack path/to/task.bddl --output ./packages

# Package all tasks
python scripts/package_all_suites.py --output ./packages

# Upload to HuggingFace Hub
vla-arena.manage-tasks upload ./packages/my_task.vlap --repo your-username/your-repo

Leaderboard

Performance Evaluation of VLA Models on the VLA-Arena Benchmark

We compare six models across four dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Performance trends over the three difficulty levels (L0–L2) are shown on a unified scale (0.0–1.0) for cross-model comparison. Safety tasks report both success rate (SR) and cumulative cost (CC, shown in parentheses); all other tasks report SR only.
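The two safety metrics can be aggregated from per-episode results. The sketch below assumes a simple record format with a boolean "success" flag and a list of per-step safety costs; these field names are hypothetical, not VLA-Arena's API:

```python
def aggregate(episodes):
    """Compute success rate (SR) and mean cumulative cost (CC).

    `episodes` is a list of dicts with a boolean "success" and a list of
    per-step safety costs "costs". Both field names are illustrative.
    """
    sr = sum(e["success"] for e in episodes) / len(episodes)
    cc = sum(sum(e["costs"]) for e in episodes) / len(episodes)
    return sr, cc

# Toy example: one safe success, one costly failure.
episodes = [
    {"success": True, "costs": [0.0, 0.0]},
    {"success": False, "costs": [1.0, 2.0, 1.0]},
]
sr, cc = aggregate(episodes)
print(sr, cc)  # → 0.5 2.0
```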

๐Ÿ›ก๏ธ Safety Performance

Task OpenVLA OpenVLA-OFT ฯ€โ‚€ ฯ€โ‚€-FAST UniVLA SmolVLA
StaticObstacles
L0 1.00 (CC: 0.0) 1.00 (CC: 0.0) 0.98 (CC: 0.0) 1.00 (CC: 0.0) 0.84 (CC: 0.0) 0.14 (CC: 0.0)
L1 0.60 (CC: 8.2) 0.20 (CC: 45.4) 0.74 (CC: 8.0) 0.40 (CC: 56.0) 0.42 (CC: 9.7) 0.00 (CC: 8.8)
L2 0.00 (CC: 38.2) 0.20 (CC: 49.0) 0.32 (CC: 28.1) 0.20 (CC: 6.8) 0.18 (CC: 60.6) 0.00 (CC: 2.6)
CautiousGrasp
L0 0.80 (CC: 6.6) 0.60 (CC: 3.3) 0.84 (CC: 3.5) 0.64 (CC: 3.3) 0.80 (CC: 3.3) 0.52 (CC: 2.8)
L1 0.40 (CC: 120.2) 0.50 (CC: 6.3) 0.08 (CC: 16.4) 0.06 (CC: 15.6) 0.60 (CC: 52.1) 0.28 (CC: 30.7)
L2 0.00 (CC: 50.1) 0.00 (CC: 2.1) 0.00 (CC: 0.5) 0.00 (CC: 1.0) 0.00 (CC: 8.5) 0.04 (CC: 0.3)
HazardAvoidance
L0 0.20 (CC: 17.2) 0.36 (CC: 9.4) 0.74 (CC: 6.4) 0.16 (CC: 10.4) 0.70 (CC: 5.3) 0.16 (CC: 10.4)
L1 0.02 (CC: 22.8) 0.00 (CC: 22.9) 0.00 (CC: 16.8) 0.00 (CC: 15.4) 0.12 (CC: 18.3) 0.00 (CC: 19.5)
L2 0.20 (CC: 15.7) 0.20 (CC: 14.7) 0.00 (CC: 15.6) 0.20 (CC: 13.9) 0.04 (CC: 16.7) 0.00 (CC: 18.0)
StatePreservation
L0 1.00 (CC: 0.0) 1.00 (CC: 0.0) 0.98 (CC: 0.0) 0.60 (CC: 0.0) 0.90 (CC: 0.0) 0.50 (CC: 0.0)
L1 0.66 (CC: 6.6) 0.76 (CC: 7.6) 0.64 (CC: 6.4) 0.56 (CC: 5.6) 0.76 (CC: 7.6) 0.18 (CC: 1.8)
L2 0.34 (CC: 21.0) 0.20 (CC: 4.6) 0.48 (CC: 15.8) 0.20 (CC: 4.2) 0.54 (CC: 16.4) 0.08 (CC: 9.6)
DynamicObstacles
L0 0.60 (CC: 3.6) 0.80 (CC: 8.8) 0.92 (CC: 6.0) 0.80 (CC: 3.6) 0.26 (CC: 7.1) 0.32 (CC: 2.1)
L1 0.60 (CC: 5.1) 0.56 (CC: 3.7) 0.64 (CC: 3.3) 0.30 (CC: 8.8) 0.58 (CC: 16.3) 0.24 (CC: 16.6)
L2 0.26 (CC: 5.6) 0.10 (CC: 1.8) 0.10 (CC: 40.2) 0.00 (CC: 21.2) 0.08 (CC: 6.0) 0.02 (CC: 0.9)

🔄 Distractor Performance

| Task | OpenVLA | OpenVLA-OFT | π₀ | π₀-FAST | UniVLA | SmolVLA |
|------|---------|-------------|----|---------|--------|---------|
| StaticDistractors | | | | | | |
| L0 | 0.80 | 1.00 | 0.92 | 1.00 | 1.00 | 0.54 |
| L1 | 0.20 | 0.00 | 0.02 | 0.22 | 0.12 | 0.00 |
| L2 | 0.00 | 0.20 | 0.02 | 0.00 | 0.00 | 0.00 |
| DynamicDistractors | | | | | | |
| L0 | 0.60 | 1.00 | 0.78 | 0.80 | 0.78 | 0.42 |
| L1 | 0.58 | 0.54 | 0.70 | 0.28 | 0.54 | 0.30 |
| L2 | 0.40 | 0.40 | 0.18 | 0.04 | 0.04 | 0.00 |

🎯 Extrapolation Performance

| Task | OpenVLA | OpenVLA-OFT | π₀ | π₀-FAST | UniVLA | SmolVLA |
|------|---------|-------------|----|---------|--------|---------|
| PrepositionCombinations | | | | | | |
| L0 | 0.68 | 0.62 | 0.76 | 0.14 | 0.50 | 0.20 |
| L1 | 0.04 | 0.18 | 0.10 | 0.00 | 0.02 | 0.00 |
| L2 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 |
| TaskWorkflows | | | | | | |
| L0 | 0.82 | 0.74 | 0.72 | 0.24 | 0.76 | 0.32 |
| L1 | 0.20 | 0.00 | 0.00 | 0.00 | 0.04 | 0.04 |
| L2 | 0.16 | 0.00 | 0.00 | 0.00 | 0.20 | 0.00 |
| UnseenObjects | | | | | | |
| L0 | 0.80 | 0.60 | 0.80 | 0.00 | 0.34 | 0.16 |
| L1 | 0.60 | 0.40 | 0.52 | 0.00 | 0.76 | 0.18 |
| L2 | 0.00 | 0.20 | 0.04 | 0.00 | 0.16 | 0.00 |

📈 Long Horizon Performance

| Task | OpenVLA | OpenVLA-OFT | π₀ | π₀-FAST | UniVLA | SmolVLA |
|------|---------|-------------|----|---------|--------|---------|
| LongHorizon | | | | | | |
| L0 | 0.80 | 0.80 | 0.92 | 0.62 | 0.66 | 0.74 |
| L1 | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 |
| L2 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

License

This project is licensed under the Apache 2.0 license - see LICENSE for details.

Acknowledgments

  • RoboSuite, LIBERO, and VLABench teams for the framework
  • OpenVLA, UniVLA, OpenPi, and LeRobot teams for pioneering VLA research
  • All contributors and the robotics community

VLA-Arena: Advancing Vision-Language-Action Models Through Comprehensive Evaluation
Made with โค๏ธ by the VLA-Arena Team
