
Releases: allenai/vla-evaluation-harness

v0.1.0

05 Apr 15:13


We're excited to release v0.1.0 — featuring systematic reproduction of published VLA benchmark scores across 6 models and 3 benchmarks with a single unified harness.

🔬 Reproduction Results

6 VLA codebases reproduced across 3 benchmarks using only their public checkpoints and this harness — no benchmark-specific forks, no manual glue code.

| Codebase | LIBERO (%) | CALVIN (len) | SimplerEnv (%) |
|---|---|---|---|
| OpenVLA | 76.2 (−0.3) | — | — |
| π₀.₅ | 97.7 (+0.8) | — | — |
| OpenVLA-OFT | 96.7 (−0.4) | — | — |
| GR00T N1.6 | 94.9 (−2.1)† | — | 59.7 (−8.0)‡ |
| DB-CogACT | 94.7 (−0.2) | 4.02 (−0.04) | 63.5 (−6.0) |
| X-VLA | 97.4 (−0.7) | 4.30 (−0.13) | 94.8 (−1.0) |

Values show our score (Δ vs. reported). — = no public checkpoint. †Community checkpoint. ‡See reproduction notes.

Full per-task breakdowns and reproduction logs: docs/reproductions/.

DimSpec: Convention Validation at Startup (#19)

Every model server and benchmark now declares its action/observation format (dimensions, rotation convention, coordinate frame) via a DimSpec. The orchestrator cross-validates these specs during the HELLO handshake — catching mismatches before any GPU time is spent.
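
A minimal sketch of the idea, assuming hypothetical field and function names (the harness's actual DimSpec may differ):

from dataclasses import dataclass

@dataclass(frozen=True)
class DimSpec:
    action_dim: int      # e.g. 7 for xyz + rotation + gripper
    rotation: str        # "euler", "axangle", or "rot6d"
    frame: str           # e.g. "base" or "world"

def check_compatible(model: DimSpec, benchmark: DimSpec) -> None:
    # Called during the HELLO handshake, before any episodes are scheduled.
    for field in ("action_dim", "rotation", "frame"):
        m, b = getattr(model, field), getattr(benchmark, field)
        if m != b:
            raise ValueError(f"DimSpec mismatch on {field!r}: model={m}, benchmark={b}")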

SimplerEnv Benchmark Rewrite (#25)

SimplerEnv has been rewritten to use simpler_env.make(task_name) with prepackaged_config, replacing the previous manual env setup. Image resize responsibility moved from benchmarks to model servers — benchmarks now send native resolution. 5 models reproduced after the rewrite.
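
For reference, task construction with the upstream SimplerEnv API looks roughly like this (task name illustrative):

import simpler_env

# Task name illustrative; the prepackaged config is applied inside simpler_env.make
env = simpler_env.make("google_robot_pick_coke_can")
obs, reset_info = env.reset()
# obs carries the native-resolution camera image; resizing now happens on the model-server side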

New Benchmark

  • RoboMME (#12, @alohays): 16 memory-augmented manipulation tasks across 4 cognitive suites (Counting, Permanence, Reference, Imitation). The first community-contributed benchmark.

Model Server Fixes

  • X-VLA: rot6d convention for LIBERO, euler→axangle for SimplerEnv delta control, absolute EE control, base-relative EE pose computation
  • GR00T: accumulate_success + overlay removal, prepackaged_config, base-relative EE state, replace transforms3d with local rotation.py
  • starVLA: configurable unnormalization — minmax vs q99 (#23, @junhalee-sqzb); logging hijack fix (#17, @junhalee-sqzb)
  • DB-CogACT: SimplerEnv image resize regression fix
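
The euler to axis-angle conversion used for X-VLA's SimplerEnv delta control can be expressed with SciPy; a minimal sketch, with the euler convention stated as an assumption:

import numpy as np
from scipy.spatial.transform import Rotation

def euler_to_axangle(rpy: np.ndarray) -> np.ndarray:
    # Assumes intrinsic xyz euler angles in radians; the convention the harness
    # actually uses may differ.
    return Rotation.from_euler("xyz", rpy).as_rotvec()   # axis * angle, in radians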

Orchestrator & Infrastructure

  • Error isolation: failure_reason + failure_detail (with traceback) on every episode error; infra errors excluded from success metrics
  • Filelock: shard result file collision prevention for parallel evaluations; try/finally lock release guarantee
  • Progress monitoring: .progress files for live shard tracking
  • Reconnect HELLO: re-establish protocol on implicit WebSocket reconnect
  • CLI: --output-dir override, JSON string parsing for list/dict arguments
  • Orchestrator refactor: deduplicate output paths, error handlers, and progress tracking
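
The filelock pattern above, sketched with the filelock package (paths and record format illustrative):

import json
from filelock import FileLock

def append_shard_result(result_path: str, episode_record: dict) -> None:
    lock = FileLock(result_path + ".lock")
    lock.acquire()
    try:
        # Read-modify-write under the lock so parallel shards never clobber each other.
        try:
            with open(result_path) as f:
                records = json.load(f)
        except FileNotFoundError:
            records = []
        records.append(episode_record)
        with open(result_path, "w") as f:
            json.dump(records, f, indent=2)
    finally:
        lock.release()   # released even if the write raises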

Leaderboard & Frontend

  • Mobile performance (#13): 50-row pagination + shared tooltip replaces 13,800 DOM elements — ~90% node reduction
  • Data update (#18): 17 benchmarks, 512 models, 661 results

Documentation

  • Structured reproduction results under docs/reproductions/
  • Cross-benchmark pipeline verification audit
  • Common reproduction pitfalls guide (unnormalization modes, internal forks, Docker rebuild)

Contributors

@MilkClouds, @alohays, @junhalee-sqzb

v0.0.3

22 Mar 12:27


Release v0.0.3

Highlights

Critical Fix: Python 3.8 Benchmark Compatibility (#9)

functools.cache (Python 3.9+) has been replaced with functools.lru_cache, fixing a runtime crash that prevented all Python 3.8 benchmark Docker images from working in v0.0.2.

While v0.0.2 fixed the dict[str, Any] / TypeAlias import errors from v0.0.1, it inadvertently introduced a new Python 3.8 incompatibility via functools.cache. This release resolves the issue: all 6 affected benchmark images (LIBERO, LIBERO-Pro, LIBERO-Mem, CALVIN, RoboCerebra, RLBench) now run correctly on Python 3.8.
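
For reference, functools.cache is simply lru_cache with an unbounded size, so the replacement is a drop-in (helper name hypothetical):

from functools import lru_cache

@lru_cache(maxsize=None)   # Python 3.8-compatible equivalent of functools.cache
def load_task_metadata(name: str) -> str:
    # Hypothetical cached helper; the real call sites live in the benchmark images.
    return name.lower()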

Important

If you are running Python 3.8 benchmarks, upgrade to v0.0.3. Both v0.0.1 and v0.0.2 are broken for these images.

Log of benchmark smoke test
$ vla-eval test --benchmark --parallel
vla-eval smoke tests
========================================

BENCHMARK
  ✓ libero_pro              success_rate=0%                                26.5s
  ✓ libero                  success_rate=0%                                27.3s
  ✓ calvin                  success_rate=0%                                21.3s
  ✓ libero_mem              success_rate=0%                                25.7s
  ✓ maniskill2              success_rate=0%                                14.7s
  ✓ simpler                 success_rate=0%                                33.0s
  ✓ robocasa                success_rate=0%                                25.1s
  ✓ mikasa                  success_rate=0%                                16.8s
  ✓ vlabench                success_rate=0%                                33.4s
  ✓ rlbench                 success_rate=0%                                11.7s
  ✓ robocerebra             success_rate=0%                                22.2s
  ✓ robotwin                completed (no result file)                     86.3s
  ✓ kinetix                 success_rate=0%                                58.1s

========================================
Results: 13 passed, 0 failed, 0 skipped    total: 402.1s

Benchmark Protocol Audit (#7)

All 17 benchmark protocols have been audited and corrected against their source papers:

  • MIKASA-Robo: restored the standard 5-task protocol with paper-verified scores
  • SimplerEnv: fixed data integrity issues and added missing Xiaomi-Robotics-0 results
  • Leaderboard data: sorted results.json by (benchmark, model) for consistency

Bug Fixes

  • #9 Replace functools.cache with lru_cache for Python 3.8 compatibility
  • #7 Audit and fix benchmark protocols across all 17 benchmarks
  • #6 Fix SimplerEnv data integrity + add missing Xiaomi-Robotics-0 results
  • #3 Fix unnorm actions for StarVLA (@junha-l)

Leaderboard & CI

  • Added GitHub labeler workflow for leaderboard file changes
  • CI now includes citations.json with weekly leaderboard sync
  • Added validate.py sort/format validation with --fix option (#5)
  • New leaderboard section in README with badge, dedicated README, and GitHub link

Docs & Infra

  • Added issue and pull request templates
  • Enhanced CONTRIBUTING.md with detailed contribution types
  • Added .gitignore and .python-version files

Contributors

@MilkClouds, @junha-l

v0.0.2

20 Mar 03:14


Release v0.0.2

Highlights

vla-eval test — Unified Smoke Test CLI

The three separate commands (validate, test-server, test-benchmark) have been replaced with a single vla-eval test entry point.

vla-eval test                # config validation only (fast, safe default)
vla-eval test --all          # run all categories
vla-eval test --server       # model server smoke tests only
vla-eval test --benchmark    # benchmark smoke tests only
vla-eval test --list         # show inventory + readiness
  • Parallel GPU execution: --parallel [N] with automatic GPU slot allocation
  • Graceful Ctrl+C: prints partial results + auto-saves stderr logs to results/smoke-logs/
  • Smoke step cap: max_steps=50 prevents timeouts on slow benchmarks
  • --fail-fast, --list readiness overview

Docker Dev Mode

All 13 benchmark Dockerfiles now use editable install (uv pip install -e .). The new --dev flag on vla-eval run bind-mounts host src/ into the container, enabling rapid iteration without image rebuilds.

vla-eval run --config configs/libero_smoke_test.yaml --dev

Rich CLI Output

Colored pass/fail/skip symbols, category headers, and status lines across all CLI output. Auto-disables when piped (NO_COLOR, isatty, TERM=dumb). rich is lazy-imported to keep CLI startup fast.

Bug Fixes

  • Python 3.8 compatibility: dict[str, Any] → Dict[str, Any], TypeAlias guarded behind TYPE_CHECKING; fixes import errors in the 6 benchmark Docker images that run Python 3.8 (LIBERO, LIBERO-Pro, LIBERO-Mem, CALVIN, RoboCerebra, RLBench)
  • RLBench Dockerfile: extract BuildKit-only heredoc to standalone rlbench_entrypoint.sh
  • RLBench entrypoint: replace sleep with Xvfb socket polling for reliable startup

CI & Leaderboard

  • Merged redundant sync-external.yml into update-data.yml
  • Added --source filter input to update-data workflow dispatch
  • Fixed unquoted ${{ }} YAML expressions that broke workflow parsing
  • All leaderboard scripts now write structured summaries to $GITHUB_OUTPUT for informative PR bodies
  • Synced external leaderboard scores (RoboChallenge + RoboArena)

Breaking Changes

  • vla-eval validate, vla-eval test-server, vla-eval test-benchmark have been removed. Use vla-eval test instead.

Contributors

@MilkClouds

v0.0.1

20 Mar 02:23


Release v0.0.1: Initial Release

vla-evaluation-harness — One framework to evaluate any VLA model on any robot simulation benchmark.


Caution

Python 3.8 benchmark images are broken in this release. Use v0.0.2 instead.

types.py uses dict[str, Any] (PEP 585) and unguarded TypeAlias, which are not available in Python 3.8. This causes an immediate import error in 6 out of 14 benchmark Docker images that run Python 3.8:

| Affected Benchmark | Docker Image | Python |
|---|---|---|
| LIBERO (Spatial / Goal / Object / 10 / 90) | libero | 3.8 |
| LIBERO-Pro | libero-pro | 3.8 |
| LIBERO-Mem | libero-mem | 3.8 |
| CALVIN | calvin | 3.8 |
| RoboCerebra | robocerebra | 3.8 |
| RLBench | rlbench | 3.8 |
Benchmarks on Python 3.10+ (SimplerEnv, ManiSkill2, VLABench, RoboTwin, RoboCasa, MIKASA-Robo) and 3.11 (Kinetix) are unaffected.

Fixed in v0.0.2 via PR #2.


Highlights

This is the first public release of vla-eval, a unified evaluation framework for Vision-Language-Action (VLA) models. Integrate a model once, integrate a benchmark once — the full cross-evaluation matrix fills itself.

Complete Decoupling of Models and Benchmarks

  • Benchmarks run inside Docker containers with pinned environments.
  • Model servers run on the host with GPU access as standalone uv scripts — zero manual setup.
  • The two communicate over WebSocket + msgpack — a compact binary protocol with numpy array encoding and security-hardened deserialization.

No more private eval forks per benchmark. Model code never touches benchmark dependencies and vice versa.

Model Server (host, GPU)  ←── WebSocket/msgpack ──→  Benchmark (Docker container)
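
A rough illustration of numpy-aware msgpack encoding; the harness's actual hooks, key names, and security hardening are not shown here:

import msgpack
import numpy as np

def encode(obj):
    # Serialize numpy arrays as raw bytes plus dtype/shape metadata.
    if isinstance(obj, np.ndarray):
        return {"__ndarray__": obj.tobytes(), "dtype": str(obj.dtype), "shape": list(obj.shape)}
    return obj

def decode(obj):
    # Reconstruct numpy arrays from the metadata written by encode().
    if "__ndarray__" in obj:
        return np.frombuffer(obj["__ndarray__"], dtype=obj["dtype"]).reshape(obj["shape"])
    return obj

payload = msgpack.packb({"image": np.zeros((224, 224, 3), dtype=np.uint8)}, default=encode)
message = msgpack.unpackb(payload, object_hook=decode, raw=False)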

47x Throughput via Batch Parallel Evaluation

[Figure: speedup comparison, sequential vs. batch-parallel evaluation]

Two axes of parallelism that multiply together:

| | Sequential | Batch Parallel (50 shards, B=16) |
|---|---|---|
| Wall-clock (2000 LIBERO episodes) | ~14 h | ~18 min |
| Throughput | ~11 obs/s | ~486 obs/s |
  • Episode sharding: split (task, episode) pairs across N independent OS processes via round-robin.
  • Batch GPU inference: coalesce observations from multiple shards into a single forward pass via BatchPredictModelServer.
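
Round-robin episode sharding amounts to a strided slice over the flattened (task, episode) list; a minimal sketch (function name hypothetical):

def shard_episodes(pairs, shard_id, num_shards):
    # pairs: list of (task, episode_index) tuples, built in the same order on every shard
    return pairs[shard_id::num_shards]

# e.g. with num_shards=50, shard 0 evaluates items 0, 50, 100, ... of the list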

See the Tuning Guide and included benchmarking tools (experiments/bench_demand.py, experiments/bench_supply.py) for finding the optimal operating point.

VLA Leaderboard

[Figure: leaderboard screenshot]

657 results across 17 benchmarks and 509 models, curated from published papers.

Browse: allenai.github.io/vla-evaluation-harness/leaderboard

The leaderboard aggregates evaluation scores reported in VLA papers into a single, filterable view. Each entry is traced to its source paper and table. Benchmarks beyond the 14 supported by the harness (e.g. RoboArena, RoboChallenge, RoboTwin v1) are included for completeness.


Quick Start

pip install vla-eval

Two terminals — one for the model server (GPU), one for the benchmark (Docker):

# Terminal 1 — model server (runs on host with GPU)
vla-eval serve --config configs/model_servers/dexbotic_cogact_libero.yaml

# Terminal 2 — run evaluation (benchmark runs in Docker by default)
vla-eval run --config configs/libero_smoke_test.yaml

Results are saved to results/ as JSON. For full evaluation (10 tasks × 50 episodes):

vla-eval run --config configs/libero_spatial.yaml

For parallel evaluation (47x speedup):

# Launches 50 shards + auto-merges results
./scripts/run_sharded.sh -c configs/libero_spatial.yaml -n 50

Supported Benchmarks (14)

| Benchmark | Docker Image | Size | Python | Notes |
|---|---|---|---|---|
| LIBERO (Spatial, Goal, Object, 10, 90) | libero | 6.0 GB | 3.8 | 5 task suites |
| LIBERO-Pro | libero-pro | 6.2 GB | 3.8 | Perturbation-based robustness evaluation (5 axes) |
| LIBERO-Mem | libero-mem | 11.3 GB | 3.8 | Memory-dependent, non-Markovian tasks |
| CALVIN | calvin | 9.5 GB | 3.8 | 1000 chained 5-subtask sequences (ABC→D) |
| SimplerEnv | simpler | 4.9 GB | 3.10 | Sim-to-real transfer via ManiSkill2 |
| ManiSkill2 | maniskill2 | 9.8 GB | 3.10 | Pick, stack, cluttered grasping (5 tasks) |
| RoboCasa | robocasa | 35.6 GB | 3.11 | Kitchen manipulation on robosuite v2 / MuJoCo (365 tasks) |
| RoboTwin 2.0 | robotwin | 28.6 GB | 3.10 | Dual-arm manipulation on SAPIEN/CuRobo |
| RoboCerebra | robocerebra | 6.3 GB | 3.8 | Long-horizon manipulation on LIBERO/robosuite |
| MIKASA-Robo | mikasa-robo | 10.1 GB | 3.10 | Memory-intensive manipulation on ManiSkill3/SAPIEN (32 tasks) |
| VLABench | vlabench | 17.7 GB | 3.10 | Language-conditioned long-horizon reasoning on dm_control |
| Kinetix | kinetix | 9.5 GB | 3.11 | JAX-based 2D dynamic tasks (throw, catch, balance, locomotion) |
| RLBench | rlbench | 4.7 GB | 3.8 | CoppeliaSim/PyRep manipulation |

All images: ghcr.io/allenai/vla-evaluation-harness/<name>:latest


Supported Model Servers

Official (8)

| Model | Description |
|---|---|
| OpenVLA | Open-source VLA baseline |
| π₀ | Flow-matching policy via OpenPI |
| π₀-FAST | Fast variant of π₀ (shared server) |
| GR00T N1.6 | NVIDIA Isaac-GR00T 3B foundation model |
| OFT | OpenVLA fine-tuned with action chunking + parallel decoding |
| X-VLA | Flow-matching VLA inference via HuggingFace |
| CogACT | CogACT action-generation model (Microsoft) |
| RTC | Real-Time Chunking diffusion policy for Kinetix |

Community

| Model | Maintainer |
|---|---|
| DB-CogACT | dexbotic |
| QwenGR00T, QwenOFT, QwenPI, QwenFAST | starVLA |

All model servers support configurable action chunking (newest / average / EMA ensemble), batch inference (max_batch_size, max_wait_time), and automatic reconnection with exponential backoff.
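
The three chunk-ensembling modes can be summarized roughly as follows; this is a sketch of the idea, not the harness's exact implementation:

import numpy as np

def ensemble_action(predictions, mode="ema", alpha=0.5):
    # predictions: actions proposed for the current step by overlapping chunks, oldest first
    if mode == "newest":
        return predictions[-1]
    if mode == "average":
        return np.mean(predictions, axis=0)
    # "ema": weight newer predictions more heavily (newest gets weight 1, older decay by alpha)
    weights = np.array([alpha ** (len(predictions) - 1 - i) for i in range(len(predictions))])
    return np.average(predictions, axis=0, weights=weights)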


CLI

| Command | Description |
|---|---|
| vla-eval run | Run evaluation (Docker by default, --no-docker for local dev) |
| vla-eval serve | Launch model server via uv run <script> |
| vla-eval merge | Merge sharded result files (episode-level deduplication) |
| vla-eval validate | Validate config import paths resolve to Benchmark subclasses |
| vla-eval test-benchmark | Smoke-test a benchmark Docker image (EchoModelServer, 1 episode; no GPU needed) |
| vla-eval test-server | Smoke-test a model server (StubBenchmark, 3 steps; no Docker needed) |

Key run flags: --shard-id / --num-shards (episode sharding), --gpus / --cpus (resource allocation), --no-docker (local dev), --yes (skip prompts).
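
For example, one shard of a 50-way sharded run might be launched like this (config path and shard index illustrative):

vla-eval run --config configs/libero_spatial.yaml --shard-id 3 --num-shards 50 --yes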


Contributors

Core contributor: @MilkClouds


Citation

@article{choi2026vlaeval,
  title={vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models},
  author={Choi, Suhwan and Lee, Yunsung and Park, Yubeen and Kim, Chris Dongjoo and Krishna, Ranjay and Fox, Dieter and Yu, Youngjae},
  journal={arXiv preprint arXiv:2603.13966},
  year={2026}
}