Releases: allenai/vla-evaluation-harness
v0.1.0
We're excited to release v0.1.0 — featuring systematic reproduction of published VLA benchmark scores across 6 models and 3 benchmarks with a single unified harness.
🔬 Reproduction Results
6 VLA codebases reproduced across 3 benchmarks using only their public checkpoints and this harness — no benchmark-specific forks, no manual glue code.
| Codebase | LIBERO (%) | CALVIN (len) | SimplerEnv (%) |
|---|---|---|---|
| OpenVLA | 76.2 (−0.3) | — | — |
| π₀.₅ | 97.7 (+0.8) | — | — |
| OpenVLA-OFT | 96.7 (−0.4) | — | — |
| GR00T N1.6 | 94.9 (−2.1)† | — | 59.7 (−8.0)‡ |
| DB-CogACT | 94.7 (−0.2) | 4.02 (−0.04) | 63.5 (−6.0) |
| X-VLA | 97.4 (−0.7) | 4.30 (−0.13) | 94.8 (−1.0) |
Values show our score (Δ vs. reported). — = no public checkpoint. †Community checkpoint. ‡See reproduction notes.
Full per-task breakdowns and reproduction logs: docs/reproductions/.
DimSpec: Convention Validation at Startup (#19)
Every model server and benchmark now declares its action/observation format (dimensions, rotation convention, coordinate frame) via a DimSpec. The orchestrator cross-validates these specs during the HELLO handshake — catching mismatches before any GPU time is spent.
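A minimal sketch of the idea, with illustrative field names only (the real `DimSpec` and its validation live in the harness and may carry more, e.g. observation keys):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DimSpec:
    action_dim: int  # e.g. 7 for EE delta + gripper
    rotation: str    # "euler" | "axangle" | "rot6d" | "quat"
    frame: str       # "base" | "world" | "camera"

def check_compat(server: DimSpec, benchmark: DimSpec) -> None:
    """Cross-validate specs at HELLO time, before any rollout starts."""
    for field in ("action_dim", "rotation", "frame"):
        s, b = getattr(server, field), getattr(benchmark, field)
        if s != b:
            raise ValueError(f"DimSpec mismatch on {field!r}: server={s} benchmark={b}")
```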
SimplerEnv Benchmark Rewrite (#25)
SimplerEnv has been rewritten to use `simpler_env.make(task_name)` with `prepackaged_config`, replacing the previous manual environment setup. Image-resize responsibility has moved from benchmarks to model servers: benchmarks now send native-resolution images. 5 models were reproduced after the rewrite.
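The new setup path looks roughly like this (the task name is an example from SimplerEnv's own docs; how the harness threads `prepackaged_config` through is elided):

```python
import simpler_env

# Benchmark side: build the env from a prepackaged task config.
env = simpler_env.make("google_robot_pick_coke_can")
obs, reset_info = env.reset()
# Images are forwarded at native resolution; each model server now
# performs its own resize to the input size its checkpoint expects.
```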
New Benchmark
- RoboMME (#12, @alohays) — 16 memory-augmented manipulation tasks across 4 cognitive suites (Counting, Permanence, Reference, Imitation). First benchmark contribution.
Model Server Fixes
- X-VLA: rot6d convention for LIBERO, euler→axangle for SimplerEnv delta control, absolute EE control, base-relative EE pose computation
- GR00T: `accumulate_success` + overlay removal, `prepackaged_config`, base-relative EE state, replace `transforms3d` with local `rotation.py`
- starVLA: configurable unnormalization — `minmax` vs `q99` (#23, @junhalee-sqzb); logging hijack fix (#17, @junhalee-sqzb)
- DB-CogACT: SimplerEnv image resize regression fix
Orchestrator & Infrastructure
- Error isolation: `failure_reason` + `failure_detail` (with traceback) on every episode error; infra errors excluded from success metrics
- Filelock: shard result file collision prevention for parallel evaluations; `try/finally` lock release guarantee (see the sketch after this list)
- Progress monitoring: `.progress` files for live shard tracking
- Reconnect HELLO: re-establish protocol on implicit WebSocket reconnect
- CLI: `--output-dir` override, JSON string parsing for list/dict arguments
- Orchestrator refactor: deduplicate output paths, error handlers, and progress tracking
Leaderboard & Frontend
- Mobile performance (#13): 50-row pagination + shared tooltip replaces 13,800 DOM elements — ~90% node reduction
- Data update (#18): 17 benchmarks, 512 models, 661 results
Documentation
- Structured reproduction results under `docs/reproductions/`
- Cross-benchmark pipeline verification audit
- Common reproduction pitfalls guide (unnormalization modes, internal forks, Docker rebuild)
Contributors
v0.0.3
Release v0.0.3
Highlights
Critical Fix: Python 3.8 Benchmark Compatibility (#9)
functools.cache (Python 3.9+) has been replaced with functools.lru_cache, fixing a runtime crash that prevented all Python 3.8 benchmark Docker images from working in v0.0.2.
While v0.0.2 fixed the dict[str, Any] / TypeAlias import errors from v0.0.1, it inadvertently introduced a new Python 3.8 incompatibility via functools.cache. This release resolves the issue — all 6 affected benchmarks (LIBERO, LIBERO-Pro, LIBERO-Mem, CALVIN, RoboCerebra, RLBench) now run correctly on Python 3.8.
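The fix is a one-line substitution, since `functools.lru_cache(maxsize=None)` behaves like `functools.cache` but exists on 3.8 (the decorated function below is a hypothetical example):

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # was: @functools.cache (Python 3.9+ only)
def load_task_metadata(task_name: str) -> dict:
    ...
```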
Important
If you are running Python 3.8 benchmarks, upgrade to v0.0.3. Both v0.0.1 and v0.0.2 are broken for these images.
Log of benchmark smoke test
$ vla-eval test --benchmark --parallel
vla-eval smoke tests
========================================
BENCHMARK
✓ libero_pro success_rate=0% 26.5s
✓ libero success_rate=0% 27.3s
✓ calvin success_rate=0% 21.3s
✓ libero_mem success_rate=0% 25.7s
✓ maniskill2 success_rate=0% 14.7s
✓ simpler success_rate=0% 33.0s
✓ robocasa success_rate=0% 25.1s
✓ mikasa success_rate=0% 16.8s
✓ vlabench success_rate=0% 33.4s
✓ rlbench success_rate=0% 11.7s
✓ robocerebra success_rate=0% 22.2s
✓ robotwin completed (no result file) 86.3s
✓ kinetix success_rate=0% 58.1s
========================================
Results: 13 passed, 0 failed, 0 skipped (total: 402.1s)
Benchmark Protocol Audit (#7)
All 17 benchmark protocols have been audited and corrected against their source papers:
- MIKASA-Robo: restored the standard 5-task protocol with paper-verified scores
- SimplerEnv: fixed data integrity issues and added missing Xiaomi-Robotics-0 results
- Leaderboard data: sorted `results.json` by (benchmark, model) for consistency (see the sketch below)
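The sort itself is simple; a sketch assuming each entry carries `benchmark` and `model` keys:

```python
import json

with open("results.json") as f:
    results = json.load(f)

# Deterministic (benchmark, model) ordering keeps leaderboard diffs reviewable.
results.sort(key=lambda r: (r["benchmark"], r["model"]))

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```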
Bug Fixes
- #9 Replace `functools.cache` with `lru_cache` for Python 3.8 compatibility
- #7 Audit and fix benchmark protocols across all 17 benchmarks
- #6 Fix SimplerEnv data integrity + add missing Xiaomi-Robotics-0 results
- #3 Fix action unnormalization for StarVLA (@junha-l)
Leaderboard & CI
- Added GitHub labeler workflow for leaderboard file changes
- CI now includes `citations.json` with weekly leaderboard sync
- Added `validate.py` sort/format validation with `--fix` option (#5)
- New leaderboard section in README with badge, dedicated README, and GitHub link
Docs & Infra
- Added issue and pull request templates
- Enhanced CONTRIBUTING.md with detailed contribution types
- Added `.gitignore` and `.python-version` files
Contributors
v0.0.2
Release v0.0.2
Highlights
vla-eval test — Unified Smoke Test CLI
The three separate commands (validate, test-server, test-benchmark) have been replaced with a single vla-eval test entry point.
vla-eval test # config validation only (fast, safe default)
vla-eval test --all # run all categories
vla-eval test --server # model server smoke tests only
vla-eval test --benchmark # benchmark smoke tests only
vla-eval test --list # show inventory + readiness
- Parallel GPU execution: `--parallel [N]` with automatic GPU slot allocation
- Graceful Ctrl+C: prints partial results + auto-saves stderr logs to `results/smoke-logs/`
- Smoke step cap: `max_steps=50` prevents timeouts on slow benchmarks
- `--fail-fast` flag; `--list` readiness overview
Docker Dev Mode
All 13 benchmark Dockerfiles now use editable install (uv pip install -e .). The new --dev flag on vla-eval run bind-mounts host src/ into the container, enabling rapid iteration without image rebuilds.
vla-eval run --config configs/libero_smoke_test.yaml --dev
Rich CLI Output
Colored pass/fail/skip symbols, category headers, and status lines across all CLI output. Auto-disables when piped (NO_COLOR, isatty, TERM=dumb). rich is lazy-imported to keep CLI startup fast.
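The detection logic is roughly the following sketch (function names are illustrative, not the actual CLI internals):

```python
import os
import sys

def color_enabled() -> bool:
    if os.environ.get("NO_COLOR"):      # https://no-color.org convention
        return False
    if os.environ.get("TERM") == "dumb":
        return False
    return sys.stdout.isatty()          # False when piped or redirected

def get_console():
    from rich.console import Console    # lazy import keeps CLI startup fast
    return Console(no_color=not color_enabled())
```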
Bug Fixes
- Python 3.8 compatibility: `dict[str, Any]` → `Dict[str, Any]`, `TypeAlias` guarded behind `TYPE_CHECKING` — fixes import errors in 6 benchmark Docker images (LIBERO, LIBERO-Pro, LIBERO-Mem, CALVIN, RoboCerebra, RLBench); see the sketch after this list
- RLBench Dockerfile: extract BuildKit-only heredoc to standalone `rlbench_entrypoint.sh`
- RLBench entrypoint: replace `sleep` with Xvfb socket polling for reliable startup
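The compatibility pattern, sketched (the `ObsDict` alias is a hypothetical example, not the actual name in `types.py`):

```python
from typing import TYPE_CHECKING, Any, Dict

if TYPE_CHECKING:
    from typing import TypeAlias  # 3.10+; only imported by type checkers

# was: ObsDict: TypeAlias = dict[str, Any]  -> fails to import on 3.8
ObsDict: "TypeAlias" = Dict[str, Any]  # quoted annotation is never evaluated at runtime
```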
CI & Leaderboard
- Merged redundant `sync-external.yml` into `update-data.yml`
- Added `--source` filter input to `update-data` workflow dispatch
- Fixed unquoted `${{ }}` YAML expressions that broke workflow parsing
- All leaderboard scripts now write structured summaries to `$GITHUB_OUTPUT` for informative PR bodies
- Synced external leaderboard scores (RoboChallenge + RoboArena)
Breaking Changes
`vla-eval validate`, `vla-eval test-server`, and `vla-eval test-benchmark` have been removed. Use `vla-eval test` instead.
Contributors
v0.0.1
Release v0.0.1: Initial Release
vla-evaluation-harness — One framework to evaluate any VLA model on any robot simulation benchmark.
Caution
Python 3.8 benchmark images are broken in this release. Use v0.0.2 instead.
types.py uses dict[str, Any] (PEP 585) and unguarded TypeAlias, which are not available in Python 3.8. This causes an immediate import error in 6 out of 14 benchmark Docker images that run Python 3.8:
| Affected Benchmarks | Docker Image Python |
|---|---|
| LIBERO (Spatial / Goal / Object / 10 / 90) | 3.8 |
| LIBERO-Pro | 3.8 |
| LIBERO-Mem | 3.8 |
| CALVIN | 3.8 |
| RoboCerebra | 3.8 |
| RLBench | 3.8 |
Benchmarks on Python 3.10+ (SimplerEnv, ManiSkill2, VLABench, RoboTwin, RoboCasa, MIKASA-Robo) and 3.11 (Kinetix) are unaffected.
Highlights
This is the first public release of vla-eval, a unified evaluation framework for Vision-Language-Action (VLA) models. Integrate a model once, integrate a benchmark once — the full cross-evaluation matrix fills itself.
Complete Decoupling of Models and Benchmarks
- Benchmarks run inside Docker containers with pinned environments.
- Model servers run on the host with GPU access as standalone uv scripts — zero manual setup.
- The two communicate over WebSocket + msgpack — a compact binary protocol with numpy array encoding and security-hardened deserialization.
No more private eval forks per benchmark. Model code never touches benchmark dependencies and vice versa.
Model Server (host, GPU) ←── WebSocket/msgpack ──→ Benchmark (Docker container)
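A rough sketch of how numpy arrays can ride over msgpack (the `__nd__` tag and exact wire format are illustrative; the harness's real codec may differ):

```python
import msgpack
import numpy as np

def encode(obj):
    """msgpack `default` hook: turn ndarrays into a tagged dict of raw bytes."""
    if isinstance(obj, np.ndarray):
        return {"__nd__": True, "dtype": obj.dtype.str,
                "shape": list(obj.shape), "data": obj.tobytes()}
    raise TypeError(f"unsupported type: {type(obj)}")

def decode(obj):
    """msgpack `object_hook`: rebuild ndarrays; plain dicts pass through."""
    if obj.get("__nd__"):
        return np.frombuffer(obj["data"], dtype=obj["dtype"]).reshape(obj["shape"])
    return obj

payload = msgpack.packb({"obs": np.zeros((224, 224, 3), np.uint8)}, default=encode)
msg = msgpack.unpackb(payload, object_hook=decode, raw=False)  # no pickle involved
```

Avoiding pickle entirely is what makes the deserialization side safe to expose over a socket.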
47x Throughput via Batch Parallel Evaluation
Two axes of parallelism that multiply together:
| | Sequential | Batch Parallel (50 shards, B=16) |
|---|---|---|
| Wall-clock (2000 LIBERO episodes) | ~14 h | ~18 min |
| Throughput | ~11 obs/s | ~486 obs/s |
- Episode sharding: split `(task, episode)` pairs across N independent OS processes via round-robin (see the sketch after this list).
- Batch GPU inference: coalesce observations from multiple shards into a single forward pass via `BatchPredictModelServer`.
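Round-robin sharding is a few lines; a minimal sketch (`shard_episodes` is illustrative, not the harness API):

```python
from itertools import product

def shard_episodes(tasks, episodes_per_task, shard_id, num_shards):
    """Return this shard's slice of all (task, episode) pairs."""
    pairs = list(product(tasks, range(episodes_per_task)))
    return pairs[shard_id::num_shards]  # round-robin: every num_shards-th pair

# e.g. 10 tasks x 50 episodes over 50 shards -> 10 pairs per shard
shard = shard_episodes([f"task_{i}" for i in range(10)], 50, shard_id=0, num_shards=50)
```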
See the Tuning Guide and included benchmarking tools (experiments/bench_demand.py, experiments/bench_supply.py) for finding the optimal operating point.
VLA Leaderboard
657 results across 17 benchmarks and 509 models, curated from published papers.
Browse: allenai.github.io/vla-evaluation-harness/leaderboard
The leaderboard aggregates evaluation scores reported in VLA papers into a single, filterable view. Each entry is traced to its source paper and table. Benchmarks beyond the 14 supported by the harness (e.g. RoboArena, RoboChallenge, RoboTwin v1) are included for completeness.
Quick Start
pip install vla-eval
Two terminals — one for the model server (GPU), one for the benchmark (Docker):
# Terminal 1 — model server (runs on host with GPU)
vla-eval serve --config configs/model_servers/dexbotic_cogact_libero.yaml
# Terminal 2 — run evaluation (benchmark runs in Docker by default)
vla-eval run --config configs/libero_smoke_test.yaml
Results are saved to results/ as JSON. For full evaluation (10 tasks × 50 episodes):
vla-eval run --config configs/libero_spatial.yaml
For parallel evaluation (47x speedup):
# Launches 50 shards + auto-merges results
./scripts/run_sharded.sh -c configs/libero_spatial.yaml -n 50
Supported Benchmarks (14)
| Benchmark | Docker Image | Size | Python | Notes |
|---|---|---|---|---|
| LIBERO (Spatial, Goal, Object, 10, 90) | `libero` | 6.0 GB | 3.8 | 5 task suites |
| LIBERO-Pro | `libero-pro` | 6.2 GB | 3.8 | Perturbation-based robustness evaluation (5 axes) |
| LIBERO-Mem | `libero-mem` | 11.3 GB | 3.8 | Memory-dependent, non-Markovian tasks |
| CALVIN | `calvin` | 9.5 GB | 3.8 | 1000 chained 5-subtask sequences (ABC→D) |
| SimplerEnv | `simpler` | 4.9 GB | 3.10 | Sim-to-real transfer via ManiSkill2 |
| ManiSkill2 | `maniskill2` | 9.8 GB | 3.10 | Pick, stack, cluttered grasping (5 tasks) |
| RoboCasa | `robocasa` | 35.6 GB | 3.11 | Kitchen manipulation on robosuite v2 / MuJoCo (365 tasks) |
| RoboTwin 2.0 | `robotwin` | 28.6 GB | 3.10 | Dual-arm manipulation on SAPIEN/CuRobo |
| RoboCerebra | `robocerebra` | 6.3 GB | 3.8 | Long-horizon manipulation on LIBERO/robosuite |
| MIKASA-Robo | `mikasa-robo` | 10.1 GB | 3.10 | Memory-intensive manipulation on ManiSkill3/SAPIEN (32 tasks) |
| VLABench | `vlabench` | 17.7 GB | 3.10 | Language-conditioned long-horizon reasoning on dm_control |
| Kinetix | `kinetix` | 9.5 GB | 3.11 | JAX-based 2D dynamic tasks (throw, catch, balance, locomotion) |
| RLBench | `rlbench` | 4.7 GB | 3.8 | CoppeliaSim/PyRep manipulation |
All images: ghcr.io/allenai/vla-evaluation-harness/<name>:latest
Supported Model Servers
Official (8)
| Model | Description |
|---|---|
| OpenVLA | Open-source VLA baseline |
| π₀ | Flow-matching policy via OpenPI |
| π₀-FAST | Fast variant of π₀ (shared server) |
| GR00T N1.6 | NVIDIA Isaac-GR00T 3B foundation model |
| OFT | OpenVLA fine-tuned with action chunking + parallel decoding |
| X-VLA | Flow-matching VLA inference via HuggingFace |
| CogACT | CogACT action-generation model (Microsoft) |
| RTC | Real-Time Chunking diffusion policy for Kinetix |
Community
| Model | Maintainer |
|---|---|
| DB-CogACT | dexbotic |
| QwenGR00T, QwenOFT, QwenPI, QwenFAST | starVLA |
All model servers support configurable action chunking (newest / average / EMA ensemble), batch inference (max_batch_size, max_wait_time), and automatic reconnection with exponential backoff.
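A sketch of the three chunk-ensembling modes over overlapping action predictions (the weighting below is illustrative; the harness's exact implementation may differ):

```python
import numpy as np

def ensemble(preds: list, mode: str = "ema", alpha: float = 0.5) -> np.ndarray:
    """Combine overlapping predictions for the current timestep, oldest first."""
    acts = np.stack(preds)
    if mode == "newest":
        return acts[-1]           # trust only the latest chunk
    if mode == "average":
        return acts.mean(axis=0)  # uniform weights
    # EMA: exponentially down-weight older predictions
    w = np.array([alpha ** (len(acts) - 1 - i) for i in range(len(acts))])
    return (w[:, None] * acts).sum(axis=0) / w.sum()
```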
CLI
| Command | Description |
|---|---|
| `vla-eval run` | Run evaluation (Docker by default, `--no-docker` for local dev) |
| `vla-eval serve` | Launch model server via `uv run <script>` |
| `vla-eval merge` | Merge sharded result files (episode-level deduplication) |
| `vla-eval validate` | Validate config import paths resolve to Benchmark subclasses |
| `vla-eval test-benchmark` | Smoke-test a benchmark Docker image (EchoModelServer, 1 episode — no GPU needed) |
| `vla-eval test-server` | Smoke-test a model server (StubBenchmark, 3 steps — no Docker needed) |
Key run flags: --shard-id / --num-shards (episode sharding), --gpus / --cpus (resource allocation), --no-docker (local dev), --yes (skip prompts).
Contributors
Core contributor: @MilkClouds
Citation
@article{choi2026vlaeval,
title={vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models},
author={Choi, Suhwan and Lee, Yunsung and Park, Yubeen and Kim, Chris Dongjoo and Krishna, Ranjay and Fox, Dieter and Yu, Youngjae},
journal={arXiv preprint arXiv:2603.13966},
year={2026}
}