Releases: allenai/vla-evaluation-harness
v0.1.0
We're excited to release v0.1.0 — featuring systematic reproduction of published VLA benchmark scores across 6 models and 3 benchmarks with a single unified harness.
🔬 Reproduction Results
6 VLA codebases reproduced across 3 benchmarks using only their public checkpoints and this harness — no benchmark-specific forks, no manual glue code.
| Codebase | LIBERO (%) | CALVIN (len) | SimplerEnv (%) |
|---|---|---|---|
| OpenVLA | 76.2 (−0.3) | — | — |
| π₀.₅ | 97.7 (+0.8) | — | — |
| OpenVLA-OFT | 96.7 (−0.4) | — | — |
| GR00T N1.6 | 94.9 (−2.1)† | — | 59.7 (−8.0)‡ |
| DB-CogACT | 94.7 (−0.2) | 4.02 (−0.04) | 63.5 (−6.0) |
| X-VLA | 97.4 (−0.7) | 4.30 (−0.13) | 94.8 (−1.0) |
Values show our score (Δ vs. reported). — = no public checkpoint. †Community checkpoint. ‡See reproduction notes.
Full per-task breakdowns and reproduction logs: docs/reproductions/.
DimSpec: Convention Validation at Startup (#19)
Every model server and benchmark now declares its action/observation format (dimensions, rotation convention, coordinate frame) via a DimSpec. The orchestrator cross-validates these specs during the HELLO handshake — catching mismatches before any GPU time is spent.
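A minimal sketch of the idea, with illustrative field names only (the real `DimSpec` and its validation live in the harness and may carry more, e.g. observation keys):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DimSpec:
    action_dim: int  # e.g. 7 for EE delta + gripper
    rotation: str    # "euler" | "axangle" | "rot6d" | "quat"
    frame: str       # "base" | "world" | "camera"

def check_compat(server: DimSpec, benchmark: DimSpec) -> None:
    """Cross-validate specs at HELLO time, before any rollout starts."""
    for field in ("action_dim", "rotation", "frame"):
        s, b = getattr(server, field), getattr(benchmark, field)
        if s != b:
            raise ValueError(f"DimSpec mismatch on {field!r}: server={s} benchmark={b}")
```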
SimplerEnv Benchmark Rewrite (#25)
SimplerEnv has been rewritten to use `simpler_env.make(task_name)` with `prepackaged_config`, replacing the previous manual environment setup. Image-resize responsibility has moved from benchmarks to model servers: benchmarks now send native-resolution images. 5 models were reproduced after the rewrite.
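The new setup path looks roughly like this (the task name is an example from SimplerEnv's own docs; how the harness threads `prepackaged_config` through is elided):

```python
import simpler_env

# Benchmark side: build the env from a prepackaged task config.
env = simpler_env.make("google_robot_pick_coke_can")
obs, reset_info = env.reset()
# Images are forwarded at native resolution; each model server now
# performs its own resize to the input size its checkpoint expects.
```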
New Benchmark
- RoboMME (#12, @alohays) — 16 memory-augmented manipulation tasks across 4 cognitive suites (Counting, Permanence, Reference, Imitation). First benchmark contribution.
Model Server Fixes
- X-VLA: rot6d convention for LIBERO, euler→axangle for SimplerEnv delta control, absolute EE control, base-relative EE pose computation
- GR00T: `accumulate_success` + overlay removal, `prepackaged_config`, base-relative EE state, replace `transforms3d` with local `rotation.py`
- starVLA: configurable unnormalization — `minmax` vs `q99` (#23, @junhalee-sqzb); logging hijack fix (#17, @junhalee-sqzb)
- DB-CogACT: SimplerEnv image resize regression fix
Orchestrator & Infrastructure
- Error isolation: `failure_reason` + `failure_detail` (with traceback) on every episode error; infra errors excluded from success metrics
- Filelock: shard result file collision prevention for parallel evaluations; `try/finally` lock release guarantee (see the sketch after this list)
- Progress monitoring: `.progress` files for live shard tracking
- Reconnect HELLO: re-establish protocol on implicit WebSocket reconnect
- CLI: `--output-dir` override, JSON string parsing for list/dict arguments
- Orchestrator refactor: deduplicate output paths, error handlers, and progress tracking
Leaderboard & Frontend
- Mobile performance (#13): 50-row pagination + shared tooltip replaces 13,800 DOM elements — ~90% node reduction
- Data update (#18): 17 benchmarks, 512 models, 661 results
Documentation
- Structured reproduction results under `docs/reproductions/`
- Cross-benchmark pipeline verification audit
- Common reproduction pitfalls guide (unnormalization modes, internal forks, Docker rebuild)
Contributors
v0.0.3
Release v0.0.3
Highlights
Critical Fix: Python 3.8 Benchmark Compatibility (#9)
functools.cache (Python 3.9+) has been replaced with functools.lru_cache, fixing a runtime crash that prevented all Python 3.8 benchmark Docker images from working in v0.0.2.
While v0.0.2 fixed the dict[str, Any] / TypeAlias import errors from v0.0.1, it inadvertently introduced a new Python 3.8 incompatibility via functools.cache. This release resolves the issue — all 6 affected benchmarks (LIBERO, LIBERO-Pro, LIBERO-Mem, CALVIN, RoboCerebra, RLBench) now run correctly on Python 3.8.
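The fix is a one-line substitution, since `functools.lru_cache(maxsize=None)` behaves like `functools.cache` but exists on 3.8 (the decorated function below is a hypothetical example):

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # was: @functools.cache (Python 3.9+ only)
def load_task_metadata(task_name: str) -> dict:
    ...
```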
Important
If you are running Python 3.8 benchmarks, upgrade to v0.0.3. Both v0.0.1 and v0.0.2 are broken for these images.
Log of benchmark smoke test
$ vla-eval test --benchmark --parallel
vla-eval smoke tests
========================================
BENCHMARK
✓ libero_pro success_rate=0% 26.5s
✓ libero success_rate=0% 27.3s
✓ calvin success_rate=0% 21.3s
✓ libero_mem success_rate=0% 25.7s
✓ maniskill2 success_rate=0% 14.7s
✓ simpler success_rate=0% 33.0s
✓ robocasa success_rate=0% 25.1s
✓ mikasa success_rate=0% 16.8s
✓ vlabench success_rate=0% 33.4s
✓ rlbench success_rate=0% 11.7s
✓ robocerebra success_rate=0% 22.2s
✓ robotwin completed (no result file) 86.3s
✓ kinetix success_rate=0% 58.1s
========================================
Results: 13 passed, 0 failed, 0 skipped (total: 402.1s)
Benchmark Protocol Audit (#7)
All 17 benchmark protocols have been audited and corrected against their source papers:
- MIKASA-Robo: restored the standard 5-task protocol with paper-verified scores
- SimplerEnv: fixed data integrity issues and added missing Xiaomi-Robotics-0 results
- Leaderboard data: sorted `results.json` by (benchmark, model) for consistency (see the sketch below)
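The sort itself is simple; a sketch assuming each entry carries `benchmark` and `model` keys:

```python
import json

with open("results.json") as f:
    results = json.load(f)

# Deterministic (benchmark, model) ordering keeps leaderboard diffs reviewable.
results.sort(key=lambda r: (r["benchmark"], r["model"]))

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```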
Bug Fixes
- #9 Replace `functools.cache` with `lru_cache` for Python 3.8 compatibility
- #7 Audit and fix benchmark protocols across all 17 benchmarks
- #6 Fix SimplerEnv data integrity + add missing Xiaomi-Robotics-0 results
- #3 Fix action unnormalization for StarVLA (@junha-l)
Leaderboard & CI
- Added GitHub labeler workflow for leaderboard file changes
- CI now includes `citations.json` with weekly leaderboard sync
- Added `validate.py` sort/format validation with `--fix` option (#5)
- New leaderboard section in README with badge, dedicated README, and GitHub link
Docs & Infra
- Added issue and pull request templates
- Enhanced CONTRIBUTING.md with detailed contribution types
- Added `.gitignore` and `.python-version` files
Contributors
v0.0.2
Release v0.0.2
Highlights
vla-eval test — Unified Smoke Test CLI
The three separate commands (validate, test-server, test-benchmark) have been replaced with a single vla-eval test entry point.
vla-eval test # config validation only (fast, safe default)
vla-eval test --all # run all categories
vla-eval test --server # model server smoke tests only
vla-eval test --benchmark # benchmark smoke tests only
vla-eval test --list # show inventory + readiness
- Parallel GPU execution: `--parallel [N]` with automatic GPU slot allocation
- Graceful Ctrl+C: prints partial results + auto-saves stderr logs to `results/smoke-logs/`
- Smoke step cap: `max_steps=50` prevents timeouts on slow benchmarks
- `--fail-fast` flag; `--list` readiness overview
Docker Dev Mode
All 13 benchmark Dockerfiles now use editable install (uv pip install -e .). The new --dev flag on vla-eval run bind-mounts host src/ into the container, enabling rapid iteration without image rebuilds.
vla-eval run --config configs/libero_smoke_test.yaml --dev
Rich CLI Output
Colored pass/fail/skip symbols, category headers, and status lines across all CLI output. Auto-disables when piped (NO_COLOR, isatty, TERM=dumb). rich is lazy-imported to keep CLI startup fast.
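The detection logic is roughly the following sketch (function names are illustrative, not the actual CLI internals):

```python
import os
import sys

def color_enabled() -> bool:
    if os.environ.get("NO_COLOR"):      # https://no-color.org convention
        return False
    if os.environ.get("TERM") == "dumb":
        return False
    return sys.stdout.isatty()          # False when piped or redirected

def get_console():
    from rich.console import Console    # lazy import keeps CLI startup fast
    return Console(no_color=not color_enabled())
```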
Bug Fixes
- Python 3.8 compatibility: `dict[str, Any]` → `Dict[str, Any]`, `TypeAlias` guarded behind `TYPE_CHECKING` — fixes import errors in 6 benchmark Docker images (LIBERO, LIBERO-Pro, LIBERO-Mem, CALVIN, RoboCerebra, RLBench); see the sketch after this list
- RLBench Dockerfile: extract BuildKit-only heredoc to standalone `rlbench_entrypoint.sh`
- RLBench entrypoint: replace `sleep` with Xvfb socket polling for reliable startup
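The compatibility pattern, sketched (the `ObsDict` alias is a hypothetical example, not the actual name in `types.py`):

```python
from typing import TYPE_CHECKING, Any, Dict

if TYPE_CHECKING:
    from typing import TypeAlias  # 3.10+; only imported by type checkers

# was: ObsDict: TypeAlias = dict[str, Any]  -> fails to import on 3.8
ObsDict: "TypeAlias" = Dict[str, Any]  # quoted annotation is never evaluated at runtime
```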
CI & Leaderboard
- Merged redundant `sync-external.yml` into `update-data.yml`
- Added `--source` filter input to `update-data` workflow dispatch
- Fixed unquoted `${{ }}` YAML expressions that broke workflow parsing
- All leaderboard scripts now write structured summaries to `$GITHUB_OUTPUT` for informative PR bodies
- Synced external leaderboard scores (RoboChallenge + RoboArena)
Breaking Changes
`vla-eval validate`, `vla-eval test-server`, and `vla-eval test-benchmark` have been removed. Use `vla-eval test` instead.
Contributors
v0.0.1
Release v0.0.1: Initial Release
vla-evaluation-harness — One framework to evaluate any VLA model on any robot simulation benchmark.
Caution
Python 3.8 benchmark images are broken in this release. Use v0.0.2 instead.
types.py uses dict[str, Any] (PEP 585) and unguarded TypeAlias, which are not available in Python 3.8. This causes an immediate import error in 6 out of 14 benchmark Docker images that run Python 3.8:
| Affected Benchmarks | Docker Image Python |
|---|---|
| LIBERO (Spatial / Goal / Object / 10 / 90) | 3.8 |
| LIBERO-Pro | 3.8 |
| LIBERO-Mem | 3.8 |
| CALVIN | 3.8 |
| RoboCerebra | 3.8 |
| RLBench | 3.8 |
Benchmarks on Python 3.10+ (SimplerEnv, ManiSkill2, VLABench, RoboTwin, RoboCasa, MIKASA-Robo) and 3.11 (Kinetix) are unaffected.
Highlights
This is the first public release of vla-eval, a unified evaluation framework for Vision-Language-Action (VLA) models. Integrate a model once, integrate a benchmark once — the full cross-evaluation matrix fills itself.
Complete Decoupling of Models and Benchmarks
- Benchmarks run inside Docker containers with pinned environments.
- Model servers run on the host with GPU access as standalone uv scripts — zero manual setup.
- The two communicate over WebSocket + msgpack — a compact binary protocol with numpy array encoding and security-hardened deserialization.
No more private eval forks per benchmark. Model code never touches benchmark dependencies and vice versa.
Model Server (host, GPU) ←── WebSocket/msgpack ──→ Benchmark (Docker container)
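A rough sketch of how numpy arrays can ride over msgpack (the `__nd__` tag and exact wire format are illustrative; the harness's real codec may differ):

```python
import msgpack
import numpy as np

def encode(obj):
    """msgpack `default` hook: turn ndarrays into a tagged dict of raw bytes."""
    if isinstance(obj, np.ndarray):
        return {"__nd__": True, "dtype": obj.dtype.str,
                "shape": list(obj.shape), "data": obj.tobytes()}
    raise TypeError(f"unsupported type: {type(obj)}")

def decode(obj):
    """msgpack `object_hook`: rebuild ndarrays; plain dicts pass through."""
    if obj.get("__nd__"):
        return np.frombuffer(obj["data"], dtype=obj["dtype"]).reshape(obj["shape"])
    return obj

payload = msgpack.packb({"obs": np.zeros((224, 224, 3), np.uint8)}, default=encode)
msg = msgpack.unpackb(payload, object_hook=decode, raw=False)  # no pickle involved
```

Avoiding pickle entirely is what makes the deserialization side safe to expose over a socket.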
47x Throughput via Batch Parallel Evaluation
Two axes of parallelism that multiply together:
| | Sequential | Batch Parallel (50 shards, B=16) |
|---|---|---|
| Wall-clock (2000 LIBERO episodes) | ~14 h | ~18 min |
| Throughput | ~11 obs/s | ~486 obs/s |
- Episode sharding: split `(task, episode)` pairs across N independent OS processes via round-robin (see the sketch after this list).
- Batch GPU inference: coalesce observations from multiple shards into a single forward pass via `BatchPredictModelServer`.
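Round-robin sharding is a few lines; a minimal sketch (`shard_episodes` is illustrative, not the harness API):

```python
from itertools import product

def shard_episodes(tasks, episodes_per_task, shard_id, num_shards):
    """Return this shard's slice of all (task, episode) pairs."""
    pairs = list(product(tasks, range(episodes_per_task)))
    return pairs[shard_id::num_shards]  # round-robin: every num_shards-th pair

# e.g. 10 tasks x 50 episodes over 50 shards -> 10 pairs per shard
shard = shard_episodes([f"task_{i}" for i in range(10)], 50, shard_id=0, num_shards=50)
```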
See the Tuning Guide and included benchmarking tools (experiments/bench_demand.py, experiments/bench_supply.py) for finding the optimal operating point.
VLA Leaderboard
657 results across 17 benchmarks and 509 models, curated from published papers.
Browse: allenai.github.io/vla-evaluation-harness/leaderboard
The leaderboard aggregates evaluation scores reported in VLA papers into a single, filterable view. Each entry is traced to its source paper and table. Benchmarks beyond the 14 supported by the harness (e.g. RoboArena, RoboChallenge, RoboTwin v1) are included for completeness.
Quick Start
pip install vla-eval
Two terminals — one for the model server (GPU), one for the benchmark (Docker):
# Terminal 1 — model server (runs on host with GPU)
vla-eval serve --config configs/model_servers/dexbotic_cogact_libero.yaml
# Terminal 2 — run evaluation (benchmark runs in Docker by default)
vla-eval run --config configs/libero_smoke_test.yaml
Results are saved to results/ as JSON. For full evaluation (10 tasks × 50 episodes):
vla-eval run --config configs/libero_spatial.yaml
For parallel evaluation (47x speedup):
# Launches 50 shards + auto-merges results
./scripts/run_sharded.sh -c configs/libero_spatial.yaml -n 50
Supported Benchmarks (14)
| Benchmark | Docker Image | Size | Python | Notes |
|---|---|---|---|---|
| LIBERO (Spatial, Goal, Object, 10, 90) | `libero` | 6.0 GB | 3.8 | 5 task suites |
| LIBERO-Pro | `libero-pro` | 6.2 GB | 3.8 | Perturbation-based robustness evaluation (5 axes) |
| LIBERO-Mem | `libero-mem` | 11.3 GB | 3.8 | Memory-dependent, non-Markovian tasks |
| CALVIN | `calvin` | 9.5 GB | 3.8 | 1000 chained 5-subtask sequences (ABC→D) |
| SimplerEnv | `simpler` | 4.9 GB | 3.10 | Sim-to-real transfer via ManiSkill2 |
| ManiSkill2 | `maniskill2` | 9.8 GB | 3.10 | Pick, stack, cluttered grasping (5 tasks) |
| RoboCasa | `robocasa` | 35.6 GB | 3.11 | Kitchen manipulation on robosuite v2 / MuJoCo (365 tasks) |
| RoboTwin 2.0 | `robotwin` | 28.6 GB | 3.10 | Dual-arm manipulation on SAPIEN/CuRobo |
| RoboCerebra | `robocerebra` | 6.3 GB | 3.8 | Long-horizon manipulation on LIBERO/robosuite |
| MIKASA-Robo | `mikasa-robo` | 10.1 GB | 3.10 | Memory-intensive manipulation on ManiSkill3/SAPIEN (32 tasks) |
| VLABench | `vlabench` | 17.7 GB | 3.10 | Language-conditioned long-horizon reasoning on dm_control |
| Kinetix | `kinetix` | 9.5 GB | 3.11 | JAX-based 2D dynamic tasks (throw, catch, balance, locomotion) |
| RLBench | `rlbench` | 4.7 GB | 3.8 | CoppeliaSim/PyRep manipulation |
All images: ghcr.io/allenai/vla-evaluation-harness/<name>:latest
Supported Model Servers
Official (8)
| Model | Description |
|---|---|
| OpenVLA | Open-source VLA baseline |
| π₀ | Flow-matching policy via OpenPI |
| π₀-FAST | Fast variant of π₀ (shared server) |
| GR00T N1.6 | NVIDIA Isaac-GR00T 3B foundation model |
| OFT | OpenVLA fine-tuned with action chunking + parallel decoding |
| X-VLA | Flow-matching VLA inference via HuggingFace |
| CogACT | CogACT action-generation model (Microsoft) |
| RTC | Real-Time Chunking diffusion policy for Kinetix |
Community
| Model | Maintainer |
|---|---|
| DB-CogACT | dexbotic |
| QwenGR00T, QwenOFT, QwenPI, QwenFAST | starVLA |
All model servers support configurable action chunking (newest / average / EMA ensemble), batch inference (max_batch_size, max_wait_time), and automatic reconnection with exponential backoff.
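A sketch of the three chunk-ensembling modes over overlapping action predictions (the weighting below is illustrative; the harness's exact implementation may differ):

```python
import numpy as np

def ensemble(preds: list, mode: str = "ema", alpha: float = 0.5) -> np.ndarray:
    """Combine overlapping predictions for the current timestep, oldest first."""
    acts = np.stack(preds)
    if mode == "newest":
        return acts[-1]           # trust only the latest chunk
    if mode == "average":
        return acts.mean(axis=0)  # uniform weights
    # EMA: exponentially down-weight older predictions
    w = np.array([alpha ** (len(acts) - 1 - i) for i in range(len(acts))])
    return (w[:, None] * acts).sum(axis=0) / w.sum()
```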
CLI
| Command | Description |
|---|---|
| `vla-eval run` | Run evaluation (Docker by default, `--no-docker` for local dev) |
| `vla-eval serve` | Launch model server via `uv run <script>` |
| `vla-eval merge` | Merge sharded result files (episode-level deduplication) |
| `vla-eval validate` | Validate config import paths resolve to Benchmark subclasses |
| `vla-eval test-benchmark` | Smoke-test a benchmark Docker image (EchoModelServer, 1 episode — no GPU needed) |
| `vla-eval test-server` | Smoke-test a model server (StubBenchmark, 3 steps — no Docker needed) |
Key run flags: --shard-id / --num-shards (episode sharding), --gpus / --cpus (resource allocation), --no-docker (local dev), --yes (skip prompts).
Contributors
Core contributor: @MilkClouds
Citation
@article{choi2026vlaeval,
title={vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models},
author={Choi, Suhwan and Lee, Yunsung and Park, Yubeen and Kim, Chris Dongjoo and Krishna, Ranjay and Fox, Dieter and Yu, Youngjae},
journal={arXiv preprint arXiv:2603.13966},
year={2026}
}