Release v0.1.0 · allenai/vla-evaluation-harness

We're excited to release v0.1.0, which systematically reproduces published VLA benchmark scores across 6 models and 3 benchmarks with a single unified harness.

🔬 Reproduction Results

6 VLA codebases reproduced across 3 benchmarks using only their public checkpoints and this harness — no benchmark-specific forks, no manual glue code.

Codebase	LIBERO (%)	CALVIN (len)	SimplerEnv (%)
OpenVLA	76.2 (−0.3)	—	—
π₀.₅	97.7 (+0.8)	—	—
OpenVLA-OFT	96.7 (−0.4)	—	—
GR00T N1.6	94.9 (−2.1)†	—	59.7 (−8.0)‡
DB-CogACT	94.7 (−0.2)	4.02 (−0.04)	63.5 (−6.0)
X-VLA	97.4 (−0.7)	4.30 (−0.13)	94.8 (−1.0)

Values show our score (Δ vs. reported). — = no public checkpoint. †Community checkpoint. ‡See reproduction notes.
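To read the table, it can help to recover the originally reported number from an entry. The snippet below is a small sketch that assumes Δ is defined as our score minus the reported score; `reported_score` is a helper name made up for this example, not part of the harness.

```python
def reported_score(ours: float, delta: float) -> float:
    """Recover the published score, assuming delta = ours - reported."""
    return round(ours - delta, 2)


# OpenVLA on LIBERO: 76.2 (-0.3)
openvla_libero = reported_score(76.2, -0.3)   # 76.5

# X-VLA on CALVIN: 4.30 (-0.13)
xvla_calvin = reported_score(4.30, -0.13)     # 4.43

print(openvla_libero, xvla_calvin)
```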

Full per-task breakdowns and reproduction logs: docs/reproductions/.

DimSpec: Convention Validation at Startup (#19)

Every model server and benchmark now declares its action/observation format (dimensions, rotation convention, coordinate frame) via a DimSpec. The orchestrator cross-validates these specs during the HELLO handshake — catching mismatches before any GPU time is spent.
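As a rough illustration of this kind of startup check, the sketch below compares two specs field by field and reports every disagreement. The field names, the `find_mismatches` helper, and the example values are hypothetical and may differ from the harness's real `DimSpec`.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DimSpec:
    """Hypothetical stand-in for a declared action format (not the real class)."""
    action_dim: int
    rotation: str   # e.g. "euler", "axangle", "rot6d"
    frame: str      # e.g. "base", "world"


def find_mismatches(server: DimSpec, benchmark: DimSpec) -> list[str]:
    """Return one readable message per field on which the two specs disagree."""
    problems = []
    for f in fields(DimSpec):
        s, b = getattr(server, f.name), getattr(benchmark, f.name)
        if s != b:
            problems.append(f"{f.name}: server={s!r} benchmark={b!r}")
    return problems


# Same dimensionality and frame, but different rotation conventions.
server_spec = DimSpec(action_dim=7, rotation="euler", frame="base")
benchmark_spec = DimSpec(action_dim=7, rotation="axangle", frame="base")

mismatches = find_mismatches(server_spec, benchmark_spec)
if mismatches:
    print("DimSpec mismatch, aborting before evaluation:", mismatches)
```

A check like this is cheap because it only compares declared metadata. The two specs above have the same shape, so a shape-only check would let the rotation mismatch through.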

SimplerEnv Benchmark Rewrite (#25)

SimplerEnv has been rewritten to use simpler_env.make(task_name) with prepackaged_config, replacing the previous manual env setup. Image resize responsibility moved from benchmarks to model servers — benchmarks now send native resolution. 5 models reproduced after the rewrite.

New Benchmark

RoboMME (#12, @alohays) — 16 memory-augmented manipulation tasks across 4 cognitive suites (Counting, Permanence, Reference, Imitation). First benchmark contribution.

Model Server Fixes

X-VLA: rot6d convention for LIBERO, euler→axangle for SimplerEnv delta control, absolute EE control, base-relative EE pose computation
GR00T: accumulate_success + overlay removal, prepackaged_config, base-relative EE state, replace transforms3d with local rotation.py
starVLA: configurable unnormalization — minmax vs q99 (#23, @junhalee-sqzb); logging hijack fix (#17, @junhalee-sqzb)
DB-CogACT: SimplerEnv image resize regression fix
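
To illustrate why the unnormalization mode in the starVLA fix matters, here is a minimal sketch of the two modes. It assumes the common convention of mapping model outputs from [-1, 1] back to a per-dimension range, bounded by either min/max or the 1st/99th percentiles. The `unnormalize` function and the statistics are hypothetical and are not taken from the starVLA server.

```python
def unnormalize(norm_action, low, high):
    """Map each value from [-1, 1] back to [low, high], per dimension."""
    return [
        0.5 * (a + 1.0) * (hi - lo) + lo
        for a, lo, hi in zip(norm_action, low, high)
    ]


# Hypothetical per-dimension dataset statistics for a 2-D action.
stats = {
    "min": [-1.0, -2.0], "max": [1.0, 2.0],   # bounds for "minmax" mode
    "q01": [-0.5, -1.0], "q99": [0.5, 1.0],   # bounds for "q99" mode
}

norm_action = [1.0, -1.0]  # model output in normalized space

minmax_action = unnormalize(norm_action, stats["min"], stats["max"])
q99_action = unnormalize(norm_action, stats["q01"], stats["q99"])

print(minmax_action)  # [1.0, -2.0]
print(q99_action)     # [0.5, -1.0]
```

In this toy case the same normalized output decodes to actions half as large under the percentile bounds. Choosing the wrong mode therefore rescales every action without raising an error.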

Orchestrator & Infrastructure

Error isolation: failure_reason + failure_detail (with traceback) on every episode error; infra errors excluded from success metrics
Filelock: shard result file collision prevention for parallel evaluations; try/finally lock release guarantee
Progress monitoring: .progress files for live shard tracking
Reconnect HELLO: re-establish protocol on implicit WebSocket reconnect
CLI: --output-dir override, JSON string parsing for list/dict arguments
Orchestrator refactor: deduplicate output paths, error handlers, and progress tracking
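
The error-isolation item can be sketched as follows. Only the `failure_reason` and `failure_detail` field names come from the notes above. The record layout, the `"infra_error"` value, and the `success_rate` helper are assumptions made for illustration.

```python
def success_rate(episodes):
    """Success rate over episodes that were not lost to infrastructure errors."""
    valid = [e for e in episodes if e.get("failure_reason") != "infra_error"]
    if not valid:
        return None  # nothing to score
    return sum(1 for e in valid if e["success"]) / len(valid)


# Hypothetical episode records.
episodes = [
    {"success": True, "failure_reason": None},
    {"success": False, "failure_reason": None},  # policy failed the task
    {
        "success": False,
        "failure_reason": "infra_error",         # e.g. a simulator crash
        "failure_detail": "Traceback (most recent call last): ...",
    },
    {"success": True, "failure_reason": None},
]

rate = success_rate(episodes)
print(f"{rate:.3f}")  # 0.667 (2 successes out of 3 valid episodes)
```

Counting the crashed episode as a task failure would give 2/4 instead, understating the model because of a fault it did not cause.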

Leaderboard & Frontend

Mobile performance (#13): 50-row pagination + shared tooltip replaces 13,800 DOM elements — ~90% node reduction
Data update (#18): 17 benchmarks, 512 models, 661 results

Documentation

Structured reproduction results under docs/reproductions/
Cross-benchmark pipeline verification audit
Common reproduction pitfalls guide (unnormalization modes, internal forks, Docker rebuild)

Contributors

@MilkClouds, @alohays, @junhalee-sqzb
