We're excited to release v0.1.0 — featuring systematic reproduction of published VLA benchmark scores across 6 models and 3 benchmarks with a single unified harness.
🔬 Reproduction Results
6 VLA codebases reproduced across 3 benchmarks using only their public checkpoints and this harness — no benchmark-specific forks, no manual glue code.
| Codebase | LIBERO (%) | CALVIN (len) | SimplerEnv (%) |
|---|---|---|---|
| OpenVLA | 76.2 (−0.3) | — | — |
| π₀.₅ | 97.7 (+0.8) | — | — |
| OpenVLA-OFT | 96.7 (−0.4) | — | — |
| GR00T N1.6 | 94.9 (−2.1)† | — | 59.7 (−8.0)‡ |
| DB-CogACT | 94.7 (−0.2) | 4.02 (−0.04) | 63.5 (−6.0) |
| X-VLA | 97.4 (−0.7) | 4.30 (−0.13) | 94.8 (−1.0) |
Values show our score (Δ vs. reported). — = no public checkpoint. †Community checkpoint. ‡See reproduction notes.
Full per-task breakdowns and reproduction logs: docs/reproductions/.
DimSpec: Convention Validation at Startup (#19)
Every model server and benchmark now declares its action/observation format (dimensions, rotation convention, coordinate frame) via a DimSpec. The orchestrator cross-validates these specs during the HELLO handshake — catching mismatches before any GPU time is spent.
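The cross-validation step can be pictured with a minimal sketch. The `DimSpec` fields below (`action_dim`, `rotation`, `frame`) and the `find_mismatches` helper are hypothetical names chosen for illustration; the harness's actual schema and handshake messages may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DimSpec:
    """Hypothetical spec; field names are illustrative, not the harness's schema."""
    action_dim: int
    rotation: str  # e.g. "rot6d", "euler", "axangle"
    frame: str     # e.g. "base", "world"


def find_mismatches(model: DimSpec, benchmark: DimSpec) -> list[str]:
    """Return one readable message per field whose values differ."""
    problems = []
    for f in fields(DimSpec):
        m, b = getattr(model, f.name), getattr(benchmark, f.name)
        if m != b:
            problems.append(f"{f.name}: model={m!r} benchmark={b!r}")
    return problems


# Assumed example: the model emits Euler angles, the benchmark expects axis-angle.
model_spec = DimSpec(action_dim=7, rotation="euler", frame="base")
bench_spec = DimSpec(action_dim=7, rotation="axangle", frame="base")

mismatches = find_mismatches(model_spec, bench_spec)
if mismatches:
    # A real orchestrator would abort the handshake here, before any rollout starts.
    print("HELLO rejected:", "; ".join(mismatches))
```

Comparing declared specs field by field keeps the failure message specific, which is the point of validating at startup rather than debugging a silently wrong rotation convention after a full evaluation run.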
SimplerEnv Benchmark Rewrite (#25)
SimplerEnv has been rewritten to use simpler_env.make(task_name) with prepackaged_config, replacing the previous manual env setup. Image resize responsibility moved from benchmarks to model servers — benchmarks now send native resolution. 5 models reproduced after the rewrite.
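To illustrate the new division of labor, here is a rough sketch of a model server resizing a native-resolution frame itself. The `resize_nearest` helper, the 480x640 input, and the 224x224 target are assumptions for illustration only; a real server would use whatever interpolation and input size its model was trained with.

```python
import numpy as np


def resize_nearest(image: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbor resize of an (H, W, C) image using integer index maps."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h  # source row for each output row
    cols = np.arange(out_w) * in_w // out_w  # source column for each output column
    return image[rows[:, None], cols]


# The benchmark sends the frame untouched (assumed 480x640 RGB here).
native_frame = np.zeros((480, 640, 3), dtype=np.uint8)

# The model server, not the benchmark, resizes to its own input size.
model_input = resize_nearest(native_frame, 224, 224)
print(model_input.shape)
```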
New Benchmark
- RoboMME (#12, @alohays) — 16 memory-augmented manipulation tasks across 4 cognitive suites (Counting, Permanence, Reference, Imitation). First benchmark contribution.
Model Server Fixes
- X-VLA: rot6d convention for LIBERO, euler→axangle for SimplerEnv delta control, absolute EE control, base-relative EE pose computation
- GR00T: `accumulate_success` + overlay removal, `prepackaged_config`, base-relative EE state, replace `transforms3d` with local `rotation.py`
- starVLA: configurable unnormalization, `minmax` vs `q99` (#23, @junhalee-sqzb); logging hijack fix (#17, @junhalee-sqzb)
- DB-CogACT: SimplerEnv image resize regression fix
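The difference between the two unnormalization modes can be shown with a small sketch. It assumes actions are normalized to [-1, 1] and that the `q99` mode pairs the 1st and 99th percentiles as bounds, as is common in VLA codebases; the statistics and the `unnormalize` helper are hypothetical, not starVLA's actual code.

```python
def unnormalize(action, low, high):
    """Map each normalized value in [-1, 1] back to its [low, high] range."""
    return [lo + 0.5 * (a + 1.0) * (hi - lo) for a, lo, hi in zip(action, low, high)]


# Hypothetical per-dimension dataset statistics for a 2-D action.
stats = {
    "min": [-1.0, -2.0], "max": [1.0, 2.0],   # bounds used by the "minmax" mode
    "q01": [-0.5, -1.0], "q99": [0.5, 1.0],   # bounds assumed for the "q99" mode
}

normalized = [1.0, -1.0]
minmax_action = unnormalize(normalized, stats["min"], stats["max"])
q99_action = unnormalize(normalized, stats["q01"], stats["q99"])

print("minmax:", minmax_action)
print("q99:   ", q99_action)
```

With these made-up statistics the same network output maps to actions that differ by a factor of two, which is why choosing the wrong mode for a checkpoint can lower benchmark scores without raising any error.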
Orchestrator & Infrastructure
- Error isolation: `failure_reason` + `failure_detail` (with traceback) on every episode error; infra errors excluded from success metrics
- Filelock: shard result file collision prevention for parallel evaluations; `try/finally` lock release guarantee
- Progress monitoring: `.progress` files for live shard tracking
- Reconnect HELLO: re-establish protocol on implicit WebSocket reconnect
- CLI: `--output-dir` override, JSON string parsing for list/dict arguments
- Orchestrator refactor: deduplicate output paths, error handlers, and progress tracking
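As a rough illustration of the error-isolation item above, the sketch below records a failure reason and traceback per episode and leaves infrastructure errors out of the success rate. Only the `failure_reason` and `failure_detail` field names come from these notes; the `"infra_error"` label, the helper functions, and the simplification that every raised exception counts as an infra error are assumptions.

```python
import traceback

INFRA_REASONS = {"infra_error"}  # hypothetical label for non-policy failures


def run_episode_safely(run_episode):
    """Run one episode; on an exception, record why instead of crashing the shard."""
    try:
        return {"success": bool(run_episode()),
                "failure_reason": None, "failure_detail": None}
    except Exception:
        # Simplification: treat every raised exception as an infrastructure error.
        return {"success": False,
                "failure_reason": "infra_error",
                "failure_detail": traceback.format_exc()}


def success_rate(results):
    """Success rate over episodes that actually ran; infra errors are excluded."""
    counted = [r for r in results if r["failure_reason"] not in INFRA_REASONS]
    if not counted:
        return None
    return sum(r["success"] for r in counted) / len(counted)


def crashing_episode():
    raise RuntimeError("simulator crashed")


results = [run_episode_safely(ep)
           for ep in (lambda: True, lambda: False, crashing_episode)]
print(success_rate(results))
```

Excluding the crashed episode from the denominator keeps a flaky simulator from being scored as a policy failure, while the stored traceback still makes the crash visible in the logs.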
Leaderboard & Frontend
- Mobile performance (#13): 50-row pagination + shared tooltip replaces 13,800 DOM elements — ~90% node reduction
- Data update (#18): 17 benchmarks, 512 models, 661 results
Documentation
- Structured reproduction results under `docs/reproductions/`
- Cross-benchmark pipeline verification audit
- Common reproduction pitfalls guide (unnormalization modes, internal forks, Docker rebuild)