v0.1.0

@MilkClouds MilkClouds released this 05 Apr 15:13
· 64 commits to main since this release

We're excited to release v0.1.0 — featuring systematic reproduction of published VLA benchmark scores across 6 models and 3 benchmarks with a single unified harness.

🔬 Reproduction Results

6 VLA codebases reproduced across 3 benchmarks using only their public checkpoints and this harness — no benchmark-specific forks, no manual glue code.

Codebase      LIBERO (%)      CALVIN (len)    SimplerEnv (%)
OpenVLA       76.2 (−0.3)     —               —
π₀.₅          97.7 (+0.8)     —               —
OpenVLA-OFT   96.7 (−0.4)     —               —
GR00T N1.6    94.9 (−2.1)†    —               59.7 (−8.0)‡
DB-CogACT     94.7 (−0.2)     4.02 (−0.04)    63.5 (−6.0)
X-VLA         97.4 (−0.7)     4.30 (−0.13)    94.8 (−1.0)

Values show our score (Δ vs. reported). — = no public checkpoint. †Community checkpoint. ‡See reproduction notes.

Full per-task breakdowns and reproduction logs: docs/reproductions/.

DimSpec: Convention Validation at Startup (#19)

Every model server and benchmark now declares its action/observation format (dimensions, rotation convention, coordinate frame) via a DimSpec. The orchestrator cross-validates these specs during the HELLO handshake — catching mismatches before any GPU time is spent.
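
As a rough illustration of this kind of startup check, the sketch below compares two specs field by field and fails fast on any disagreement. The `DimSpec` fields shown here (`action_dim`, `rotation`, `frame`) and the helper names are assumptions made for the example, not the project's actual schema or handshake code.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class DimSpec:
    """Hypothetical spec; field names are illustrative only."""
    action_dim: int
    rotation: str  # e.g. "euler", "axangle", "rot6d"
    frame: str     # e.g. "world", "base"


def find_mismatches(model: DimSpec, benchmark: DimSpec) -> List[str]:
    """Return one readable message per field on which the two specs disagree."""
    problems = []
    for f in fields(DimSpec):
        m, b = getattr(model, f.name), getattr(benchmark, f.name)
        if m != b:
            problems.append(f"{f.name}: model={m!r} benchmark={b!r}")
    return problems


def validate_handshake(model: DimSpec, benchmark: DimSpec) -> None:
    """Raise before any episode runs if the declared conventions differ."""
    problems = find_mismatches(model, benchmark)
    if problems:
        raise ValueError("DimSpec mismatch: " + "; ".join(problems))
```

Reporting every mismatched field at once, rather than stopping at the first, means a single failed handshake shows the full set of conventions that need to be fixed.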

SimplerEnv Benchmark Rewrite (#25)

SimplerEnv has been rewritten to use simpler_env.make(task_name) with prepackaged_config, replacing the previous manual env setup. Image resize responsibility moved from benchmarks to model servers — benchmarks now send native resolution. 5 models reproduced after the rewrite.
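
To make the new division of labour concrete, here is a dependency-free sketch of a model server resizing a native-resolution frame to its own input size. The function name and the nested-list image representation are assumptions for illustration; a real server would more likely use an image library and its own interpolation mode.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def resize_nearest(image: Sequence[Sequence[Pixel]], out_h: int, out_w: int) -> List[List[Pixel]]:
    """Nearest-neighbour resize of an H x W image stored as nested sequences."""
    in_h, in_w = len(image), len(image[0])
    # Map each output row/column back to a source row/column.
    rows = [min(in_h - 1, (y * in_h) // out_h) for y in range(out_h)]
    cols = [min(in_w - 1, (x * in_w) // out_w) for x in range(out_w)]
    return [[image[r][c] for c in cols] for r in rows]
```

With resizing on the server side, each model can apply the input size it was trained with, and the benchmark does not need to know it.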

New Benchmark

  • RoboMME (#12, @alohays) — 16 memory-augmented manipulation tasks across 4 cognitive suites (Counting, Permanence, Reference, Imitation). First benchmark contribution.

Model Server Fixes

  • X-VLA: rot6d convention for LIBERO, euler→axangle for SimplerEnv delta control, absolute EE control, base-relative EE pose computation
  • GR00T: accumulate_success + overlay removal, prepackaged_config, base-relative EE state, replace transforms3d with local rotation.py
  • starVLA: configurable unnormalization — minmax vs q99 (#23, @junhalee-sqzb); logging hijack fix (#17, @junhalee-sqzb)
  • DB-CogACT: SimplerEnv image resize regression fix
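
The minmax vs q99 choice in the starVLA fix can be sketched as below. This assumes actions were normalized to [-1, 1] during training and that the statistics dictionary uses the keys `min`, `max`, `q01`, and `q99`; both the formula and the key names are assumptions for illustration, not starVLA's actual code.

```python
from typing import Dict, List


def unnormalize(action: List[float], stats: Dict[str, List[float]], mode: str = "q99") -> List[float]:
    """Map a [-1, 1] action back to dataset units.

    mode="minmax" rescales with the per-dimension min/max;
    mode="q99" rescales with the 1st/99th percentiles instead.
    """
    if mode == "minmax":
        low, high = stats["min"], stats["max"]
    elif mode == "q99":
        low, high = stats["q01"], stats["q99"]
    else:
        raise ValueError(f"unknown unnormalization mode: {mode!r}")
    return [0.5 * (a + 1.0) * (hi - lo) + lo for a, lo, hi in zip(action, low, high)]
```

The same network output maps to different physical actions under the two modes, so evaluating with a mode other than the one used in training will silently rescale every action.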

Orchestrator & Infrastructure

  • Error isolation: failure_reason + failure_detail (with traceback) on every episode error; infra errors excluded from success metrics
  • Filelock: shard result file collision prevention for parallel evaluations; try/finally lock release guarantee
  • Progress monitoring: .progress files for live shard tracking
  • Reconnect HELLO: re-establish protocol on implicit WebSocket reconnect
  • CLI: --output-dir override, JSON string parsing for list/dict arguments
  • Orchestrator refactor: deduplicate output paths, error handlers, and progress tracking
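
The error-isolation item can be pictured with the following minimal sketch. It records `failure_reason` and `failure_detail` for an episode that raises, then leaves errored episodes out of the success rate. Treating every exception as an infrastructure error is a simplifying assumption here, and the function names are hypothetical rather than the orchestrator's real API.

```python
import traceback
from typing import Callable, Dict, List, Optional


def run_episode_isolated(run: Callable[[], bool]) -> Dict[str, object]:
    """Run one episode; capture any exception instead of aborting the shard."""
    try:
        return {"success": bool(run()), "failure_reason": None, "failure_detail": None}
    except Exception as exc:
        return {
            "success": False,
            "failure_reason": type(exc).__name__,
            "failure_detail": traceback.format_exc(),  # full traceback text
        }


def success_rate(results: List[Dict[str, object]]) -> Optional[float]:
    """Success rate over episodes that ran to completion; errored ones are excluded."""
    valid = [r for r in results if r["failure_reason"] is None]
    if not valid:
        return None
    return sum(1 for r in valid if r["success"]) / len(valid)
```

Excluding errored episodes from the denominator keeps a crashed simulator or a dropped connection from being counted as a policy failure.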

Leaderboard & Frontend

  • Mobile performance (#13): 50-row pagination + shared tooltip replaces 13,800 DOM elements — ~90% node reduction
  • Data update (#18): 17 benchmarks, 512 models, 661 results

Documentation

  • Structured reproduction results under docs/reproductions/
  • Cross-benchmark pipeline verification audit
  • Common reproduction pitfalls guide (unnormalization modes, internal forks, Docker rebuild)

Contributors

@MilkClouds, @alohays, @junhalee-sqzb