
v0.0.2


@MilkClouds MilkClouds released this 20 Mar 03:14
· 200 commits to main since this release

Release v0.0.2

Highlights

vla-eval test — Unified Smoke Test CLI

The three separate commands (validate, test-server, test-benchmark) have been replaced with a single vla-eval test entry point.

vla-eval test                # config validation only (fast, safe default)
vla-eval test --all          # run all categories
vla-eval test --server       # model server smoke tests only
vla-eval test --benchmark    # benchmark smoke tests only
vla-eval test --list         # show inventory + readiness
  • Parallel GPU execution: --parallel [N] with automatic GPU slot allocation
  • Graceful Ctrl+C: prints partial results + auto-saves stderr logs to results/smoke-logs/
  • Smoke step cap: max_steps=50 prevents timeouts on slow benchmarks
  • --fail-fast to stop at the first failure, plus the --list readiness overview
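
As a rough illustration of what "automatic GPU slot allocation" could look like, the sketch below hands each running job exclusive use of one GPU id through a queue. It is not taken from the vla-eval source: run_with_gpu_slots, run_job, and the use of CUDA_VISIBLE_DEVICES are assumptions made for this example.

```python
import os
import queue
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_with_gpu_slots(
    jobs: List[str],
    gpu_ids: List[int],
    run_job: Callable[[str, Dict[str, str]], bool],
) -> Dict[str, bool]:
    """Run jobs concurrently; each running job holds exactly one GPU id.

    Hypothetical helper: `run_job(name, env)` is assumed to launch one
    smoke test with the given environment and return True on success.
    """
    slots: "queue.Queue[int]" = queue.Queue()
    for gpu in gpu_ids:
        slots.put(gpu)

    def worker(job: str) -> bool:
        gpu = slots.get()  # blocks until a GPU slot is free
        try:
            # Assumption: the job selects its GPU via CUDA_VISIBLE_DEVICES.
            env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
            return run_job(job, env)
        finally:
            slots.put(gpu)  # release the slot even if the job raised

    # One worker thread per GPU, so at most len(gpu_ids) jobs run at once.
    with ThreadPoolExecutor(max_workers=len(gpu_ids)) as pool:
        return dict(zip(jobs, pool.map(worker, jobs)))
```

Returning the slot in a finally block means a crashed job cannot leak a GPU, which matters when later jobs are still waiting in the queue.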

Docker Dev Mode

All 13 benchmark Dockerfiles now use editable install (uv pip install -e .). The new --dev flag on vla-eval run bind-mounts host src/ into the container, enabling rapid iteration without image rebuilds.

vla-eval run --config configs/libero_smoke_test.yaml --dev
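
To show why the editable install and the bind mount work together, here is a minimal sketch of how a --dev style flag might translate into docker run arguments. The function name docker_run_args, the image name, and the container path /workspace/src are hypothetical; the release notes do not state the real mount target.

```python
from pathlib import Path
from typing import List


def docker_run_args(
    image: str,
    dev: bool,
    host_src: Path,
    container_src: str = "/workspace/src",  # assumed path, not from the release notes
) -> List[str]:
    """Build a `docker run` argv, optionally bind-mounting the host source tree."""
    args = ["docker", "run", "--rm"]
    if dev:
        # -v HOST:CONTAINER overlays the host directory onto the container path.
        # With an editable install, the package resolves to that directory,
        # so host edits take effect without rebuilding the image.
        args += ["-v", f"{host_src.resolve()}:{container_src}"]
    args.append(image)
    return args


if __name__ == "__main__":
    print(docker_run_args("example/benchmark:latest", dev=True, host_src=Path("src")))
```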

Rich CLI Output

Colored pass/fail/skip symbols, category headers, and status lines now appear across all CLI output. Color auto-disables when NO_COLOR is set, when TERM=dumb, or when the output is not a TTY (for example, when piped). rich is lazy-imported to keep CLI startup fast.
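
The checks described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: color_enabled and make_console are hypothetical names, and the only third-party call used is rich's Console(no_color=...).

```python
import os
import sys
from typing import IO, Mapping, Optional


def color_enabled(
    stream: Optional[IO[str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True only when colored output is appropriate for `stream`."""
    stream = sys.stdout if stream is None else stream
    env = os.environ if env is None else env
    if env.get("NO_COLOR"):  # NO_COLOR convention: any non-empty value disables color
        return False
    if env.get("TERM") == "dumb":
        return False
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())  # False when piped or redirected


def make_console():
    """Create a rich Console on demand; the import cost is paid only here."""
    from rich.console import Console  # lazy import keeps `--help` and friends fast

    return Console(no_color=not color_enabled())
```

Keeping the rich import inside make_console means commands that never print styled output do not load the library at all.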

Bug Fixes

  • Python 3.8 compatibility: dict[str, Any] → Dict[str, Any], TypeAlias guarded behind TYPE_CHECKING; fixes import errors in 6 benchmark Docker images (LIBERO, CALVIN, RoboCerebra, RLBench)
  • RLBench Dockerfile: extract BuildKit-only heredoc to standalone rlbench_entrypoint.sh
  • RLBench entrypoint: replace sleep with Xvfb socket polling for reliable startup
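
For readers unfamiliar with the socket-polling approach in the last fix: an X server listening on display N conventionally creates a Unix socket at /tmp/.X11-unix/XN, so a startup script can wait for that path instead of sleeping for a fixed time. The actual fix lives in the shell entrypoint; the Python sketch below only illustrates the idea, and wait_for_x_socket, its timeout, and its poll interval are assumptions.

```python
import os
import time


def wait_for_x_socket(
    display: int,
    timeout: float = 10.0,
    interval: float = 0.1,
    socket_dir: str = "/tmp/.X11-unix",
) -> bool:
    """Poll until the X server socket for `display` exists or `timeout` elapses."""
    path = os.path.join(socket_dir, f"X{display}")
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True  # server socket is present; safe to start X clients
        if time.monotonic() >= deadline:
            return False  # caller decides whether to abort or retry
        time.sleep(interval)
```

Compared with a fixed sleep, polling returns as soon as the server is ready and fails explicitly when it never comes up.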

CI & Leaderboard

  • Merged redundant sync-external.yml into update-data.yml
  • Added --source filter input to update-data workflow dispatch
  • Fixed unquoted ${{ }} YAML expressions that broke workflow parsing
  • All leaderboard scripts now write structured summaries to $GITHUB_OUTPUT for informative PR bodies
  • Synced external leaderboard scores (RoboChallenge + RoboArena)

Breaking Changes

  • vla-eval validate, vla-eval test-server, vla-eval test-benchmark have been removed. Use vla-eval test instead.

Contributors

@MilkClouds