# Release v0.0.2
## Highlights
### `vla-eval test` — Unified Smoke Test CLI
The three separate commands (`validate`, `test-server`, `test-benchmark`) have been replaced with a single `vla-eval test` entry point.
```shell
vla-eval test              # config validation only (fast, safe default)
vla-eval test --all        # run all categories
vla-eval test --server     # model server smoke tests only
vla-eval test --benchmark  # benchmark smoke tests only
vla-eval test --list       # show inventory + readiness
```

- Parallel GPU execution: `--parallel [N]` with automatic GPU slot allocation
- Graceful Ctrl+C: prints partial results + auto-saves stderr logs to `results/smoke-logs/`
- Smoke step cap: `max_steps=50` prevents timeouts on slow benchmarks
- `--fail-fast` flag and `--list` readiness overview
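The "automatic GPU slot allocation" behind `--parallel` can be pictured as a pool of GPU indices handed out to workers as they become free. A minimal sketch of that pattern, assuming a hypothetical `run_with_gpu_slots` helper (not the project's actual code):

```python
import queue
import threading

def run_with_gpu_slots(tasks, num_gpus):
    """Run (name, fn) tasks in parallel, each pinned to a free GPU slot.

    Hypothetical sketch: each task callable receives a GPU index, and
    the slot is returned to the pool when the task finishes, even if
    it raises.
    """
    slots = queue.Queue()
    for gpu in range(num_gpus):
        slots.put(gpu)

    results = {}
    lock = threading.Lock()

    def worker(name, fn):
        gpu = slots.get()       # block until a GPU slot is free
        try:
            out = fn(gpu)
        finally:
            slots.put(gpu)      # always release the slot
        with lock:
            results[name] = out

    threads = [threading.Thread(target=worker, args=t) for t in tasks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With fewer slots than tasks, excess workers simply block on the queue until a GPU frees up, which is the behavior you want for smoke tests that each need exclusive GPU access.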
### Docker Dev Mode
All 13 benchmark Dockerfiles now use an editable install (`uv pip install -e .`). The new `--dev` flag on `vla-eval run` bind-mounts the host `src/` into the container, enabling rapid iteration without image rebuilds.
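Conceptually, `--dev` just adds a bind mount over the editable-installed package. A hedged sketch of how a CLI might assemble the `docker run` arguments (helper name, container path, and argument layout are all assumptions, not the project's actual implementation):

```python
import os

def build_docker_args(image, config_path, dev=False, host_src="src"):
    """Assemble a docker run command; with dev=True, bind-mount host src/.

    Illustrative only: because the image installed the package with
    `pip install -e .`, mounting the host source tree over the in-container
    source makes host edits take effect without rebuilding the image.
    """
    args = ["docker", "run", "--rm"]
    if dev:
        args += ["-v", f"{os.path.abspath(host_src)}:/workspace/src"]
    args += [image, "vla-eval", "run", "--config", config_path]
    return args
```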
```shell
vla-eval run --config configs/libero_smoke_test.yaml --dev
```

### Rich CLI Output
Colored pass/fail/skip symbols, category headers, and status lines across all CLI output. Colors auto-disable when output is piped (`NO_COLOR`, `isatty`, `TERM=dumb`). `rich` is lazy-imported to keep CLI startup fast.
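The auto-disable checks and the lazy import follow standard CLI conventions; a minimal sketch of both (function names are illustrative, not the project's API):

```python
import os
import sys

def colors_enabled(stream=sys.stdout):
    """Decide whether to emit ANSI colors.

    Disabled when the NO_COLOR env var is set (any value), when
    TERM=dumb, or when the stream is not a TTY (e.g. piped output).
    """
    if os.environ.get("NO_COLOR") is not None:
        return False
    if os.environ.get("TERM", "") == "dumb":
        return False
    return bool(hasattr(stream, "isatty") and stream.isatty())

def get_console():
    """Import rich only when a console is first requested, so commands
    that never print styled output pay no import cost at startup."""
    from rich.console import Console  # deferred import
    return Console()
```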
## Bug Fixes
- Python 3.8 compatibility: `dict[str, Any]` → `Dict[str, Any]`, `TypeAlias` guarded behind `TYPE_CHECKING` — fixes import errors in 6 benchmark Docker images (LIBERO, CALVIN, RoboCerebra, RLBench)
- RLBench Dockerfile: extracted the BuildKit-only heredoc into a standalone `rlbench_entrypoint.sh`
- RLBench entrypoint: replaced `sleep` with Xvfb socket polling for reliable startup
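The Python 3.8 fix above follows the standard pattern: `Dict[str, Any]` from `typing` works on 3.8 (the builtin generic `dict[str, Any]` needs 3.9+), and `TypeAlias` (added in 3.10) is only imported under `TYPE_CHECKING` so older interpreters never evaluate it. A small self-contained illustration (type names are made up for the example):

```python
from typing import TYPE_CHECKING, Any, Dict

if TYPE_CHECKING:
    # Evaluated only by type checkers; on Python 3.8 this import would
    # fail at runtime (TypeAlias landed in typing in 3.10).
    from typing import TypeAlias

    BenchmarkResult: "TypeAlias" = Dict[str, Any]

def count_passed(results: "Dict[str, bool]") -> int:
    # Dict[...] from typing subscripts fine on 3.8; dict[...] would raise
    # TypeError there if evaluated at runtime.
    return sum(1 for ok in results.values() if ok)
```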
## CI & Leaderboard
- Merged the redundant `sync-external.yml` into `update-data.yml`
- Added a `--source` filter input to the `update-data` workflow dispatch
- Fixed unquoted `${{ }}` YAML expressions that broke workflow parsing
- All leaderboard scripts now write structured summaries to `$GITHUB_OUTPUT` for informative PR bodies
- Synced external leaderboard scores (RoboChallenge + RoboArena)
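Writing step outputs works by appending `key=value` lines to the file GitHub Actions exposes via the `GITHUB_OUTPUT` environment variable; later steps read them as `steps.<id>.outputs.<key>`. A minimal sketch of such a helper (function name assumed, not the project's actual script):

```python
import os

def write_step_output(key, value):
    """Append a key=value pair to the GitHub Actions step-output file.

    GITHUB_OUTPUT holds a file path inside Actions runners; appending
    (not truncating) lets several calls accumulate outputs. Returns
    False (a no-op) when run outside of CI.
    """
    path = os.environ.get("GITHUB_OUTPUT")
    if not path:
        return False
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(f"{key}={value}\n")
    return True
```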
## Breaking Changes
`vla-eval validate`, `vla-eval test-server`, and `vla-eval test-benchmark` have been removed. Use `vla-eval test` instead.