34 changes: 34 additions & 0 deletions .claude/skills/add-benchmark/SKILL.md
@@ -121,6 +121,34 @@ class MyBenchmark(StepBenchmark):
- **Image preprocessing**: Handle non-standard images (flipped, wrong resolution) in `make_obs()`.
- **EGL headless rendering**: Add `os.environ.setdefault("PYOPENGL_PLATFORM", "egl")` at module top if the sim uses OpenGL.

### Optional: external dataset acquisition

If the benchmark needs licence-restricted scene or data files that can't ship in the Docker image (e.g. ToS-gated downloads), do the lazy fetch inside `_init_*()` / `reset()` using the shared primitives in `vla_eval.dirs`:

```python
from vla_eval.dirs import ensure_license

def _ensure_assets(self, data_path: Path) -> None:
if (data_path / "ready_marker").exists():
return
ensure_license(
"my-dataset-tos", # also accepts via --accept-license <id>
url="https://example.com/license",
description="My benchmark dataset ToS (~N GiB).",
)
data_path.mkdir(parents=True, exist_ok=True)
    # ... download into data_path with whatever helper your sim provides,
    # then write the marker so later runs skip the fetch:
    (data_path / "ready_marker").touch()
```

`ensure_license` reads stdin in interactive contexts and falls back to the `VLA_EVAL_ACCEPTED_LICENSES` env var (forwarded by `vla-eval run --accept-license <id>`). The eval YAML's volume mount should resolve the host path with the same XDG-aware precedence so `vla-eval run` and the in-container fetch agree:

```yaml
volumes:
- "${oc.env:VLA_EVAL_ASSETS_CACHE,${oc.env:VLA_EVAL_HOME,${oc.env:XDG_CACHE_HOME,${oc.env:HOME}/.cache}/vla-eval}/assets}/<bench>:<container_data_path>"
```

Reference: `Behavior1KBenchmark._ensure_assets()` in `benchmarks/behavior1k/benchmark.py`.

## 3. Create config YAML

Create `configs/<name>_eval.yaml`:
@@ -186,6 +214,12 @@ vla-eval test --validate # validate all config import strings
vla-eval test -c configs/<name>_eval.yaml # smoke-test (1 episode, EchoModelServer, no GPU needed — requires Docker + image)
```

**Don't add `tests/test_<name>_benchmark.py` with mocked sim modules.**
`tests/` is for harness mechanics, not per-sim integration. Fake
`omnigibson` / `sapien` / `mujoco` modules drift from upstream each
release and miss the real bugs (import paths, action encoding,
physics determinism). Verify via the smoke test above.

## Reference implementations

| Benchmark | File | Key patterns |
7 changes: 7 additions & 0 deletions .claude/skills/add-model-server/SKILL.md
@@ -224,6 +224,13 @@ make test # existing tests still pass
vla-eval test -c configs/model_servers/<name>.yaml # smoke-test (starts server, sends dummy obs, checks response — requires uv + GPU + model weights)
```

**Don't add `tests/test_<name>_server.py` with mocked model libraries.**
`tests/` is for harness mechanics, not per-model integration. Fake
`transformers` / `torch.nn` / custom inference libs drift from upstream
each release and miss the real bugs (tokenizer versions,
checkpoint-format drift, action denormalisation). Verify via the
smoke test above.

## Reference implementations

| Model | File | Key patterns |
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -69,7 +69,7 @@ Every PR triggers lint, type-check, and test jobs automatically (`.github/workfl
```
src/vla_eval/
├── cli/ # CLI entry point (argparse)
├── benchmarks/ # Benchmark adapters (LIBERO, LIBERO-Pro, CALVIN, ManiSkill2, SimplerEnv, RoboCasa, VLABench, MIKASA-Robo, RoboTwin, RLBench, RoboCerebra)
├── benchmarks/ # Benchmark adapters (LIBERO + LIBERO-Pro/Plus/Mem, CALVIN, ManiSkill2, SimplerEnv, RoboCasa, VLABench, MIKASA-Robo, RoboTwin, RLBench, RoboCerebra, RoboMME, MolmoSpaces, Kinetix, BEHAVIOR-1K)
├── model_servers/ # Model server ABCs, utilities, and implementations
├── runners/ # Episode execution loops (sync, async)
├── results/ # Result collection and shard merging
12 changes: 8 additions & 4 deletions README.md
@@ -9,7 +9,7 @@

| | |
|:--|:--|
| **Benchmarks** | [![LIBERO](https://img.shields.io/badge/LIBERO-✓-teal)](configs/libero_all.yaml) [![SimplerEnv](https://img.shields.io/badge/SimplerEnv-✓-teal)](configs/simpler_all_tasks.yaml) [![CALVIN](https://img.shields.io/badge/CALVIN-✓-teal)](configs/calvin_eval.yaml) [![ManiSkill2](https://img.shields.io/badge/ManiSkill2-◇-blue)](configs/maniskill2_eval.yaml) [![LIBERO-Pro](https://img.shields.io/badge/LIBERO--Pro-◇-blue)](configs/libero_pro_eval.yaml) [![LIBERO-Plus](https://img.shields.io/badge/LIBERO--Plus-✓-teal)](configs/libero_plus_spatial.yaml) [![RoboCasa](https://img.shields.io/badge/RoboCasa-◇-blue)](configs/robocasa_eval.yaml) [![VLABench](https://img.shields.io/badge/VLABench-◇-blue)](configs/vlabench_eval.yaml) [![MIKASA-Robo](https://img.shields.io/badge/MIKASA--Robo-◇-blue)](configs/mikasa_eval.yaml) [![RoboTwin](https://img.shields.io/badge/RoboTwin-◇-blue)](configs/robotwin_eval.yaml) [![RLBench](https://img.shields.io/badge/RLBench-◇-blue)](configs/rlbench_eval.yaml) [![RoboCerebra](https://img.shields.io/badge/RoboCerebra-◇-blue)](configs/robocerebra_eval.yaml) [![LIBERO-Mem](https://img.shields.io/badge/LIBERO--Mem-◇-blue)](configs/libero_mem.yaml) ![BEHAVIOR-1K](https://img.shields.io/badge/BEHAVIOR--1K-·-lightgrey) [![Kinetix](https://img.shields.io/badge/Kinetix-◇-blue)](configs/kinetix_eval.yaml) [![RoboMME](https://img.shields.io/badge/RoboMME-✓-teal)](configs/robomme_eval.yaml) [![MolmoSpaces-Bench](https://img.shields.io/badge/MolmoSpaces--Bench-✓-teal)](configs/molmospaces_pick_and_place.yaml) ![FurnitureBench](https://img.shields.io/badge/FurnitureBench-·-lightgrey) |
| **Benchmarks** | [![LIBERO](https://img.shields.io/badge/LIBERO-✓-teal)](configs/libero_all.yaml) [![SimplerEnv](https://img.shields.io/badge/SimplerEnv-✓-teal)](configs/simpler_all_tasks.yaml) [![CALVIN](https://img.shields.io/badge/CALVIN-✓-teal)](configs/calvin_eval.yaml) [![ManiSkill2](https://img.shields.io/badge/ManiSkill2-◇-blue)](configs/maniskill2_eval.yaml) [![LIBERO-Pro](https://img.shields.io/badge/LIBERO--Pro-◇-blue)](configs/libero_pro_eval.yaml) [![LIBERO-Plus](https://img.shields.io/badge/LIBERO--Plus-✓-teal)](configs/libero_plus_spatial.yaml) [![RoboCasa](https://img.shields.io/badge/RoboCasa-◇-blue)](configs/robocasa_eval.yaml) [![VLABench](https://img.shields.io/badge/VLABench-◇-blue)](configs/vlabench_eval.yaml) [![MIKASA-Robo](https://img.shields.io/badge/MIKASA--Robo-◇-blue)](configs/mikasa_eval.yaml) [![RoboTwin](https://img.shields.io/badge/RoboTwin-◇-blue)](configs/robotwin_eval.yaml) [![RLBench](https://img.shields.io/badge/RLBench-◇-blue)](configs/rlbench_eval.yaml) [![RoboCerebra](https://img.shields.io/badge/RoboCerebra-◇-blue)](configs/robocerebra_eval.yaml) [![LIBERO-Mem](https://img.shields.io/badge/LIBERO--Mem-◇-blue)](configs/libero_mem.yaml) [![BEHAVIOR-1K](https://img.shields.io/badge/BEHAVIOR--1K-◇-blue)](configs/behavior1k_eval.yaml) [![Kinetix](https://img.shields.io/badge/Kinetix-◇-blue)](configs/kinetix_eval.yaml) [![RoboMME](https://img.shields.io/badge/RoboMME-✓-teal)](configs/robomme_eval.yaml) [![MolmoSpaces-Bench](https://img.shields.io/badge/MolmoSpaces--Bench-✓-teal)](configs/molmospaces_pick_and_place.yaml) ![FurnitureBench](https://img.shields.io/badge/FurnitureBench-·-lightgrey) |
| **Models (official)** | [![OpenVLA](https://img.shields.io/badge/OpenVLA-✓-8B5CF6)](configs/model_servers/openvla.yaml) [![π₀](https://img.shields.io/badge/π₀-✓-8B5CF6)](configs/model_servers/pi0_libero.yaml) [![π₀-FAST](https://img.shields.io/badge/π₀--FAST-✓-8B5CF6)](configs/model_servers/pi0_libero.yaml) [![GR00T N1.6](https://img.shields.io/badge/GR00T_N1.6-✓-8B5CF6)](configs/model_servers/groot.yaml) [![OFT](https://img.shields.io/badge/OFT-✓-8B5CF6)](configs/model_servers/oft_libero.yaml) [![X-VLA](https://img.shields.io/badge/X--VLA-✓-8B5CF6)](configs/model_servers/xvla_libero.yaml) [![CogACT](https://img.shields.io/badge/CogACT-◇-blue)](configs/model_servers/cogact.yaml) [![RTC](https://img.shields.io/badge/RTC-◇-blue)](configs/model_servers/rtc_kinetix.yaml) [![VLANeXt](https://img.shields.io/badge/VLANeXt-✓-8B5CF6)](configs/model_servers/vlanext/libero_spatial.yaml) [![MolmoBot](https://img.shields.io/badge/MolmoBot-✓-8B5CF6)](configs/model_servers/molmobot/droid.yaml) ![MemVLA](https://img.shields.io/badge/MemVLA-·-lightgrey) |
| **Models ([dexbotic](https://github.com/dexmal/dexbotic))** ![stars](https://img.shields.io/github/stars/dexmal/dexbotic?style=social) | [![DB-CogACT](https://img.shields.io/badge/DB--CogACT-✓-8B5CF6)](configs/model_servers/dexbotic_cogact_libero.yaml) |
| **Models ([starVLA](https://github.com/starVLA/starVLA))** ![stars](https://img.shields.io/github/stars/starVLA/starVLA?style=social) | [![QwenGR00T](https://img.shields.io/badge/QwenGR00T-✓-8B5CF6)](configs/model_servers/starvla_groot_simpler.yaml) [![QwenOFT](https://img.shields.io/badge/QwenOFT-✓-8B5CF6)](configs/model_servers/starvla_oft_simpler.yaml) [![QwenPI](https://img.shields.io/badge/QwenPI-◇-blue)](configs/model_servers/starvla_pi_simpler.yaml) [![QwenFAST](https://img.shields.io/badge/QwenFAST-✓-8B5CF6)](configs/model_servers/starvla_fast_simpler.yaml) |
@@ -150,7 +150,7 @@ All benchmark environments are packaged as standalone Docker images based on `ba
| Image | Size | Benchmark | Python | Base |
|-------|------|-----------|--------|------|
| [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) | 3.3 GB | — | — | `nvidia/cuda:12.1.1-runtime-ubuntu22.04` |
| [`rlbench`](https://ghcr.io/allenai/vla-evaluation-harness/rlbench) | 4.7 GB | RLBench | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| `rlbench` 🔒 | 4.7 GB | RLBench | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`simpler`](https://ghcr.io/allenai/vla-evaluation-harness/simpler) | 4.9 GB | SimplerEnv | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`libero`](https://ghcr.io/allenai/vla-evaluation-harness/libero) | 6.0 GB | LIBERO | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`libero-pro`](https://ghcr.io/allenai/vla-evaluation-harness/libero-pro) | 6.2 GB | LIBERO-Pro | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
@@ -163,10 +163,13 @@ All benchmark environments are packaged as standalone Docker images based on `ba
| [`libero-plus`](https://ghcr.io/allenai/vla-evaluation-harness/libero-plus) | 14.8 GB | LIBERO-Plus | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`robomme`](https://ghcr.io/allenai/vla-evaluation-harness/robomme) | 17.0 GB | RoboMME | 3.11 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`vlabench`](https://ghcr.io/allenai/vla-evaluation-harness/vlabench) | 17.7 GB | VLABench | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| `behavior1k` 🔒 | 23.6 GB | BEHAVIOR-1K | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`robotwin`](https://ghcr.io/allenai/vla-evaluation-harness/robotwin) | 28.6 GB | RoboTwin 2.0 | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`molmospaces`](https://ghcr.io/allenai/vla-evaluation-harness/molmospaces) | 31.4 GB | MolmoSpaces-Bench | 3.11 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`robocasa`](https://ghcr.io/allenai/vla-evaluation-harness/robocasa) | 35.6 GB | RoboCasa | 3.11 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |

<sub>🔒 = build-locally only; the Dockerfile gates the build behind a licence opt-in (`docker/build.sh <name> --accept-license <name>`) and the image isn't published to ghcr.io.</sub>

**Pull** (recommended):

```bash
docker pull ghcr.io/allenai/vla-evaluation-harness/libero:latest
```

@@ -176,8 +179,9 @@
**Build locally** (see [docker/build.sh](docker/build.sh)):

```bash
docker/build.sh # build all (base first, then benchmarks)
docker/build.sh libero # build one
docker/build.sh # build all (gated images skipped)
docker/build.sh libero # build one
docker/build.sh behavior1k --accept-license behavior1k # build a gated image
```

---
44 changes: 44 additions & 0 deletions configs/behavior1k_eval.yaml
@@ -0,0 +1,44 @@
# BEHAVIOR-1K (OmniGibson / Isaac Sim) — 50-task household-activity suite.
#
# The first run prompts on stdin to accept the BEHAVIOR Dataset ToS and then downloads ~35 GiB of OmniGibson
# scene + task data into the asset cache (``$VLA_EVAL_ASSETS_CACHE`` if set, else ``$VLA_EVAL_HOME/assets``,
# else ``$XDG_CACHE_HOME/vla-eval/assets``, else ``~/.cache/vla-eval/assets``). Pass
# ``--accept-license behavior-dataset-tos`` to skip the prompt in non-interactive contexts (CI, sharded
# runs). An NVIDIA GPU with Vulkan + EGL is required.
server:
url: "ws://localhost:8000"

docker:
image: ghcr.io/allenai/vla-evaluation-harness/behavior1k:latest
env:
- "NVIDIA_DRIVER_CAPABILITIES=all"
- "OMNIGIBSON_HEADLESS=1"
- "OMNI_KIT_ACCEPT_EULA=YES"
# Pin Isaac Sim/Vulkan to a single NVIDIA ICD. Without this both the
# base image's baked-in /usr/share/vulkan/icd.d/nvidia_icd.json and
# the nvidia-container-toolkit-injected /etc/vulkan/icd.d/nvidia_icd.json
# are visible at runtime; that triggers a "Multiple ICDs for the same
# GPU" error and a segfault deep in omni.kit.xr on first launch.
- "VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json"
volumes:
# OmniGibson reads ``gm.DATA_PATH=/app/BEHAVIOR-1K/datasets`` at import time. The host path mirrors
# ``vla_eval.dirs.assets_cache``'s precedence so ``vla-eval run`` and the in-container fetch agree.
# Mounted writable so the first-run download can populate the cache; subsequent runs are read-only
# in practice.
- "${oc.env:VLA_EVAL_ASSETS_CACHE,${oc.env:VLA_EVAL_HOME,${oc.env:XDG_CACHE_HOME,${oc.env:HOME}/.cache}/vla-eval}/assets}/behavior1k:/app/BEHAVIOR-1K/datasets"

output_dir: "./results"

benchmarks:
- benchmark: "vla_eval.benchmarks.behavior1k.benchmark:Behavior1KBenchmark"
subname: turning_on_radio
mode: sync
episodes_per_task: 1
params:
tasks:
- turning_on_radio
partial_scene_load: true
send_proprio: false
max_steps: 2000
task_instance_id: 1
action_dim: 23
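# A typical invocation, assuming the gated image has been built locally.
# Both commands are illustrative (see the README "Build locally" section
# and the --accept-license note above); flags are assumptions based on
# the documented `vla-eval` CLI:
#
#   docker/build.sh behavior1k --accept-license behavior1k
#   vla-eval run -c configs/behavior1k_eval.yaml --accept-license behavior-dataset-tos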
7 changes: 7 additions & 0 deletions configs/model_servers/behavior1k/baseline.yaml
@@ -0,0 +1,7 @@
# BEHAVIOR-1K — zero-action baseline (R1Pro 23-D).
# Mirrors the default LocalPolicy(action_dim=23) baseline used by the
# official OmniGibson eval script when no policy weights are provided.
script: "src/vla_eval/model_servers/behavior1k_baseline.py"
args:
action_dim: 23
port: 8000
13 changes: 13 additions & 0 deletions configs/model_servers/behavior1k/demo_replay.yaml
@@ -0,0 +1,13 @@
# BEHAVIOR-1K — demo-replay model server (LeRobot v2.1 parquet).
# Replays the recorded action stream from an annotated human-teleop
# episode. Used to verify that the env wiring (action space, success
# detection, observation cameras) matches the released dataset before
# touching real model weights.
#
# Replace ``demo_path`` with a path to a single-episode parquet file
# from the BEHAVIOR Dataset's LeRobot v2.1 release, e.g.:
# /data/behavior_dataset/turning_on_radio/episode_001.parquet
script: "src/vla_eval/model_servers/behavior1k_demo_replay.py"
args:
demo_path: "/data/behavior_dataset/turning_on_radio/episode_001.parquet"
port: 8000