Merged
6 changes: 6 additions & 0 deletions .claude/skills/add-benchmark/SKILL.md
@@ -186,6 +186,12 @@
```bash
vla-eval test --validate                     # validate all config import strings
vla-eval test -c configs/<name>_eval.yaml # smoke-test (1 episode, EchoModelServer, no GPU needed — requires Docker + image)
```

**Don't add `tests/test_<name>_benchmark.py` with mocked sim modules.**
`tests/` is for harness mechanics, not per-sim integration. Fake
`omnigibson` / `sapien` / `mujoco` modules drift from upstream each
release and miss the real bugs (import paths, action encoding,
physics determinism). Verify via the smoke test above.
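
For shape, a minimal smoke-testable config mirrors the existing benchmark configs in `configs/` — this sketch uses placeholder names (`<name>` / `<Name>` are not a real adapter):

```yaml
server:
  url: "ws://localhost:8000"

docker:
  image: ghcr.io/allenai/vla-evaluation-harness/<name>:latest

output_dir: "./results"

benchmarks:
  - benchmark: "vla_eval.benchmarks.<name>.benchmark:<Name>Benchmark"
    mode: sync
    episodes_per_task: 1
```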

## Reference implementations

| Benchmark | File | Key patterns |
7 changes: 7 additions & 0 deletions .claude/skills/add-model-server/SKILL.md
@@ -224,6 +224,13 @@
```bash
make test                                    # existing tests still pass
vla-eval test -c configs/model_servers/<name>.yaml # smoke-test (starts server, sends dummy obs, checks response — requires uv + GPU + model weights)
```

**Don't add `tests/test_<name>_server.py` with mocked model libraries.**
`tests/` is for harness mechanics, not per-model integration. Fake
`transformers` / `torch.nn` / custom inference libs drift from upstream
each release and miss the real bugs (tokenizer versions,
checkpoint-format drift, action denormalisation). Verify via the
smoke test above.
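
For shape, a minimal server config mirrors the existing ones under `configs/model_servers/` — this sketch uses a placeholder path (`<name>` is not a real server):

```yaml
script: "src/vla_eval/model_servers/<name>_server.py"
args:
  port: 8000
```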

## Reference implementations

| Model | File | Key patterns |
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -69,7 +69,7 @@
Every PR triggers lint, type-check, and test jobs automatically (`.github/workflows/`).
```
src/vla_eval/
├── cli/ # CLI entry point (argparse)
├── benchmarks/ # Benchmark adapters (LIBERO, LIBERO-Pro, CALVIN, ManiSkill2, SimplerEnv, RoboCasa, VLABench, MIKASA-Robo, RoboTwin, RLBench, RoboCerebra)
├── benchmarks/ # Benchmark adapters (LIBERO + LIBERO-Pro/Plus/Mem, CALVIN, ManiSkill2, SimplerEnv, RoboCasa, VLABench, MIKASA-Robo, RoboTwin, RLBench, RoboCerebra, RoboMME, MolmoSpaces, Kinetix, BEHAVIOR-1K)
├── model_servers/ # Model server ABCs, utilities, and implementations
├── runners/ # Episode execution loops (sync, async)
├── results/ # Result collection and shard merging
12 changes: 8 additions & 4 deletions README.md
@@ -9,7 +9,7 @@

| | |
|:--|:--|
| **Benchmarks** | [![LIBERO](https://img.shields.io/badge/LIBERO-✓-teal)](configs/libero_all.yaml) [![SimplerEnv](https://img.shields.io/badge/SimplerEnv-✓-teal)](configs/simpler_all_tasks.yaml) [![CALVIN](https://img.shields.io/badge/CALVIN-✓-teal)](configs/calvin_eval.yaml) [![ManiSkill2](https://img.shields.io/badge/ManiSkill2-◇-blue)](configs/maniskill2_eval.yaml) [![LIBERO-Pro](https://img.shields.io/badge/LIBERO--Pro-◇-blue)](configs/libero_pro_eval.yaml) [![LIBERO-Plus](https://img.shields.io/badge/LIBERO--Plus-✓-teal)](configs/libero_plus_spatial.yaml) [![RoboCasa](https://img.shields.io/badge/RoboCasa-◇-blue)](configs/robocasa_eval.yaml) [![VLABench](https://img.shields.io/badge/VLABench-◇-blue)](configs/vlabench_eval.yaml) [![MIKASA-Robo](https://img.shields.io/badge/MIKASA--Robo-◇-blue)](configs/mikasa_eval.yaml) [![RoboTwin](https://img.shields.io/badge/RoboTwin-◇-blue)](configs/robotwin_eval.yaml) [![RLBench](https://img.shields.io/badge/RLBench-◇-blue)](configs/rlbench_eval.yaml) [![RoboCerebra](https://img.shields.io/badge/RoboCerebra-◇-blue)](configs/robocerebra_eval.yaml) [![LIBERO-Mem](https://img.shields.io/badge/LIBERO--Mem-◇-blue)](configs/libero_mem.yaml) ![BEHAVIOR-1K](https://img.shields.io/badge/BEHAVIOR--1K-·-lightgrey) [![Kinetix](https://img.shields.io/badge/Kinetix-◇-blue)](configs/kinetix_eval.yaml) [![RoboMME](https://img.shields.io/badge/RoboMME-✓-teal)](configs/robomme_eval.yaml) [![MolmoSpaces-Bench](https://img.shields.io/badge/MolmoSpaces--Bench-✓-teal)](configs/molmospaces_pick_and_place.yaml) ![FurnitureBench](https://img.shields.io/badge/FurnitureBench-·-lightgrey) |
| **Benchmarks** | [![LIBERO](https://img.shields.io/badge/LIBERO-✓-teal)](configs/libero_all.yaml) [![SimplerEnv](https://img.shields.io/badge/SimplerEnv-✓-teal)](configs/simpler_all_tasks.yaml) [![CALVIN](https://img.shields.io/badge/CALVIN-✓-teal)](configs/calvin_eval.yaml) [![ManiSkill2](https://img.shields.io/badge/ManiSkill2-◇-blue)](configs/maniskill2_eval.yaml) [![LIBERO-Pro](https://img.shields.io/badge/LIBERO--Pro-◇-blue)](configs/libero_pro_eval.yaml) [![LIBERO-Plus](https://img.shields.io/badge/LIBERO--Plus-✓-teal)](configs/libero_plus_spatial.yaml) [![RoboCasa](https://img.shields.io/badge/RoboCasa-◇-blue)](configs/robocasa_eval.yaml) [![VLABench](https://img.shields.io/badge/VLABench-◇-blue)](configs/vlabench_eval.yaml) [![MIKASA-Robo](https://img.shields.io/badge/MIKASA--Robo-◇-blue)](configs/mikasa_eval.yaml) [![RoboTwin](https://img.shields.io/badge/RoboTwin-◇-blue)](configs/robotwin_eval.yaml) [![RLBench](https://img.shields.io/badge/RLBench-◇-blue)](configs/rlbench_eval.yaml) [![RoboCerebra](https://img.shields.io/badge/RoboCerebra-◇-blue)](configs/robocerebra_eval.yaml) [![LIBERO-Mem](https://img.shields.io/badge/LIBERO--Mem-◇-blue)](configs/libero_mem.yaml) [![BEHAVIOR-1K](https://img.shields.io/badge/BEHAVIOR--1K-◇-blue)](configs/behavior1k_eval.yaml) [![Kinetix](https://img.shields.io/badge/Kinetix-◇-blue)](configs/kinetix_eval.yaml) [![RoboMME](https://img.shields.io/badge/RoboMME-✓-teal)](configs/robomme_eval.yaml) [![MolmoSpaces-Bench](https://img.shields.io/badge/MolmoSpaces--Bench-✓-teal)](configs/molmospaces_pick_and_place.yaml) ![FurnitureBench](https://img.shields.io/badge/FurnitureBench-·-lightgrey) |
| **Models (official)** | [![OpenVLA](https://img.shields.io/badge/OpenVLA-✓-8B5CF6)](configs/model_servers/openvla.yaml) [![π₀](https://img.shields.io/badge/π₀-✓-8B5CF6)](configs/model_servers/pi0_libero.yaml) [![π₀-FAST](https://img.shields.io/badge/π₀--FAST-✓-8B5CF6)](configs/model_servers/pi0_libero.yaml) [![GR00T N1.6](https://img.shields.io/badge/GR00T_N1.6-✓-8B5CF6)](configs/model_servers/groot.yaml) [![OFT](https://img.shields.io/badge/OFT-✓-8B5CF6)](configs/model_servers/oft_libero.yaml) [![X-VLA](https://img.shields.io/badge/X--VLA-✓-8B5CF6)](configs/model_servers/xvla_libero.yaml) [![CogACT](https://img.shields.io/badge/CogACT-◇-blue)](configs/model_servers/cogact.yaml) [![RTC](https://img.shields.io/badge/RTC-◇-blue)](configs/model_servers/rtc_kinetix.yaml) [![VLANeXt](https://img.shields.io/badge/VLANeXt-✓-8B5CF6)](configs/model_servers/vlanext/libero_spatial.yaml) [![MolmoBot](https://img.shields.io/badge/MolmoBot-✓-8B5CF6)](configs/model_servers/molmobot/droid.yaml) ![MemVLA](https://img.shields.io/badge/MemVLA-·-lightgrey) |
| **Models ([dexbotic](https://github.com/dexmal/dexbotic))** ![stars](https://img.shields.io/github/stars/dexmal/dexbotic?style=social) | [![DB-CogACT](https://img.shields.io/badge/DB--CogACT-✓-8B5CF6)](configs/model_servers/dexbotic_cogact_libero.yaml) |
| **Models ([starVLA](https://github.com/starVLA/starVLA))** ![stars](https://img.shields.io/github/stars/starVLA/starVLA?style=social) | [![QwenGR00T](https://img.shields.io/badge/QwenGR00T-✓-8B5CF6)](configs/model_servers/starvla_groot_simpler.yaml) [![QwenOFT](https://img.shields.io/badge/QwenOFT-✓-8B5CF6)](configs/model_servers/starvla_oft_simpler.yaml) [![QwenPI](https://img.shields.io/badge/QwenPI-◇-blue)](configs/model_servers/starvla_pi_simpler.yaml) [![QwenFAST](https://img.shields.io/badge/QwenFAST-✓-8B5CF6)](configs/model_servers/starvla_fast_simpler.yaml) |
@@ -150,7 +150,7 @@
All benchmark environments are packaged as standalone Docker images based on `base`:
| Image | Size | Benchmark | Python | Base |
|-------|------|-----------|--------|------|
| [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) | 3.3 GB | — | — | `nvidia/cuda:12.1.1-runtime-ubuntu22.04` |
| [`rlbench`](https://ghcr.io/allenai/vla-evaluation-harness/rlbench) | 4.7 GB | RLBench | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| `rlbench` 🔒 | 4.7 GB | RLBench | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`simpler`](https://ghcr.io/allenai/vla-evaluation-harness/simpler) | 4.9 GB | SimplerEnv | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`libero`](https://ghcr.io/allenai/vla-evaluation-harness/libero) | 6.0 GB | LIBERO | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`libero-pro`](https://ghcr.io/allenai/vla-evaluation-harness/libero-pro) | 6.2 GB | LIBERO-Pro | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
@@ -163,10 +163,13 @@
| [`libero-plus`](https://ghcr.io/allenai/vla-evaluation-harness/libero-plus) | 14.8 GB | LIBERO-Plus | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`robomme`](https://ghcr.io/allenai/vla-evaluation-harness/robomme) | 17.0 GB | RoboMME | 3.11 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`vlabench`](https://ghcr.io/allenai/vla-evaluation-harness/vlabench) | 17.7 GB | VLABench | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| `behavior1k` 🔒 | 23.6 GB | BEHAVIOR-1K | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`robotwin`](https://ghcr.io/allenai/vla-evaluation-harness/robotwin) | 28.6 GB | RoboTwin 2.0 | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`molmospaces`](https://ghcr.io/allenai/vla-evaluation-harness/molmospaces) | 31.4 GB | MolmoSpaces-Bench | 3.11 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`robocasa`](https://ghcr.io/allenai/vla-evaluation-harness/robocasa) | 35.6 GB | RoboCasa | 3.11 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |

<sub>🔒 = build-locally only; the Dockerfile gates the build behind a licence opt-in (`docker/build.sh <name> --accept-license <name>`) and the image isn't published to ghcr.io.</sub>

**Pull** (recommended):

```bash
docker pull ghcr.io/allenai/vla-evaluation-harness/libero:latest
```
**Build locally** (see [docker/build.sh](docker/build.sh)):

```bash
docker/build.sh # build all (base first, then benchmarks)
docker/build.sh libero # build one
docker/build.sh # build all (gated images skipped)
docker/build.sh libero # build one
docker/build.sh behavior1k --accept-license behavior1k # build a gated image
```

---
43 changes: 43 additions & 0 deletions configs/behavior1k_eval.yaml
@@ -0,0 +1,43 @@
# BEHAVIOR-1K (OmniGibson / Isaac Sim) — 50-task household-activity suite.
#
# Before running, edit the dataset volume below to point at your local
# BEHAVIOR-1K data directory (the one populated by the three
# ``download_*`` calls documented in docs/reproductions/behavior1k.md).
# An NVIDIA GPU with Vulkan + EGL is required.
server:
url: "ws://localhost:8000"

docker:
image: ghcr.io/allenai/vla-evaluation-harness/behavior1k:latest
env:
- "NVIDIA_DRIVER_CAPABILITIES=all"
- "OMNIGIBSON_HEADLESS=1"
- "OMNI_KIT_ACCEPT_EULA=YES"
# Pin Isaac Sim/Vulkan to a single NVIDIA ICD. Without this both the
# base image's baked-in /usr/share/vulkan/icd.d/nvidia_icd.json and
# the nvidia-container-toolkit-injected /etc/vulkan/icd.d/nvidia_icd.json
# are visible at runtime; that triggers a "Multiple ICDs for the same
# GPU" error and a segfault deep in omni.kit.xr on first launch.
- "VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json"
volumes:
# OmniGibson reads gm.DATA_PATH=/app/BEHAVIOR-1K/datasets at import time.
# Replace the host side with the directory holding
# ``omnigibson-robot-assets/``, ``behavior-1k-assets/``, and
# ``2025-challenge-task-instances/``.
- "/data/og_data:/app/BEHAVIOR-1K/datasets:ro"

output_dir: "./results"

benchmarks:
- benchmark: "vla_eval.benchmarks.behavior1k.benchmark:Behavior1KBenchmark"
subname: turning_on_radio
mode: sync
episodes_per_task: 1
params:
tasks:
- turning_on_radio
partial_scene_load: true
send_proprio: false
max_steps: 2000
task_instance_id: 1
action_dim: 23
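
Before the first run, the host-side mount layout can be sanity-checked — a small sketch using the directory names from the comments above (`/data/og_data` is the example host path from this config, not a fixed location):

```python
# Report which of the expected BEHAVIOR-1K dataset directories are
# missing under the host path that gets mounted at
# /app/BEHAVIOR-1K/datasets. Names come from the volume comment above.
from pathlib import Path

REQUIRED = (
    "omnigibson-robot-assets",
    "behavior-1k-assets",
    "2025-challenge-task-instances",
)

def missing_dirs(root: str = "/data/og_data") -> list:
    """Return the required dataset subdirectories absent under `root`."""
    return [d for d in REQUIRED if not (Path(root) / d).is_dir()]
```

An empty return value means the mount should satisfy OmniGibson's import-time `gm.DATA_PATH` lookup.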
7 changes: 7 additions & 0 deletions configs/model_servers/behavior1k/baseline.yaml
@@ -0,0 +1,7 @@
# BEHAVIOR-1K — zero-action baseline (R1Pro 23-D).
# Mirrors the default LocalPolicy(action_dim=23) baseline used by the
# official OmniGibson eval script when no policy weights are provided.
script: "src/vla_eval/model_servers/behavior1k_baseline.py"
args:
action_dim: 23
port: 8000
13 changes: 13 additions & 0 deletions configs/model_servers/behavior1k/demo_replay.yaml
@@ -0,0 +1,13 @@
# BEHAVIOR-1K — demo-replay model server (LeRobot v2.1 parquet).
# Replays the recorded action stream from an annotated human-teleop
# episode. Used to verify that the env wiring (action space, success
# detection, observation cameras) matches the released dataset before
# touching real model weights.
#
# Replace ``demo_path`` with a path to a single-episode parquet file
# from the BEHAVIOR Dataset's LeRobot v2.1 release, e.g.:
# /data/behavior_dataset/turning_on_radio/episode_001.parquet
script: "src/vla_eval/model_servers/behavior1k_demo_replay.py"
args:
demo_path: "/data/behavior_dataset/turning_on_radio/episode_001.parquet"
port: 8000
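
Internally, a replay server of this kind boils down to loading the episode's action column and handing actions out one per step — a hedged sketch (the `action` column name and single-episode-per-file layout are assumptions based on the comments above, not a verified schema):

```python
# Load a recorded action stream from a LeRobot-style parquet episode
# and expose it as a plain list of per-step action vectors.
import pandas as pd

def actions_from_frame(df: pd.DataFrame) -> list:
    # One row per timestep; each cell holds a fixed-length action vector.
    return [list(a) for a in df["action"]]

def load_episode(demo_path: str) -> list:
    # LeRobot v2.1 stores episodes as parquet files.
    return actions_from_frame(pd.read_parquet(demo_path))
```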
120 changes: 120 additions & 0 deletions docker/Dockerfile.behavior1k
@@ -0,0 +1,120 @@
# BEHAVIOR-1K — OmniGibson on NVIDIA Isaac Sim (https://behavior.stanford.edu)
#
# Heavy image: pulls Isaac Sim wheels (~12 GB) and the BEHAVIOR-1K
# source tree. The dataset itself (~10 GB) is NOT baked in; mount it
# at runtime under /app/BEHAVIOR-1K/datasets.
#
# Hardware requirements: NVIDIA GPU (RTX 2070+), 8 GB+ VRAM, Vulkan ICD.

ARG BASE_IMAGE=ghcr.io/allenai/vla-evaluation-harness/base:latest
FROM ${BASE_IMAGE}

# Build-time license confirmation. The user must explicitly opt in
# the same way Stanford's setup.sh requires --accept-nvidia-eula.
ARG ACCEPT_NVIDIA_EULA=
RUN if [ "$ACCEPT_NVIDIA_EULA" != "YES" ]; then \
echo ""; \
echo "============================================================"; \
echo "Building BEHAVIOR-1K requires accepting two licenses:"; \
echo " 1. NVIDIA Isaac Sim EULA"; \
echo " https://docs.omniverse.nvidia.com/eula/"; \
echo " 2. BEHAVIOR Dataset Terms of Service (at runtime, when"; \
echo " you download/mount the encrypted scene+object bundle)"; \
echo ""; \
echo "Read the EULAs above, then re-run with:"; \
echo " docker build --build-arg ACCEPT_NVIDIA_EULA=YES ..."; \
echo " (or: docker/build.sh behavior1k --accept-nvidia-eula)"; \
echo "============================================================"; \
exit 1; \
fi

ENV OMNIGIBSON_HEADLESS=1 \
OMNI_KIT_ACCEPT_EULA=YES \
ACCEPT_EULA=Y \
PRIVACY_CONSENT=Y

# ── Conda environment (Python 3.10 — required by Isaac Sim 4.5.0) ──
RUN conda create -n behavior python=3.10 -y && conda clean -afy
SHELL ["conda", "run", "-n", "behavior", "/bin/bash", "-c"]

# ── Pre-reqs the v3.7.2 setup.sh enforces before installing OmniGibson ─
RUN uv pip install --no-cache-dir "numpy<2" "setuptools<=79"

# ── PyTorch 2.6.0 + CUDA 12.4 (matches BEHAVIOR-1K v3.7.2 setup.sh) ─
RUN uv pip install --no-cache-dir \
"torch==2.6.0" "torchvision==0.21.0" "torchaudio==2.6.0" \
--index-url https://download.pytorch.org/whl/cu124

# ── Isaac Sim 4.5.0 from the NVIDIA pip index ───────────────────────
# Full package list (26 wheels) mirrors v3.7.2 setup.sh `install_isaac_packages`.
# Installing only the metapackage (isaacsim) leaves
# `isaacsim.simulation_app` unimportable at runtime.
RUN uv pip install --no-cache-dir \
"omniverse-kit==106.5.0.162521" \
"isaacsim-kernel==4.5.0.0" \
"isaacsim-app==4.5.0.0" \
"isaacsim-core==4.5.0.0" \
"isaacsim-gui==4.5.0.0" \
"isaacsim-utils==4.5.0.0" \
"isaacsim-storage==4.5.0.0" \
"isaacsim-asset==4.5.0.0" \
"isaacsim-sensor==4.5.0.0" \
"isaacsim-robot-motion==4.5.0.0" \
"isaacsim-robot==4.5.0.0" \
"isaacsim-benchmark==4.5.0.0" \
"isaacsim-code-editor==4.5.0.0" \
"isaacsim-ros1==4.5.0.0" \
"isaacsim-cortex==4.5.0.0" \
"isaacsim-example==4.5.0.0" \
"isaacsim-replicator==4.5.0.0" \
"isaacsim-rl==4.5.0.0" \
"isaacsim-robot-setup==4.5.0.0" \
"isaacsim-ros2==4.5.0.0" \
"isaacsim-template==4.5.0.0" \
"isaacsim-test==4.5.0.0" \
"isaacsim==4.5.0.0" \
"isaacsim-extscache-physics==4.5.0.0" \
"isaacsim-extscache-kit==4.5.0.0" \
"isaacsim-extscache-kit-sdk==4.5.0.0" \
--extra-index-url https://pypi.nvidia.com

# Fix the bundled-websockets conflict the v3.7.2 setup.sh patches:
# Isaac Sim's pip_prebundle/websockets shadows our model-server websockets.
# The site-packages path is deterministic, so a plain `find` does the job
# without booting isaacsim (which can't import in a non-GPU build context).
RUN find /opt/conda/envs/behavior/lib/python3.10/site-packages/isaacsim/extscache \
-type d -name websockets -path "*/pip_prebundle/*" \
-exec rm -rf {} + 2>/dev/null || true
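
The same shadowing concern can be checked from inside the built image — a small sketch (the `pip_prebundle` path fragment mirrors the comment above; run it under `conda run -n behavior python`):

```python
# Detect whether a module would resolve from a vendored Isaac Sim
# "pip_prebundle" directory rather than the env's own site-packages.
import importlib.util

def resolves_from_prebundle(module_name: str) -> bool:
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        return False
    return "pip_prebundle" in spec.origin
```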

# ── Clone BEHAVIOR-1K (OmniGibson + bddl3 + joylo/gello) ───────────
# Use plain `pip install -e` (not `uv pip install -e`): BEHAVIOR-1K's
# legacy setuptools layouts (bddl3, OmniGibson, joylo) are not PEP 660
# compliant in a way uv accepts.
ARG BEHAVIOR1K_REF=v3.7.2
RUN git clone --depth 1 --branch ${BEHAVIOR1K_REF} \
https://github.com/StanfordVL/BEHAVIOR-1K.git /app/BEHAVIOR-1K
RUN cd /app/BEHAVIOR-1K && pip install --no-cache-dir -e ./bddl3
RUN cd /app/BEHAVIOR-1K && pip install --no-cache-dir -e "./OmniGibson[eval]"
RUN cd /app/BEHAVIOR-1K && pip install --no-cache-dir -e ./joylo
# Match setup.sh: cffi must be force-reinstalled to 1.17.1 (Isaac Sim
# bundles a build that conflicts with the conda libffi otherwise).
RUN pip install --no-cache-dir --force-reinstall cffi==1.17.1
# OmniGibson + lerobot transitive deps drag numpy back up to 2.x even
# though the early pre-req step pinned <2. Isaac Sim's bundled OGN
# nodes still call np.float_ (removed in numpy 2.0) and crash at scene
# init. Force-downgrade at the very end with --no-deps so we don't
# disturb other resolved versions.
RUN pip install --no-cache-dir --no-deps "numpy<2"
RUN rm -rf /app/BEHAVIOR-1K/.git
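
The invariant the final `--no-deps` pin restores is that the resolved numpy stays below 2.0 — a tiny helper for that major-version check (at build time you would feed it `importlib.metadata.version("numpy")`):

```python
# Major-version gate for the numpy<2 pin: "1.26.4" -> pre-2, "2.1.0" -> not.
def is_pre2(version_string: str) -> bool:
    return int(version_string.split(".")[0]) < 2
```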

# ── Install evaluation harness ─────────────────────────────────────
WORKDIR /workspace
COPY pyproject.toml README.md ./
COPY src/ src/
ARG HARNESS_VERSION=0.0.0
ENV SETUPTOOLS_SCM_PRETEND_VERSION=${HARNESS_VERSION}
RUN uv pip install --no-cache-dir -e .
COPY configs/ configs/

ENTRYPOINT ["conda", "run", "--no-capture-output", "-n", "behavior", "vla-eval"]
CMD ["run", "--config", "/workspace/configs/behavior1k_eval.yaml"]