34 changes: 34 additions & 0 deletions .claude/skills/add-benchmark/SKILL.md
@@ -121,6 +121,34 @@ class MyBenchmark(StepBenchmark):
- **Image preprocessing**: Handle non-standard images (flipped, wrong resolution) in `make_obs()`.
- **EGL headless rendering**: Add `os.environ.setdefault("PYOPENGL_PLATFORM", "egl")` at module top if the sim uses OpenGL.

### Optional: external dataset acquisition

If the benchmark needs licence-restricted scene or data files that can't ship in the Docker image (e.g. ToS-gated downloads), do the lazy fetch inside `_init_*()` / `reset()` using the shared primitives in `vla_eval.dirs`:

```python
from vla_eval.dirs import ensure_license

def _ensure_assets(self, data_path: Path) -> None:
if (data_path / "ready_marker").exists():
return
ensure_license(
"my-dataset-tos", # also accepts via --accept-license <id>
url="https://example.com/license",
description="My benchmark dataset ToS (~N GiB).",
)
data_path.mkdir(parents=True, exist_ok=True)
    # ... download into data_path with whatever helper your sim provides,
    # then write the marker so later runs skip the fetch:
    (data_path / "ready_marker").touch()
```

`ensure_license` reads stdin in interactive contexts and falls back to the `VLA_EVAL_ACCEPTED_LICENSES` env var (forwarded by `vla-eval run --accept-license <id>`). The eval YAML's volume mount should resolve the host path with the same XDG-aware precedence so `vla-eval run` and the in-container fetch agree:

```yaml
volumes:
- "${oc.env:VLA_EVAL_ASSETS_CACHE,${oc.env:VLA_EVAL_HOME,${oc.env:XDG_CACHE_HOME,${oc.env:HOME}/.cache}/vla-eval}/assets}/<bench>:<container_data_path>"
```

Reference: `Behavior1KBenchmark._ensure_assets()` in `benchmarks/behavior1k/benchmark.py`.

## 3. Create config YAML

Create `configs/<name>_eval.yaml`:
@@ -186,6 +214,12 @@ vla-eval test --validate # validate all config import strings
vla-eval test -c configs/<name>_eval.yaml # smoke-test (1 episode, EchoModelServer, no GPU needed — requires Docker + image)
```

**Don't add `tests/test_<name>_benchmark.py` with mocked sim modules.**
`tests/` is for harness mechanics, not per-sim integration. Fake
`omnigibson` / `sapien` / `mujoco` modules drift from upstream each
release and miss the real bugs (import paths, action encoding,
physics determinism). Verify via the smoke test above.

## Reference implementations

| Benchmark | File | Key patterns |
7 changes: 7 additions & 0 deletions .claude/skills/add-model-server/SKILL.md
@@ -224,6 +224,13 @@ make test # existing tests still pass
vla-eval test -c configs/model_servers/<name>.yaml # smoke-test (starts server, sends dummy obs, checks response — requires uv + GPU + model weights)
```

**Don't add `tests/test_<name>_server.py` with mocked model libraries.**
`tests/` is for harness mechanics, not per-model integration. Fake
`transformers` / `torch.nn` / custom inference libs drift from upstream
each release and miss the real bugs (tokenizer versions,
checkpoint-format drift, action denormalisation). Verify via the
smoke test above.

## Reference implementations

| Model | File | Key patterns |
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -69,7 +69,7 @@ Every PR triggers lint, type-check, and test jobs automatically (`.github/workfl
```
src/vla_eval/
├── cli/ # CLI entry point (argparse)
├── benchmarks/ # Benchmark adapters (LIBERO, LIBERO-Pro, CALVIN, ManiSkill2, SimplerEnv, RoboCasa, VLABench, MIKASA-Robo, RoboTwin, RLBench, RoboCerebra)
├── benchmarks/ # Benchmark adapters (LIBERO + LIBERO-Pro/Plus/Mem, CALVIN, ManiSkill2, SimplerEnv, RoboCasa, VLABench, MIKASA-Robo, RoboTwin, RLBench, RoboCerebra, RoboMME, MolmoSpaces, Kinetix, BEHAVIOR-1K)
├── model_servers/ # Model server ABCs, utilities, and implementations
├── runners/ # Episode execution loops (sync, async)
├── results/ # Result collection and shard merging
12 changes: 8 additions & 4 deletions README.md
@@ -9,7 +9,7 @@

| | |
|:--|:--|
| **Benchmarks** | [![LIBERO](https://img.shields.io/badge/LIBERO-✓-teal)](configs/libero_all.yaml) [![SimplerEnv](https://img.shields.io/badge/SimplerEnv-✓-teal)](configs/simpler_all_tasks.yaml) [![CALVIN](https://img.shields.io/badge/CALVIN-✓-teal)](configs/calvin_eval.yaml) [![ManiSkill2](https://img.shields.io/badge/ManiSkill2-◇-blue)](configs/maniskill2_eval.yaml) [![LIBERO-Pro](https://img.shields.io/badge/LIBERO--Pro-◇-blue)](configs/libero_pro_eval.yaml) [![LIBERO-Plus](https://img.shields.io/badge/LIBERO--Plus-✓-teal)](configs/libero_plus_spatial.yaml) [![RoboCasa](https://img.shields.io/badge/RoboCasa-◇-blue)](configs/robocasa_eval.yaml) [![VLABench](https://img.shields.io/badge/VLABench-◇-blue)](configs/vlabench_eval.yaml) [![MIKASA-Robo](https://img.shields.io/badge/MIKASA--Robo-◇-blue)](configs/mikasa_eval.yaml) [![RoboTwin](https://img.shields.io/badge/RoboTwin-◇-blue)](configs/robotwin_eval.yaml) [![RLBench](https://img.shields.io/badge/RLBench-◇-blue)](configs/rlbench_eval.yaml) [![RoboCerebra](https://img.shields.io/badge/RoboCerebra-◇-blue)](configs/robocerebra_eval.yaml) [![LIBERO-Mem](https://img.shields.io/badge/LIBERO--Mem-◇-blue)](configs/libero_mem.yaml) ![BEHAVIOR-1K](https://img.shields.io/badge/BEHAVIOR--1K-·-lightgrey) [![Kinetix](https://img.shields.io/badge/Kinetix-◇-blue)](configs/kinetix_eval.yaml) [![RoboMME](https://img.shields.io/badge/RoboMME-✓-teal)](configs/robomme_eval.yaml) [![MolmoSpaces-Bench](https://img.shields.io/badge/MolmoSpaces--Bench-✓-teal)](configs/molmospaces_pick_and_place.yaml) ![FurnitureBench](https://img.shields.io/badge/FurnitureBench-·-lightgrey) |
| **Benchmarks** | [![LIBERO](https://img.shields.io/badge/LIBERO-✓-teal)](configs/libero_all.yaml) [![SimplerEnv](https://img.shields.io/badge/SimplerEnv-✓-teal)](configs/simpler_all_tasks.yaml) [![CALVIN](https://img.shields.io/badge/CALVIN-✓-teal)](configs/calvin_eval.yaml) [![ManiSkill2](https://img.shields.io/badge/ManiSkill2-◇-blue)](configs/maniskill2_eval.yaml) [![LIBERO-Pro](https://img.shields.io/badge/LIBERO--Pro-◇-blue)](configs/libero_pro_eval.yaml) [![LIBERO-Plus](https://img.shields.io/badge/LIBERO--Plus-✓-teal)](configs/libero_plus_spatial.yaml) [![RoboCasa](https://img.shields.io/badge/RoboCasa-◇-blue)](configs/robocasa_eval.yaml) [![VLABench](https://img.shields.io/badge/VLABench-◇-blue)](configs/vlabench_eval.yaml) [![MIKASA-Robo](https://img.shields.io/badge/MIKASA--Robo-◇-blue)](configs/mikasa_eval.yaml) [![RoboTwin](https://img.shields.io/badge/RoboTwin-◇-blue)](configs/robotwin_eval.yaml) [![RLBench](https://img.shields.io/badge/RLBench-◇-blue)](configs/rlbench_eval.yaml) [![RoboCerebra](https://img.shields.io/badge/RoboCerebra-◇-blue)](configs/robocerebra_eval.yaml) [![LIBERO-Mem](https://img.shields.io/badge/LIBERO--Mem-◇-blue)](configs/libero_mem.yaml) [![BEHAVIOR-1K](https://img.shields.io/badge/BEHAVIOR--1K-◇-blue)](configs/behavior1k_eval.yaml) [![Kinetix](https://img.shields.io/badge/Kinetix-◇-blue)](configs/kinetix_eval.yaml) [![RoboMME](https://img.shields.io/badge/RoboMME-✓-teal)](configs/robomme_eval.yaml) [![MolmoSpaces-Bench](https://img.shields.io/badge/MolmoSpaces--Bench-✓-teal)](configs/molmospaces_pick_and_place.yaml) ![FurnitureBench](https://img.shields.io/badge/FurnitureBench-·-lightgrey) |
| **Models (official)** | [![OpenVLA](https://img.shields.io/badge/OpenVLA-✓-8B5CF6)](configs/model_servers/openvla.yaml) [![π₀](https://img.shields.io/badge/π₀-✓-8B5CF6)](configs/model_servers/pi0_libero.yaml) [![π₀-FAST](https://img.shields.io/badge/π₀--FAST-✓-8B5CF6)](configs/model_servers/pi0_libero.yaml) [![GR00T N1.6](https://img.shields.io/badge/GR00T_N1.6-✓-8B5CF6)](configs/model_servers/groot.yaml) [![OFT](https://img.shields.io/badge/OFT-✓-8B5CF6)](configs/model_servers/oft_libero.yaml) [![X-VLA](https://img.shields.io/badge/X--VLA-✓-8B5CF6)](configs/model_servers/xvla_libero.yaml) [![CogACT](https://img.shields.io/badge/CogACT-◇-blue)](configs/model_servers/cogact.yaml) [![RTC](https://img.shields.io/badge/RTC-◇-blue)](configs/model_servers/rtc_kinetix.yaml) [![VLANeXt](https://img.shields.io/badge/VLANeXt-✓-8B5CF6)](configs/model_servers/vlanext/libero_spatial.yaml) [![MolmoBot](https://img.shields.io/badge/MolmoBot-✓-8B5CF6)](configs/model_servers/molmobot/droid.yaml) ![MemVLA](https://img.shields.io/badge/MemVLA-·-lightgrey) |
| **Models ([dexbotic](https://github.com/dexmal/dexbotic))** ![stars](https://img.shields.io/github/stars/dexmal/dexbotic?style=social) | [![DB-CogACT](https://img.shields.io/badge/DB--CogACT-✓-8B5CF6)](configs/model_servers/dexbotic_cogact_libero.yaml) |
| **Models ([starVLA](https://github.com/starVLA/starVLA))** ![stars](https://img.shields.io/github/stars/starVLA/starVLA?style=social) | [![QwenGR00T](https://img.shields.io/badge/QwenGR00T-✓-8B5CF6)](configs/model_servers/starvla_groot_simpler.yaml) [![QwenOFT](https://img.shields.io/badge/QwenOFT-✓-8B5CF6)](configs/model_servers/starvla_oft_simpler.yaml) [![QwenPI](https://img.shields.io/badge/QwenPI-◇-blue)](configs/model_servers/starvla_pi_simpler.yaml) [![QwenFAST](https://img.shields.io/badge/QwenFAST-✓-8B5CF6)](configs/model_servers/starvla_fast_simpler.yaml) |
@@ -150,7 +150,7 @@ All benchmark environments are packaged as standalone Docker images based on `ba
| Image | Size | Benchmark | Python | Base |
|-------|------|-----------|--------|------|
| [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) | 3.3 GB | — | — | `nvidia/cuda:12.1.1-runtime-ubuntu22.04` |
| [`rlbench`](https://ghcr.io/allenai/vla-evaluation-harness/rlbench) | 4.7 GB | RLBench | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| `rlbench` 🔒 | 4.7 GB | RLBench | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`simpler`](https://ghcr.io/allenai/vla-evaluation-harness/simpler) | 4.9 GB | SimplerEnv | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`libero`](https://ghcr.io/allenai/vla-evaluation-harness/libero) | 6.0 GB | LIBERO | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`libero-pro`](https://ghcr.io/allenai/vla-evaluation-harness/libero-pro) | 6.2 GB | LIBERO-Pro | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
@@ -163,10 +163,13 @@ All benchmark environments are packaged as standalone Docker images based on `ba
| [`libero-plus`](https://ghcr.io/allenai/vla-evaluation-harness/libero-plus) | 14.8 GB | LIBERO-Plus | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`robomme`](https://ghcr.io/allenai/vla-evaluation-harness/robomme) | 17.0 GB | RoboMME | 3.11 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`vlabench`](https://ghcr.io/allenai/vla-evaluation-harness/vlabench) | 17.7 GB | VLABench | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| `behavior1k` 🔒 | 23.6 GB | BEHAVIOR-1K | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`robotwin`](https://ghcr.io/allenai/vla-evaluation-harness/robotwin) | 28.6 GB | RoboTwin 2.0 | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`molmospaces`](https://ghcr.io/allenai/vla-evaluation-harness/molmospaces) | 31.4 GB | MolmoSpaces-Bench | 3.11 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`robocasa`](https://ghcr.io/allenai/vla-evaluation-harness/robocasa) | 35.6 GB | RoboCasa | 3.11 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |

<sub>🔒 = build-locally only; the Dockerfile gates the build behind a licence opt-in (`docker/build.sh <name> --accept-license <name>`) and the image isn't published to ghcr.io.</sub>

**Pull** (recommended):

```bash
docker pull ghcr.io/allenai/vla-evaluation-harness/libero:latest
```

@@ -176,8 +179,9 @@
**Build locally** (see [docker/build.sh](docker/build.sh)):

```bash
docker/build.sh # build all (base first, then benchmarks)
docker/build.sh libero # build one
docker/build.sh # build all (gated images skipped)
docker/build.sh libero # build one
docker/build.sh behavior1k --accept-license behavior1k # build a gated image
```

---
44 changes: 44 additions & 0 deletions configs/behavior1k_eval.yaml
@@ -0,0 +1,44 @@
# BEHAVIOR-1K (OmniGibson / Isaac Sim) — 50-task household-activity suite.
#
# The first run prompts on stdin to accept the BEHAVIOR Dataset ToS and then downloads ~35 GiB of OmniGibson
# scene + task data into the asset cache (``$VLA_EVAL_ASSETS_CACHE`` if set, else ``$VLA_EVAL_HOME/assets``,
# else ``$XDG_CACHE_HOME/vla-eval/assets``, else ``~/.cache/vla-eval/assets``). Pass
# ``--accept-license behavior-dataset-tos`` to skip the prompt in non-interactive contexts (CI, sharded
# runs). An NVIDIA GPU with Vulkan + EGL is required.
server:
url: "ws://localhost:8000"

docker:
image: ghcr.io/allenai/vla-evaluation-harness/behavior1k:latest
env:
- "NVIDIA_DRIVER_CAPABILITIES=all"
- "OMNIGIBSON_HEADLESS=1"
- "OMNI_KIT_ACCEPT_EULA=YES"
# Pin Isaac Sim/Vulkan to a single NVIDIA ICD. Without this both the
# base image's baked-in /usr/share/vulkan/icd.d/nvidia_icd.json and
# the nvidia-container-toolkit-injected /etc/vulkan/icd.d/nvidia_icd.json
# are visible at runtime; that triggers a "Multiple ICDs for the same
# GPU" error and a segfault deep in omni.kit.xr on first launch.
- "VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json"
volumes:
# OmniGibson reads ``gm.DATA_PATH=/app/BEHAVIOR-1K/datasets`` at import time. The host path mirrors
# ``vla_eval.dirs.assets_cache``'s precedence so ``vla-eval run`` and the in-container fetch agree.
# Mounted writable so the first-run download can populate the cache; subsequent runs are read-only
# in practice.
- "${oc.env:VLA_EVAL_ASSETS_CACHE,${oc.env:VLA_EVAL_HOME,${oc.env:XDG_CACHE_HOME,${oc.env:HOME}/.cache}/vla-eval}/assets}/behavior1k:/app/BEHAVIOR-1K/datasets"

output_dir: "./results"

benchmarks:
- benchmark: "vla_eval.benchmarks.behavior1k.benchmark:Behavior1KBenchmark"
subname: turning_on_radio
mode: sync
episodes_per_task: 1
params:
tasks:
- turning_on_radio
partial_scene_load: true
send_proprio: false
max_steps: 2000
task_instance_id: 1
action_dim: 23
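# A typical invocation, assuming the gated image has been built locally.
# Both commands are illustrative (see the README "Build locally" section
# and the --accept-license note above); flags are assumptions based on
# the documented `vla-eval` CLI:
#
#   docker/build.sh behavior1k --accept-license behavior1k
#   vla-eval run -c configs/behavior1k_eval.yaml --accept-license behavior-dataset-tos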
7 changes: 7 additions & 0 deletions configs/model_servers/behavior1k/baseline.yaml
@@ -0,0 +1,7 @@
# BEHAVIOR-1K — zero-action baseline (R1Pro 23-D).
# Mirrors the default LocalPolicy(action_dim=23) baseline used by the
# official OmniGibson eval script when no policy weights are provided.
script: "src/vla_eval/model_servers/behavior1k_baseline.py"
args:
action_dim: 23
port: 8000
13 changes: 13 additions & 0 deletions configs/model_servers/behavior1k/demo_replay.yaml
@@ -0,0 +1,13 @@
# BEHAVIOR-1K — demo-replay model server (LeRobot v2.1 parquet).
# Replays the recorded action stream from an annotated human-teleop
# episode. Used to verify that the env wiring (action space, success
# detection, observation cameras) matches the released dataset before
# touching real model weights.
#
# Replace ``demo_path`` with a path to a single-episode parquet file
# from the BEHAVIOR Dataset's LeRobot v2.1 release, e.g.:
# /data/behavior_dataset/turning_on_radio/episode_001.parquet
script: "src/vla_eval/model_servers/behavior1k_demo_replay.py"
args:
demo_path: "/data/behavior_dataset/turning_on_radio/episode_001.parquet"
port: 8000