Merged
6 changes: 6 additions & 0 deletions .claude/skills/add-benchmark/SKILL.md
@@ -186,6 +186,12 @@
```bash
vla-eval test --validate                     # validate all config import strings
vla-eval test -c configs/<name>_eval.yaml # smoke-test (1 episode, EchoModelServer, no GPU needed — requires Docker + image)
```

**Don't add `tests/test_<name>_benchmark.py` with mocked sim modules.**
`tests/` is for harness mechanics, not per-sim integration. Fake
`omnigibson` / `sapien` / `mujoco` modules drift from upstream each
release and miss the real bugs (import paths, action encoding,
physics determinism). Verify via the smoke test above.
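
For shape, a minimal smoke-testable config mirrors the existing benchmark configs in `configs/` — this sketch uses placeholder names (`<name>` / `<Name>` are not a real adapter):

```yaml
server:
  url: "ws://localhost:8000"

docker:
  image: ghcr.io/allenai/vla-evaluation-harness/<name>:latest

output_dir: "./results"

benchmarks:
  - benchmark: "vla_eval.benchmarks.<name>.benchmark:<Name>Benchmark"
    mode: sync
    episodes_per_task: 1
```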

## Reference implementations

| Benchmark | File | Key patterns |
7 changes: 7 additions & 0 deletions .claude/skills/add-model-server/SKILL.md
@@ -224,6 +224,13 @@
```bash
make test                                    # existing tests still pass
vla-eval test -c configs/model_servers/<name>.yaml # smoke-test (starts server, sends dummy obs, checks response — requires uv + GPU + model weights)
```

**Don't add `tests/test_<name>_server.py` with mocked model libraries.**
`tests/` is for harness mechanics, not per-model integration. Fake
`transformers` / `torch.nn` / custom inference libs drift from upstream
each release and miss the real bugs (tokenizer versions,
checkpoint-format drift, action denormalisation). Verify via the
smoke test above.
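
For shape, a minimal server config mirrors the existing ones under `configs/model_servers/` — this sketch uses a placeholder path (`<name>` is not a real server):

```yaml
script: "src/vla_eval/model_servers/<name>_server.py"
args:
  port: 8000
```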

## Reference implementations

| Model | File | Key patterns |
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -69,7 +69,7 @@
Every PR triggers lint, type-check, and test jobs automatically (`.github/workflows/`).
```
src/vla_eval/
├── cli/ # CLI entry point (argparse)
├── benchmarks/ # Benchmark adapters (LIBERO, LIBERO-Pro, CALVIN, ManiSkill2, SimplerEnv, RoboCasa, VLABench, MIKASA-Robo, RoboTwin, RLBench, RoboCerebra)
├── benchmarks/ # Benchmark adapters (LIBERO + LIBERO-Pro/Plus/Mem, CALVIN, ManiSkill2, SimplerEnv, RoboCasa, VLABench, MIKASA-Robo, RoboTwin, RLBench, RoboCerebra, RoboMME, MolmoSpaces, Kinetix, BEHAVIOR-1K)
├── model_servers/ # Model server ABCs, utilities, and implementations
├── runners/ # Episode execution loops (sync, async)
├── results/ # Result collection and shard merging
12 changes: 8 additions & 4 deletions README.md
@@ -9,7 +9,7 @@

| | |
|:--|:--|
| **Benchmarks** | [![LIBERO](https://img.shields.io/badge/LIBERO-✓-teal)](configs/libero_all.yaml) [![SimplerEnv](https://img.shields.io/badge/SimplerEnv-✓-teal)](configs/simpler_all_tasks.yaml) [![CALVIN](https://img.shields.io/badge/CALVIN-✓-teal)](configs/calvin_eval.yaml) [![ManiSkill2](https://img.shields.io/badge/ManiSkill2-◇-blue)](configs/maniskill2_eval.yaml) [![LIBERO-Pro](https://img.shields.io/badge/LIBERO--Pro-◇-blue)](configs/libero_pro_eval.yaml) [![LIBERO-Plus](https://img.shields.io/badge/LIBERO--Plus-✓-teal)](configs/libero_plus_spatial.yaml) [![RoboCasa](https://img.shields.io/badge/RoboCasa-◇-blue)](configs/robocasa_eval.yaml) [![VLABench](https://img.shields.io/badge/VLABench-◇-blue)](configs/vlabench_eval.yaml) [![MIKASA-Robo](https://img.shields.io/badge/MIKASA--Robo-◇-blue)](configs/mikasa_eval.yaml) [![RoboTwin](https://img.shields.io/badge/RoboTwin-◇-blue)](configs/robotwin_eval.yaml) [![RLBench](https://img.shields.io/badge/RLBench-◇-blue)](configs/rlbench_eval.yaml) [![RoboCerebra](https://img.shields.io/badge/RoboCerebra-◇-blue)](configs/robocerebra_eval.yaml) [![LIBERO-Mem](https://img.shields.io/badge/LIBERO--Mem-◇-blue)](configs/libero_mem.yaml) ![BEHAVIOR-1K](https://img.shields.io/badge/BEHAVIOR--1K-·-lightgrey) [![Kinetix](https://img.shields.io/badge/Kinetix-◇-blue)](configs/kinetix_eval.yaml) [![RoboMME](https://img.shields.io/badge/RoboMME-✓-teal)](configs/robomme_eval.yaml) [![MolmoSpaces-Bench](https://img.shields.io/badge/MolmoSpaces--Bench-✓-teal)](configs/molmospaces_pick_and_place.yaml) ![FurnitureBench](https://img.shields.io/badge/FurnitureBench-·-lightgrey) |
| **Benchmarks** | [![LIBERO](https://img.shields.io/badge/LIBERO-✓-teal)](configs/libero_all.yaml) [![SimplerEnv](https://img.shields.io/badge/SimplerEnv-✓-teal)](configs/simpler_all_tasks.yaml) [![CALVIN](https://img.shields.io/badge/CALVIN-✓-teal)](configs/calvin_eval.yaml) [![ManiSkill2](https://img.shields.io/badge/ManiSkill2-◇-blue)](configs/maniskill2_eval.yaml) [![LIBERO-Pro](https://img.shields.io/badge/LIBERO--Pro-◇-blue)](configs/libero_pro_eval.yaml) [![LIBERO-Plus](https://img.shields.io/badge/LIBERO--Plus-✓-teal)](configs/libero_plus_spatial.yaml) [![RoboCasa](https://img.shields.io/badge/RoboCasa-◇-blue)](configs/robocasa_eval.yaml) [![VLABench](https://img.shields.io/badge/VLABench-◇-blue)](configs/vlabench_eval.yaml) [![MIKASA-Robo](https://img.shields.io/badge/MIKASA--Robo-◇-blue)](configs/mikasa_eval.yaml) [![RoboTwin](https://img.shields.io/badge/RoboTwin-◇-blue)](configs/robotwin_eval.yaml) [![RLBench](https://img.shields.io/badge/RLBench-◇-blue)](configs/rlbench_eval.yaml) [![RoboCerebra](https://img.shields.io/badge/RoboCerebra-◇-blue)](configs/robocerebra_eval.yaml) [![LIBERO-Mem](https://img.shields.io/badge/LIBERO--Mem-◇-blue)](configs/libero_mem.yaml) [![BEHAVIOR-1K](https://img.shields.io/badge/BEHAVIOR--1K-◇-blue)](configs/behavior1k_eval.yaml) [![Kinetix](https://img.shields.io/badge/Kinetix-◇-blue)](configs/kinetix_eval.yaml) [![RoboMME](https://img.shields.io/badge/RoboMME-✓-teal)](configs/robomme_eval.yaml) [![MolmoSpaces-Bench](https://img.shields.io/badge/MolmoSpaces--Bench-✓-teal)](configs/molmospaces_pick_and_place.yaml) ![FurnitureBench](https://img.shields.io/badge/FurnitureBench-·-lightgrey) |
| **Models (official)** | [![OpenVLA](https://img.shields.io/badge/OpenVLA-✓-8B5CF6)](configs/model_servers/openvla.yaml) [![π₀](https://img.shields.io/badge/π₀-✓-8B5CF6)](configs/model_servers/pi0_libero.yaml) [![π₀-FAST](https://img.shields.io/badge/π₀--FAST-✓-8B5CF6)](configs/model_servers/pi0_libero.yaml) [![GR00T N1.6](https://img.shields.io/badge/GR00T_N1.6-✓-8B5CF6)](configs/model_servers/groot.yaml) [![OFT](https://img.shields.io/badge/OFT-✓-8B5CF6)](configs/model_servers/oft_libero.yaml) [![X-VLA](https://img.shields.io/badge/X--VLA-✓-8B5CF6)](configs/model_servers/xvla_libero.yaml) [![CogACT](https://img.shields.io/badge/CogACT-◇-blue)](configs/model_servers/cogact.yaml) [![RTC](https://img.shields.io/badge/RTC-◇-blue)](configs/model_servers/rtc_kinetix.yaml) [![VLANeXt](https://img.shields.io/badge/VLANeXt-✓-8B5CF6)](configs/model_servers/vlanext/libero_spatial.yaml) [![MolmoBot](https://img.shields.io/badge/MolmoBot-✓-8B5CF6)](configs/model_servers/molmobot/droid.yaml) ![MemVLA](https://img.shields.io/badge/MemVLA-·-lightgrey) |
| **Models ([dexbotic](https://github.com/dexmal/dexbotic))** ![stars](https://img.shields.io/github/stars/dexmal/dexbotic?style=social) | [![DB-CogACT](https://img.shields.io/badge/DB--CogACT-✓-8B5CF6)](configs/model_servers/dexbotic_cogact_libero.yaml) |
| **Models ([starVLA](https://github.com/starVLA/starVLA))** ![stars](https://img.shields.io/github/stars/starVLA/starVLA?style=social) | [![QwenGR00T](https://img.shields.io/badge/QwenGR00T-✓-8B5CF6)](configs/model_servers/starvla_groot_simpler.yaml) [![QwenOFT](https://img.shields.io/badge/QwenOFT-✓-8B5CF6)](configs/model_servers/starvla_oft_simpler.yaml) [![QwenPI](https://img.shields.io/badge/QwenPI-◇-blue)](configs/model_servers/starvla_pi_simpler.yaml) [![QwenFAST](https://img.shields.io/badge/QwenFAST-✓-8B5CF6)](configs/model_servers/starvla_fast_simpler.yaml) |
@@ -150,7 +150,7 @@
All benchmark environments are packaged as standalone Docker images based on `base`:
| Image | Size | Benchmark | Python | Base |
|-------|------|-----------|--------|------|
| [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) | 3.3 GB | — | — | `nvidia/cuda:12.1.1-runtime-ubuntu22.04` |
| [`rlbench`](https://ghcr.io/allenai/vla-evaluation-harness/rlbench) | 4.7 GB | RLBench | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| `rlbench` 🔒 | 4.7 GB | RLBench | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`simpler`](https://ghcr.io/allenai/vla-evaluation-harness/simpler) | 4.9 GB | SimplerEnv | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`libero`](https://ghcr.io/allenai/vla-evaluation-harness/libero) | 6.0 GB | LIBERO | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`libero-pro`](https://ghcr.io/allenai/vla-evaluation-harness/libero-pro) | 6.2 GB | LIBERO-Pro | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
@@ -163,10 +163,13 @@
| [`libero-plus`](https://ghcr.io/allenai/vla-evaluation-harness/libero-plus) | 14.8 GB | LIBERO-Plus | 3.8 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`robomme`](https://ghcr.io/allenai/vla-evaluation-harness/robomme) | 17.0 GB | RoboMME | 3.11 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`vlabench`](https://ghcr.io/allenai/vla-evaluation-harness/vlabench) | 17.7 GB | VLABench | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| `behavior1k` 🔒 | 23.6 GB | BEHAVIOR-1K | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`robotwin`](https://ghcr.io/allenai/vla-evaluation-harness/robotwin) | 28.6 GB | RoboTwin 2.0 | 3.10 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`molmospaces`](https://ghcr.io/allenai/vla-evaluation-harness/molmospaces) | 31.4 GB | MolmoSpaces-Bench | 3.11 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |
| [`robocasa`](https://ghcr.io/allenai/vla-evaluation-harness/robocasa) | 35.6 GB | RoboCasa | 3.11 | [`base`](https://ghcr.io/allenai/vla-evaluation-harness/base) |

<sub>🔒 = build-locally only; the Dockerfile gates the build behind a licence opt-in (`docker/build.sh <name> --accept-license <name>`) and the image isn't published to ghcr.io.</sub>

**Pull** (recommended):

```bash
docker pull ghcr.io/allenai/vla-evaluation-harness/libero:latest
```
**Build locally** (see [docker/build.sh](docker/build.sh)):

```bash
docker/build.sh # build all (base first, then benchmarks)
docker/build.sh libero # build one
docker/build.sh # build all (gated images skipped)
docker/build.sh libero # build one
docker/build.sh behavior1k --accept-license behavior1k # build a gated image
```

---
43 changes: 43 additions & 0 deletions configs/behavior1k_eval.yaml
@@ -0,0 +1,43 @@
# BEHAVIOR-1K (OmniGibson / Isaac Sim) — 50-task household-activity suite.
#
# Before running, edit the dataset volume below to point at your local
# BEHAVIOR-1K data directory (the one populated by the three
# ``download_*`` calls documented in docs/reproductions/behavior1k.md).
# An NVIDIA GPU with Vulkan + EGL is required.
server:
url: "ws://localhost:8000"

docker:
image: ghcr.io/allenai/vla-evaluation-harness/behavior1k:latest
env:
- "NVIDIA_DRIVER_CAPABILITIES=all"
- "OMNIGIBSON_HEADLESS=1"
- "OMNI_KIT_ACCEPT_EULA=YES"
# Pin Isaac Sim/Vulkan to a single NVIDIA ICD. Without this both the
# base image's baked-in /usr/share/vulkan/icd.d/nvidia_icd.json and
# the nvidia-container-toolkit-injected /etc/vulkan/icd.d/nvidia_icd.json
# are visible at runtime; that triggers a "Multiple ICDs for the same
# GPU" error and a segfault deep in omni.kit.xr on first launch.
- "VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json"
volumes:
# OmniGibson reads gm.DATA_PATH=/app/BEHAVIOR-1K/datasets at import time.
# Replace the host side with the directory holding
# ``omnigibson-robot-assets/``, ``behavior-1k-assets/``, and
# ``2025-challenge-task-instances/``.
- "/data/og_data:/app/BEHAVIOR-1K/datasets:ro"

output_dir: "./results"

benchmarks:
- benchmark: "vla_eval.benchmarks.behavior1k.benchmark:Behavior1KBenchmark"
subname: turning_on_radio
mode: sync
episodes_per_task: 1
params:
tasks:
- turning_on_radio
partial_scene_load: true
send_proprio: false
max_steps: 2000
task_instance_id: 1
action_dim: 23
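
Before the first run, the host-side mount layout can be sanity-checked — a small sketch using the directory names from the comments above (`/data/og_data` is the example host path from this config, not a fixed location):

```python
# Report which of the expected BEHAVIOR-1K dataset directories are
# missing under the host path that gets mounted at
# /app/BEHAVIOR-1K/datasets. Names come from the volume comment above.
from pathlib import Path

REQUIRED = (
    "omnigibson-robot-assets",
    "behavior-1k-assets",
    "2025-challenge-task-instances",
)

def missing_dirs(root: str = "/data/og_data") -> list:
    """Return the required dataset subdirectories absent under `root`."""
    return [d for d in REQUIRED if not (Path(root) / d).is_dir()]
```

An empty return value means the mount should satisfy OmniGibson's import-time `gm.DATA_PATH` lookup.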
7 changes: 7 additions & 0 deletions configs/model_servers/behavior1k/baseline.yaml
@@ -0,0 +1,7 @@
# BEHAVIOR-1K — zero-action baseline (R1Pro 23-D).
# Mirrors the default LocalPolicy(action_dim=23) baseline used by the
# official OmniGibson eval script when no policy weights are provided.
script: "src/vla_eval/model_servers/behavior1k_baseline.py"
args:
action_dim: 23
port: 8000
13 changes: 13 additions & 0 deletions configs/model_servers/behavior1k/demo_replay.yaml
@@ -0,0 +1,13 @@
# BEHAVIOR-1K — demo-replay model server (LeRobot v2.1 parquet).
# Replays the recorded action stream from an annotated human-teleop
# episode. Used to verify that the env wiring (action space, success
# detection, observation cameras) matches the released dataset before
# touching real model weights.
#
# Replace ``demo_path`` with a path to a single-episode parquet file
# from the BEHAVIOR Dataset's LeRobot v2.1 release, e.g.:
# /data/behavior_dataset/turning_on_radio/episode_001.parquet
script: "src/vla_eval/model_servers/behavior1k_demo_replay.py"
args:
demo_path: "/data/behavior_dataset/turning_on_radio/episode_001.parquet"
port: 8000
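
Internally, a replay server of this kind boils down to loading the episode's action column and handing actions out one per step — a hedged sketch (the `action` column name and single-episode-per-file layout are assumptions based on the comments above, not a verified schema):

```python
# Load a recorded action stream from a LeRobot-style parquet episode
# and expose it as a plain list of per-step action vectors.
import pandas as pd

def actions_from_frame(df: pd.DataFrame) -> list:
    # One row per timestep; each cell holds a fixed-length action vector.
    return [list(a) for a in df["action"]]

def load_episode(demo_path: str) -> list:
    # LeRobot v2.1 stores episodes as parquet files.
    return actions_from_frame(pd.read_parquet(demo_path))
```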
120 changes: 120 additions & 0 deletions docker/Dockerfile.behavior1k
@@ -0,0 +1,120 @@
# BEHAVIOR-1K — OmniGibson on NVIDIA Isaac Sim (https://behavior.stanford.edu)
#
# Heavy image: pulls Isaac Sim wheels (~12 GB) and the BEHAVIOR-1K
# source tree. The dataset itself (~10 GB) is NOT baked in; mount it
# at runtime under /app/BEHAVIOR-1K/datasets.
#
# Hardware requirements: NVIDIA GPU (RTX 2070+), 8 GB+ VRAM, Vulkan ICD.

ARG BASE_IMAGE=ghcr.io/allenai/vla-evaluation-harness/base:latest
FROM ${BASE_IMAGE}

# Build-time license confirmation. The user must explicitly opt in
# the same way Stanford's setup.sh requires --accept-nvidia-eula.
ARG ACCEPT_NVIDIA_EULA=
RUN if [ "$ACCEPT_NVIDIA_EULA" != "YES" ]; then \
echo ""; \
echo "============================================================"; \
echo "Building BEHAVIOR-1K requires accepting two licenses:"; \
echo " 1. NVIDIA Isaac Sim EULA"; \
echo " https://docs.omniverse.nvidia.com/eula/"; \
echo " 2. BEHAVIOR Dataset Terms of Service (at runtime, when"; \
echo " you download/mount the encrypted scene+object bundle)"; \
echo ""; \
echo "Read the EULAs above, then re-run with:"; \
echo " docker build --build-arg ACCEPT_NVIDIA_EULA=YES ..."; \
echo " (or: docker/build.sh behavior1k --accept-nvidia-eula)"; \
echo "============================================================"; \
exit 1; \
fi

ENV OMNIGIBSON_HEADLESS=1 \
OMNI_KIT_ACCEPT_EULA=YES \
ACCEPT_EULA=Y \
PRIVACY_CONSENT=Y

# ── Conda environment (Python 3.10 — required by Isaac Sim 4.5.0) ──
RUN conda create -n behavior python=3.10 -y && conda clean -afy
SHELL ["conda", "run", "-n", "behavior", "/bin/bash", "-c"]

# ── Pre-reqs the v3.7.2 setup.sh enforces before installing OmniGibson ─
RUN uv pip install --no-cache-dir "numpy<2" "setuptools<=79"

# ── PyTorch 2.6.0 + CUDA 12.4 (matches BEHAVIOR-1K v3.7.2 setup.sh) ─
RUN uv pip install --no-cache-dir \
"torch==2.6.0" "torchvision==0.21.0" "torchaudio==2.6.0" \
--index-url https://download.pytorch.org/whl/cu124

# ── Isaac Sim 4.5.0 from the NVIDIA pip index ───────────────────────
# Full package list (26 wheels) mirrors v3.7.2 setup.sh `install_isaac_packages`.
# Installing only the metapackage (isaacsim) leaves
# `isaacsim.simulation_app` unimportable at runtime.
RUN uv pip install --no-cache-dir \
"omniverse-kit==106.5.0.162521" \
"isaacsim-kernel==4.5.0.0" \
"isaacsim-app==4.5.0.0" \
"isaacsim-core==4.5.0.0" \
"isaacsim-gui==4.5.0.0" \
"isaacsim-utils==4.5.0.0" \
"isaacsim-storage==4.5.0.0" \
"isaacsim-asset==4.5.0.0" \
"isaacsim-sensor==4.5.0.0" \
"isaacsim-robot-motion==4.5.0.0" \
"isaacsim-robot==4.5.0.0" \
"isaacsim-benchmark==4.5.0.0" \
"isaacsim-code-editor==4.5.0.0" \
"isaacsim-ros1==4.5.0.0" \
"isaacsim-cortex==4.5.0.0" \
"isaacsim-example==4.5.0.0" \
"isaacsim-replicator==4.5.0.0" \
"isaacsim-rl==4.5.0.0" \
"isaacsim-robot-setup==4.5.0.0" \
"isaacsim-ros2==4.5.0.0" \
"isaacsim-template==4.5.0.0" \
"isaacsim-test==4.5.0.0" \
"isaacsim==4.5.0.0" \
"isaacsim-extscache-physics==4.5.0.0" \
"isaacsim-extscache-kit==4.5.0.0" \
"isaacsim-extscache-kit-sdk==4.5.0.0" \
--extra-index-url https://pypi.nvidia.com

# Fix the bundled-websockets conflict the v3.7.2 setup.sh patches:
# Isaac Sim's pip_prebundle/websockets shadows our model-server websockets.
# The site-packages path is deterministic, so a plain `find` does the job
# without booting isaacsim (which can't import in a non-GPU build context).
RUN find /opt/conda/envs/behavior/lib/python3.10/site-packages/isaacsim/extscache \
-type d -name websockets -path "*/pip_prebundle/*" \
-exec rm -rf {} + 2>/dev/null || true
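
The same shadowing concern can be checked from inside the built image — a small sketch (the `pip_prebundle` path fragment mirrors the comment above; run it under `conda run -n behavior python`):

```python
# Detect whether a module would resolve from a vendored Isaac Sim
# "pip_prebundle" directory rather than the env's own site-packages.
import importlib.util

def resolves_from_prebundle(module_name: str) -> bool:
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        return False
    return "pip_prebundle" in spec.origin
```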

# ── Clone BEHAVIOR-1K (OmniGibson + bddl3 + joylo/gello) ───────────
# Use plain `pip install -e` (not `uv pip install -e`): BEHAVIOR-1K's
# legacy setuptools layouts (bddl3, OmniGibson, joylo) are not PEP 660
# compliant in a way uv accepts.
ARG BEHAVIOR1K_REF=v3.7.2
RUN git clone --depth 1 --branch ${BEHAVIOR1K_REF} \
https://github.com/StanfordVL/BEHAVIOR-1K.git /app/BEHAVIOR-1K
RUN cd /app/BEHAVIOR-1K && pip install --no-cache-dir -e ./bddl3
RUN cd /app/BEHAVIOR-1K && pip install --no-cache-dir -e "./OmniGibson[eval]"
RUN cd /app/BEHAVIOR-1K && pip install --no-cache-dir -e ./joylo
# Match setup.sh: cffi must be force-reinstalled to 1.17.1 (Isaac Sim
# bundles a build that conflicts with the conda libffi otherwise).
RUN pip install --no-cache-dir --force-reinstall cffi==1.17.1
# OmniGibson + lerobot transitive deps drag numpy back up to 2.x even
# though the early pre-req step pinned <2. Isaac Sim's bundled OGN
# nodes still call np.float_ (removed in numpy 2.0) and crash at scene
# init. Force-downgrade at the very end with --no-deps so we don't
# disturb other resolved versions.
RUN pip install --no-cache-dir --no-deps "numpy<2"
RUN rm -rf /app/BEHAVIOR-1K/.git
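
The invariant the final `--no-deps` pin restores is that the resolved numpy stays below 2.0 — a tiny helper for that major-version check (at build time you would feed it `importlib.metadata.version("numpy")`):

```python
# Major-version gate for the numpy<2 pin: "1.26.4" -> pre-2, "2.1.0" -> not.
def is_pre2(version_string: str) -> bool:
    return int(version_string.split(".")[0]) < 2
```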

# ── Install evaluation harness ─────────────────────────────────────
WORKDIR /workspace
COPY pyproject.toml README.md ./
COPY src/ src/
ARG HARNESS_VERSION=0.0.0
ENV SETUPTOOLS_SCM_PRETEND_VERSION=${HARNESS_VERSION}
RUN uv pip install --no-cache-dir -e .
COPY configs/ configs/

ENTRYPOINT ["conda", "run", "--no-capture-output", "-n", "behavior", "vla-eval"]
CMD ["run", "--config", "/workspace/configs/behavior1k_eval.yaml"]