
feat(behavior1k): add BEHAVIOR-1K benchmark integration #57

Merged
MilkClouds merged 6 commits into main from feat/behavior1k-integration
Apr 30, 2026

Conversation

MilkClouds (Collaborator) commented Apr 29, 2026

Summary

Adds the BEHAVIOR-1K (OmniGibson + NVIDIA Isaac Sim 4.5.0) benchmark, plus a zero-action baseline and a demo-replay model server used to verify env wiring against the released LeRobot v2.1 trajectories.

Replaces #56 (auto-closed when its stacked base feat/rlbench-license-guard was deleted on merge of #55). Rebased onto main.

What's in here

Benchmark module

  • src/vla_eval/benchmarks/behavior1k/benchmark.py — Behavior1KBenchmark(StepBenchmark): R1Pro robot, 23-D action space, RGB head + L/R wrist cameras. The async bridge is overridden so reset / step / cleanup run on a worker thread (Isaac Sim's SimulationApp.__init__ calls signal.signal, which assumes the main thread, so it has to be monkey-patched while the worker thread is set up).
  • Lazy imports for OmniGibson (heavy startup, can't be at module level — registry resolves the class without loading the sim).
  • gm.HEADLESS=True set before og.launch (required to avoid an XR-extension segfault on the cluster).
  • task_instance_id accepts int | list[int] | None. Scalar fixes the same instance for every episode (demo-replay use case); a list sweeps episode_idx % len(list) to cover the 50-task × 10-instance challenge protocol.
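
The signal-handling dance above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual bridge code: `patched_signal_init` and the bare callable it wraps are hypothetical stand-ins for the overridden reset/step/cleanup path.

```python
import signal
import threading

def patched_signal_init(init_fn):
    """Run init_fn on a worker thread with signal.signal stubbed out.

    Sketch only: signal.signal raises ValueError when called off the
    main thread, so it is temporarily rebound to a no-op while the
    (Isaac Sim-like) initialisation runs, then restored.
    """
    result = {}

    def worker():
        original = signal.signal
        setattr(signal, "signal", lambda *args, **kwargs: None)  # stub it
        try:
            result["value"] = init_fn()
        finally:
            setattr(signal, "signal", original)  # always restore the real API

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["value"]
```

Using `setattr` rather than plain assignment also matches the ty workaround described later in this PR.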

Model servers

  • behavior1k_baseline.py — zero-action 23-D baseline. Smoke-test sanity check.
  • behavior1k_demo_replay.py — plays back recorded actions from a LeRobot v2.1 parquet episode. The step cursor is keyed on (session_id, episode_id) and managed via on_episode_start / on_episode_end so concurrent benchmark sessions on one server don't share state.
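
The per-(session, episode) cursor can be sketched like this; the class and method names are illustrative, not the server's real API:

```python
class DemoReplayCursor:
    """Step cursors keyed on (session_id, episode_id), modelled on the
    demo-replay server described above (names are illustrative)."""

    def __init__(self, actions_by_episode):
        self._actions = actions_by_episode  # episode_id -> list of actions
        self._cursors = {}                  # (session_id, episode_id) -> next step

    def on_episode_start(self, session_id, episode_id):
        if episode_id not in self._actions:
            # fail loud rather than silently replaying nothing
            raise KeyError(f"no recorded actions for episode {episode_id}")
        self._cursors[(session_id, episode_id)] = 0

    def next_action(self, session_id, episode_id):
        key = (session_id, episode_id)
        idx = self._cursors[key]
        self._cursors[key] = idx + 1
        return self._actions[episode_id][idx]

    def on_episode_end(self, session_id, episode_id):
        # free the cursor so the dict stays bounded across many sessions
        self._cursors.pop((session_id, episode_id), None)
```

Because each (session, episode) pair owns its own counter, two concurrent sessions replaying the same episode advance independently instead of consuming a mixed action stream.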

Docker

  • Dockerfile.behavior1k — installs Isaac Sim 4.5.0 from pypi.nvidia.com (26 omni-* / isaacsim-* wheels), bddl3 + OmniGibson[eval] + joylo from BEHAVIOR-1K v3.7.2.
  • Build is gated behind ARG ACCEPT_NVIDIA_EULA=YES (NVIDIA Omniverse EULA — https://docs.omniverse.nvidia.com/eula/). Surfaced via docker/build.sh --accept-license behavior1k, dispatched through the EULA_GATED map added in #55 (docker: opt-in license gate for rlbench builds).
  • Listed under NO_REDIST in docker/push.sh, so the image is build-locally-only.
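
The opt-in gate can be illustrated with a small bash sketch. The map contents and the `build` function below are hypothetical; the real dispatch lives in docker/build.sh.

```shell
#!/usr/bin/env bash
# Illustrative sketch of an opt-in licence gate like the EULA_GATED map
# in docker/build.sh; names and structure are assumptions, not the real script.
set -euo pipefail

declare -A EULA_GATED=(
  [behavior1k]="https://docs.omniverse.nvidia.com/eula/"
)

build() {
  local target=$1 accepted=${2:-}
  if [[ -n "${EULA_GATED[$target]:-}" && "$accepted" != "$target" ]]; then
    echo "refusing to build $target: accept ${EULA_GATED[$target]} via --accept-license $target" >&2
    return 1
  fi
  # the real script would run docker build with ACCEPT_NVIDIA_EULA=YES here
  echo "building $target"
}

build behavior1k behavior1k   # → building behavior1k
```

An ungated target builds unconditionally; a gated one refuses unless the licence flag names it explicitly, which keeps acceptance an affirmative per-image act.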

Configs / docs

  • configs/behavior1k_eval.yaml — turning_on_radio task, instance 1, max 2000 steps.
  • configs/model_servers/behavior1k/baseline.yaml — zero-action server.
  • configs/model_servers/behavior1k/demo_replay.yaml — demo-replay server (parquet path placeholder).
  • docs/reproductions/behavior1k.md — full repro write-up. Result data archived under docs/reproductions/data/.
  • README.md — BEHAVIOR-1K badge in the support table promoted from planned to integrated; the Docker Images table marks the rlbench and behavior1k rows as build-locally with a 🔒 indicator and a one-line caption explaining the licence opt-in.
  • CONTRIBUTING.md — the "Project Structure" benchmark roster updated to match the actual contents of src/vla_eval/benchmarks/.

Verification

# Demo replay (task = turning_on_radio, instance 1)
vla-eval run -c configs/behavior1k_eval.yaml \
            -m configs/model_servers/behavior1k/demo_replay.yaml
# → success=True, finished at step 1364/2000, wall 2933.8s

# Zero-action baseline
vla-eval run -c configs/behavior1k_eval.yaml \
            -m configs/model_servers/behavior1k/baseline.yaml
# → success=False at max_steps (expected)

The demo-replay success step (1364) falls inside the human-annotated press-skill window [1162, 1434] from the BEHAVIOR Dataset annotations, which gives reasonable confidence that the env wiring (action space, observation cameras, success detection) matches the upstream evaluation.

Skill notes

.claude/skills/add-benchmark and .claude/skills/add-model-server gain a short note not to add tests/test_<name>_benchmark.py or tests/test_<name>_server.py with mocked sim / model libraries. tests/ is for harness mechanics, not per-sim integration; mocked omnigibson / sapien / mujoco modules drift from upstream and miss the real bugs (import paths, action encoding, physics determinism). Verification is done via the smoke-test commands above.

Checklist

Code changes:

  • make check passes (ruff + ty)
  • make test passes (pytest) — 296 passed, 1 skipped
  • Smoke-tested affected configs (demo replay + baseline runs above)

Smoke test commands run:

make check
make test
docker/build.sh behavior1k --accept-license behavior1k
vla-eval run -c configs/behavior1k_eval.yaml -m configs/model_servers/behavior1k/baseline.yaml
vla-eval run -c configs/behavior1k_eval.yaml -m configs/model_servers/behavior1k/demo_replay.yaml

Adds the BEHAVIOR-1K (OmniGibson + NVIDIA Isaac Sim 4.5.0) benchmark.
The integration covers the standard StepBenchmark surface plus a
demo-replay model server used to verify the dataloader against the
released LeRobot v2.1 trajectories.

What's added:

- ``src/vla_eval/benchmarks/behavior1k/benchmark.py``
  Behavior1KBenchmark with the required StepBenchmark methods.
  R1Pro robot, 23-D action, RGB head + L/R wrist cameras.  The async
  bridge is overridden so reset/step/cleanup run on a worker thread
  (Isaac Sim's SimulationApp ``signal.signal`` calls have to be
  monkey-patched while the worker thread is set up — they assume the
  main thread).

- ``src/vla_eval/model_servers/behavior1k_baseline.py``
  Zero-action baseline (smoke-test sanity check).

- ``src/vla_eval/model_servers/behavior1k_demo_replay.py``
  Plays back the recorded actions from a LeRobot v2.1 parquet
  episode.  Used to verify the env wiring matches the dataset.

- ``docker/Dockerfile.behavior1k``
  Isaac Sim 4.5.0 from pypi.nvidia.com (26 omni-* / isaacsim-*
  wheels), bddl3 + OmniGibson[eval] + joylo from BEHAVIOR-1K
  v3.7.2.  Gated behind ``ARG ACCEPT_NVIDIA_EULA=YES`` (NVIDIA
  Omniverse EULA, see https://docs.omniverse.nvidia.com/eula/).

- ``configs/behavior1k_eval.yaml`` — turning_on_radio task instance 1
- ``configs/model_servers/behavior1k/baseline.yaml`` — zero-action server
- ``docs/reproductions/behavior1k.md`` — repro write-up + data files

The behavior1k entry is registered in ``docker/build.sh`` (gated
via ``--accept-license behavior1k``) and listed under ``NO_REDIST``
in ``docker/push.sh`` so the image is built locally only.

Verification:

- demo-replay on turning_on_radio (task instance 1) → success=True
  at step 1364/2000 (within the human-annotated press-skill window
  [1162, 1434] from the BEHAVIOR Dataset annotations).
- zero-action baseline → success=False at max_steps (expected).

Skill notes (``.claude/skills/add-benchmark`` and ``add-model-server``)
gain a short reminder not to add ``tests/test_<name>_benchmark.py``
or ``tests/test_<name>_server.py`` with mocked sim/model libraries —
``tests/`` is for harness mechanics, not per-sim integration, and
mocked modules drift from upstream and miss real bugs.
Companion to the baseline.yaml — points the demo-replay server at a
LeRobot v2.1 parquet episode.  ``demo_path`` is a placeholder; users
swap in their own path before running.
CI surfaced three ty errors after the rebase:

- ``anyio.to_thread.run_sync(...)`` was unresolved through the module
  attribute path.  Use the same import-as-name style the rest of the
  codebase already uses (``predict.py``, ``serve.py``, ``rtc.py``).
- ``signal.signal = lambda ...`` triggered ``invalid-assignment``.
  Use ``setattr`` so the rebinding is opaque to the type checker
  (the runtime behaviour — restoring the handler in ``finally`` —
  is unchanged).
- Drop the leftover ``# type: ignore`` mypy-style pragmas that were
  carrying the old workarounds; ty doesn't honour them anyway.

While here, refresh the docs that mention benchmark coverage:

- ``README.md``: BEHAVIOR-1K badge promoted from ``planned`` to
  ``integrated``; rlbench dropped from the registry-pulled image
  table; new "Build-locally images" note covering rlbench and
  behavior1k; build-script example shows ``--accept-license``.
- ``CONTRIBUTING.md``: integrated-benchmark roster updated to match
  the actual contents of ``src/vla_eval/benchmarks/`` (was missing
  LIBERO-Plus/Mem, RoboMME, MolmoSpaces, Kinetix; now also adds
  BEHAVIOR-1K).
Build-locally images (rlbench, behavior1k) now appear in both the
top-of-readme support table and the Docker Images table, with a
🔒 marker indicating they're not pulled from ghcr.io and require an
explicit licence opt-in.

- Top support table: 🔒 appended after the rlbench and behavior1k
  badges.  Status legend gains a fourth entry explaining 🔒.
- Docker Images table: rlbench is restored (was dropped in the prior
  pass), behavior1k is added at its 23.6 GB position.  For both, the
  Image column shows the name without a ghcr.io link, and the row
  carries 🔒.
- Replaces the earlier "Build-locally images" paragraph with a single
  caption under the table that explains the marker.
Reverts the 🔒 markers added next to the RLBench / BEHAVIOR-1K
badges in the top support table, and the matching legend entry.
Build-mechanism details belong in the Docker Images table further
down — the support table just tracks integration / reproduction
status.
Three issues from review:

- ``Behavior1KBenchmark.task_instance_id`` was set once at
  construction and never varied, so ``episodes_per_task > 1`` runs
  reloaded the same TRO state every episode (and aggregate scores
  could not match the 50-task × 10-instance challenge protocol).
  Accept ``int | list[int] | None`` and index by
  ``task["episode_idx"]`` cyclically when a list is given; the
  scalar form preserves the demo-replay use case.

- ``Behavior1KDemoReplayModelServer`` kept a single
  ``_current_episode_id`` / ``_step_idx`` for the whole process,
  so two concurrent benchmark sessions on one server would race
  and consume a mixed action stream.  Key the cursor on
  ``(session_id, episode_id)``, initialise in
  ``on_episode_start`` and free in ``on_episode_end`` so the
  dict stays bounded.

- ``docs/reproductions/behavior1k.md`` build command did not pass
  ``--accept-license behavior1k``, so the new gated build skipped
  the image and the next step failed with "image not found".
  Updated the command and added the licence URL inline.
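
The first fix above boils down to a small selection rule. The helper below is illustrative only (the real logic lives inside Behavior1KBenchmark):

```python
def resolve_task_instance_id(task_instance_id, episode_idx):
    """Pick the task instance for a given episode.

    Mirrors the int | list[int] | None contract described above:
    None defers to the benchmark default, a scalar pins one instance
    (the demo-replay use case), and a list is swept cyclically so
    episodes_per_task > 1 can cover the 10-instance protocol.
    """
    if task_instance_id is None:
        return None
    if isinstance(task_instance_id, int):
        return task_instance_id
    return task_instance_id[episode_idx % len(task_instance_id)]
```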
MilkClouds added a commit that referenced this pull request Apr 30, 2026
…ctions

Re-orient PR #58 around a smaller infrastructure change.  The
``vla-eval data fetch`` subcommand + ``DataRequirement`` declarative
metadata layer were over-built for the actual lifecycle: the
license-acceptance handshake doesn't need to be a separate pre-flight
step; it can be runtime, prompted on first need, just like model-server
git clones already do.  Moving the licence confirmation to runtime
collapses the asymmetry between benchmark-asset fetch and model-server
clone fetch — both become lazy, both go through the same primitives.

This commit removes the abstraction.  The next commit adds the
runtime-licence flow and the unified host-cache resolver.

Removed:

- ``src/vla_eval/cli/cmd_data.py`` (the ``vla-eval data fetch``
  subcommand and its docker-side fetch dispatch).
- ``DataRequirement`` dataclass and ``Benchmark.data_requirements``
  classmethod on ``src/vla_eval/benchmarks/base.py``.
- ``Behavior1KBenchmark.data_requirements`` method.
- ``cmd_data.register(sub)`` wiring in ``cli/main.py``.

Reverted to the PR #57 baseline:

- ``configs/behavior1k_eval.yaml`` — the data-fetch comment block and
  the OmegaConf volume interpolation; the next commit puts the
  interpolation back in extended XDG-aware form.
- ``docs/reproductions/behavior1k.md`` step 2.
- ``.claude/skills/add-benchmark/SKILL.md`` ``data_requirements``
  section.

Kept (independent improvements that survive this rewrite):

- ``cli/_console.py``, ``cli/_docker.py`` (helper hoists).
- ``cli/config_loader.py`` always-on OmegaConf interpolation.
- ``Behavior1KBenchmark.task_instance_id`` per-episode sweep.
- Demo-replay per-(session, episode) cursor + ``on_episode_start``
  fail-loud hook.
- ``Behavior1KBenchmark.get_metadata`` declaring ``action_dim=23``.
- README "Build-locally images" caption + 🔒 marker on rlbench/
  behavior1k rows; CONTRIBUTING benchmark roster refresh.
@MilkClouds MilkClouds merged commit 7c8afa9 into main Apr 30, 2026
6 checks passed
@MilkClouds MilkClouds deleted the feat/behavior1k-integration branch April 30, 2026 04:57