feat(behavior1k): add BEHAVIOR-1K benchmark integration #57
Merged
MilkClouds merged 6 commits into main on Apr 30, 2026
Conversation
Adds the BEHAVIOR-1K (OmniGibson + NVIDIA Isaac Sim 4.5.0) benchmark. The integration covers the standard StepBenchmark surface plus a demo-replay model server used to verify the dataloader against the released LeRobot v2.1 trajectories.

What's added:

- ``src/vla_eval/benchmarks/behavior1k/benchmark.py`` — Behavior1KBenchmark with the required StepBenchmark methods. R1Pro robot, 23-D action, RGB head + L/R wrist cameras. The async bridge is overridden so reset/step/cleanup run on a worker thread (Isaac Sim's SimulationApp ``signal.signal`` calls have to be monkey-patched while the worker thread is set up — they assume the main thread).
- ``src/vla_eval/model_servers/behavior1k_baseline.py`` — zero-action baseline (smoke-test sanity check).
- ``src/vla_eval/model_servers/behavior1k_demo_replay.py`` — plays back the recorded actions from a LeRobot v2.1 parquet episode. Used to verify the env wiring matches the dataset.
- ``docker/Dockerfile.behavior1k`` — Isaac Sim 4.5.0 from pypi.nvidia.com (26 omni-* / isaacsim-* wheels), bddl3 + OmniGibson[eval] + joylo from BEHAVIOR-1K v3.7.2. Gated behind ``ARG ACCEPT_NVIDIA_EULA=YES`` (NVIDIA Omniverse EULA, see https://docs.omniverse.nvidia.com/eula/).
- ``configs/behavior1k_eval.yaml`` — turning_on_radio task instance 1
- ``configs/model_servers/behavior1k/baseline.yaml`` — zero-action server
- ``docs/reproductions/behavior1k.md`` — repro write-up + data files

The behavior1k entry is registered in ``docker/build.sh`` (gated via ``--accept-license behavior1k``) and listed under ``NO_REDIST`` in ``docker/push.sh`` so the image is built locally only.

Verification:

- demo-replay on turning_on_radio (task instance 1) → success=True at step 1364/2000 (within the human-annotated press-skill window [1162, 1434] from the BEHAVIOR Dataset annotations).
- zero-action baseline → success=False at max_steps (expected).
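The worker-thread dispatch described above can be sketched roughly as follows. This is a minimal illustration with hypothetical names, not the actual ``benchmark.py`` code: one dedicated thread owns the simulator, and every blocking call is marshalled to it through a job queue.

```python
import queue
import threading


class WorkerThreadBridge:
    """Run blocking sim calls on a single dedicated worker thread.

    Isaac Sim expects to be initialised and stepped from one thread,
    so each call is queued to the owner thread rather than executed
    on whichever caller thread happens to invoke it.
    """

    def __init__(self):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while True:
            fn, args, done = self._jobs.get()
            if fn is None:  # shutdown sentinel
                break
            try:
                done["result"] = fn(*args)
            except Exception as exc:  # propagate to the caller
                done["error"] = exc
            finally:
                done["event"].set()

    def call(self, fn, *args):
        """Block until `fn(*args)` has run on the worker thread."""
        done = {"event": threading.Event()}
        self._jobs.put((fn, args, done))
        done["event"].wait()
        if "error" in done:
            raise done["error"]
        return done["result"]

    def close(self):
        self._jobs.put((None, (), {}))
        self._thread.join()
```

In the real integration the async bridge additionally hops from the event loop onto this pattern (via ``anyio.to_thread``), but the single-owner-thread invariant is the part that matters.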
Skill notes (``.claude/skills/add-benchmark`` and ``add-model-server``) gain a short reminder not to add ``tests/test_<name>_benchmark.py`` or ``tests/test_<name>_server.py`` with mocked sim/model libraries — ``tests/`` is for harness mechanics, not per-sim integration, and mocked modules drift from upstream and miss real bugs.
Companion to the baseline.yaml — points the demo-replay server at a LeRobot v2.1 parquet episode. ``demo_path`` is a placeholder; users swap in their own path before running.
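The loading step the replay server performs can be sketched like this, assuming the standard LeRobot v2.1 parquet column names (``episode_index``, ``action``); the function names are illustrative, and the actual server code may differ:

```python
import pandas as pd


def episode_actions(df: pd.DataFrame, episode_id: int) -> list[list[float]]:
    """Slice out the recorded action sequence for one episode.

    Assumes the LeRobot v2.1 layout: one row per step, with each
    "action" cell holding the action vector (23-D for BEHAVIOR-1K).
    """
    rows = df[df["episode_index"] == episode_id]
    return [list(a) for a in rows["action"]]


def load_episode_actions(demo_path: str, episode_id: int) -> list[list[float]]:
    """Read the parquet at demo_path and slice one episode's actions."""
    return episode_actions(pd.read_parquet(demo_path), episode_id)
```

At step ``t`` of the episode, the server simply returns ``actions[t]`` instead of querying a policy, which is why a success result confirms the env wiring rather than any model.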
CI surfaced three ty errors after the rebase:

- ``anyio.to_thread.run_sync(...)`` was unresolved through the module attribute path. Use the same import-as-name style the rest of the codebase already uses (``predict.py``, ``serve.py``, ``rtc.py``).
- ``signal.signal = lambda ...`` triggered ``invalid-assignment``. Use ``setattr`` so the rebinding is opaque to the type checker (the runtime behaviour — restoring the handler in ``finally`` — is unchanged).
- Drop the leftover ``# type: ignore`` mypy-style pragmas that were carrying the old workarounds; ty doesn't honour them anyway.

While here, refresh the docs that mention benchmark coverage:

- ``README.md``: BEHAVIOR-1K badge promoted from ``planned`` to ``integrated``; rlbench dropped from the registry-pulled image table; new "Build-locally images" note covering rlbench and behavior1k; build-script example shows ``--accept-license``.
- ``CONTRIBUTING.md``: integrated-benchmark roster updated to match the actual contents of ``src/vla_eval/benchmarks/`` (was missing LIBERO-Plus/Mem, RoboMME, MolmoSpaces, Kinetix; now also adds BEHAVIOR-1K).
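The ``setattr`` workaround can be illustrated with a small sketch (the helper name is made up; the real code does this inline around the SimulationApp launch):

```python
import signal


def patched_signal_setup(setup_fn):
    """Run `setup_fn` with signal.signal temporarily neutralised.

    SimulationApp.__init__ calls signal.signal, which raises when
    invoked off the main thread. Rebinding via setattr keeps the
    type checker satisfied (a plain `signal.signal = ...` assignment
    is flagged as invalid-assignment by ty); runtime behaviour is
    identical, including restoring the real function in finally.
    """
    original = signal.signal
    setattr(signal, "signal", lambda *args, **kwargs: None)
    try:
        return setup_fn()
    finally:
        # Always restore the real handler installer.
        setattr(signal, "signal", original)
```

The no-op replacement means any handler registration attempted during setup silently does nothing, which is exactly what a non-main worker thread needs.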
Build-locally images (rlbench, behavior1k) now appear in both the top-of-readme support table and the Docker Images table, with a 🔒 marker indicating they're not pulled from ghcr.io and require an explicit licence opt-in.

- Top support table: 🔒 appended after the rlbench and behavior1k badges. Status legend gains a fourth entry explaining 🔒.
- Docker Images table: rlbench is restored (was dropped in the prior pass), behavior1k is added at its 23.6 GB position. For both, the Image column shows the name without a ghcr.io link, and the row carries 🔒.
- Replaces the earlier "Build-locally images" paragraph with a single caption under the table that explains the marker.
Reverts the 🔒 markers added next to the RLBench / BEHAVIOR-1K badges in the top support table, and the matching legend entry. Build-mechanism details belong in the Docker Images table further down — the support table just tracks integration / reproduction status.
Three issues from review:

- ``Behavior1KBenchmark.task_instance_id`` was set once at construction and never varied, so ``episodes_per_task > 1`` runs reloaded the same TRO state every episode (and aggregate scores could not match the 50-task × 10-instance challenge protocol). Accept ``int | list[int] | None`` and index by ``task["episode_idx"]`` cyclically when a list is given; the scalar form preserves the demo-replay use case.
- ``Behavior1KDemoReplayModelServer`` kept a single ``_current_episode_id`` / ``_step_idx`` for the whole process, so two concurrent benchmark sessions on one server would race and consume a mixed action stream. Key the cursor on ``(session_id, episode_id)``, initialise in ``on_episode_start`` and free in ``on_episode_end`` so the dict stays bounded.
- ``docs/reproductions/behavior1k.md`` build command did not pass ``--accept-license behavior1k``, so the new gated build skipped the image and the next step failed with "image not found". Updated the command and added the licence URL inline.
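The per-(session, episode) cursor fix can be sketched like this; the class and method names are illustrative rather than the actual server API:

```python
class ReplayCursor:
    """Per-(session, episode) step cursors for a demo-replay server.

    A single process-wide cursor races when two benchmark sessions
    share one server. Keying on (session_id, episode_id) keeps each
    episode's stream independent, and freeing the entry at episode
    end keeps the dict bounded.
    """

    def __init__(self):
        self._cursors: dict[tuple[str, str], int] = {}

    def on_episode_start(self, session_id: str, episode_id: str) -> None:
        self._cursors[(session_id, episode_id)] = 0

    def next_step(self, session_id: str, episode_id: str) -> int:
        key = (session_id, episode_id)
        if key not in self._cursors:
            # Fail loud: a predict before on_episode_start is a wiring bug.
            raise KeyError(f"no cursor for {key}; was on_episode_start called?")
        idx = self._cursors[key]
        self._cursors[key] = idx + 1
        return idx

    def on_episode_end(self, session_id: str, episode_id: str) -> None:
        self._cursors.pop((session_id, episode_id), None)
```

Each prediction request uses the returned index to look up the next recorded action for that specific episode.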
MilkClouds added a commit that referenced this pull request on Apr 30, 2026
…ctions Re-orient PR #58 around a smaller infrastructure change. The ``vla-eval data fetch`` subcommand + ``DataRequirement`` declarative metadata layer were over-built for the actual lifecycle: the license-acceptance handshake doesn't need to be a separate pre-flight step; it can be runtime, prompted on first need, just like model-server git clones already do. Moving the licence confirmation to runtime collapses the asymmetry between benchmark-asset fetch and model-server clone fetch — both become lazy, both go through the same primitives. This commit removes the abstraction. The next commit adds the runtime-licence flow and the unified host-cache resolver.

Removed:

- ``src/vla_eval/cli/cmd_data.py`` (the ``vla-eval data fetch`` subcommand and its docker-side fetch dispatch).
- ``DataRequirement`` dataclass and ``Benchmark.data_requirements`` classmethod on ``src/vla_eval/benchmarks/base.py``.
- ``Behavior1KBenchmark.data_requirements`` method.
- ``cmd_data.register(sub)`` wiring in ``cli/main.py``.

Reverted to the PR #57 baseline:

- ``configs/behavior1k_eval.yaml`` — the data-fetch comment block and the OmegaConf volume interpolation; the next commit puts the interpolation back in extended XDG-aware form.
- ``docs/reproductions/behavior1k.md`` step 2.
- ``.claude/skills/add-benchmark/SKILL.md`` ``data_requirements`` section.

Kept (independent improvements that survive this rewrite):

- ``cli/_console.py``, ``cli/_docker.py`` (helper hoists).
- ``cli/config_loader.py`` always-on OmegaConf interpolation.
- ``Behavior1KBenchmark.task_instance_id`` per-episode sweep.
- Demo-replay per-(session, episode) cursor + ``on_episode_start`` fail-loud hook.
- ``Behavior1KBenchmark.get_metadata`` declaring ``action_dim=23``.
- README "Build-locally images" caption + 🔒 marker on rlbench/behavior1k rows; CONTRIBUTING benchmark roster refresh.
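The runtime-licence flow described above (prompt on first need, remember the answer) can be sketched roughly as follows. Everything here — the function name, the marker-file scheme, the ``confirm`` hook — is an assumption for illustration, not the actual implementation landed in the next commit:

```python
from pathlib import Path


def ensure_license_accepted(
    name: str,
    url: str,
    cache_dir: Path,
    confirm=lambda prompt: input(prompt).strip().lower() == "y",
) -> None:
    """Prompt for licence acceptance on first need, then remember it.

    A marker file in the host cache records acceptance so subsequent
    runs skip the prompt entirely; declining aborts before any
    gated asset is fetched.
    """
    marker = cache_dir / f"{name}.accepted"
    if marker.exists():
        return  # accepted on a previous run
    if not confirm(f"{name} requires accepting the licence at {url}. Accept? [y/N] "):
        raise RuntimeError(f"licence for {name} not accepted")
    cache_dir.mkdir(parents=True, exist_ok=True)
    marker.write_text("accepted\n")
```

The point of the lazy shape is that benchmark-asset fetches and model-server clones can share it: both call the same check right before their first download.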
Summary
Adds the BEHAVIOR-1K (OmniGibson + NVIDIA Isaac Sim 4.5.0) benchmark, plus a zero-action baseline and a demo-replay model server used to verify env wiring against the released LeRobot v2.1 trajectories.
What's in here
Benchmark module
src/vla_eval/benchmarks/behavior1k/benchmark.py — Behavior1KBenchmark(StepBenchmark), R1Pro robot, 23-D action, RGB head + L/R wrist cameras. The async bridge is overridden so reset/step/cleanup run on a worker thread (Isaac Sim's SimulationApp.__init__ calls signal.signal, which has to be monkey-patched while the worker thread is set up — signal.signal assumes the main thread). gm.HEADLESS=True is set before og.launch (required to avoid an XR-extension segfault on the cluster). task_instance_id accepts int | list[int] | None. Scalar fixes the same instance for every episode (demo-replay use case); a list sweeps episode_idx % len(list) to cover the 50-task × 10-instance challenge protocol.

Model servers
behavior1k_baseline.py — zero-action 23-D baseline. Smoke-test sanity check.

behavior1k_demo_replay.py — plays back recorded actions from a LeRobot v2.1 parquet episode. The step cursor is keyed on (session_id, episode_id) and managed via on_episode_start / on_episode_end so concurrent benchmark sessions on one server don't share state.

Docker
Dockerfile.behavior1k — installs Isaac Sim 4.5.0 from pypi.nvidia.com (26 omni-* / isaacsim-* wheels), bddl3 + OmniGibson[eval] + joylo from BEHAVIOR-1K v3.7.2. ARG ACCEPT_NVIDIA_EULA=YES (NVIDIA Omniverse EULA — https://docs.omniverse.nvidia.com/eula/). Surfaced via docker/build.sh --accept-license behavior1k, dispatched through the EULA_GATED map added in docker: opt-in license gate for rlbench builds #55. NO_REDIST in docker/push.sh, so the image is build-locally-only.

Configs / docs
configs/behavior1k_eval.yaml — turning_on_radio, task instance 1, max 2000 steps.

configs/model_servers/behavior1k/baseline.yaml — zero-action server.

configs/model_servers/behavior1k/demo_replay.yaml — demo-replay server (parquet path placeholder).

docs/reproductions/behavior1k.md — full repro write-up. Result data archived under docs/reproductions/data/.

README.md — BEHAVIOR-1K badge in the support table promoted from planned (·) to integrated (◇); Docker Images table marks rlbench and behavior1k rows as build-locally with a 🔒 indicator and a one-line caption explaining the licence opt-in.

CONTRIBUTING.md — the "Project Structure" benchmark roster updated to match the actual contents of src/vla_eval/benchmarks/.

Verification
The demo-replay success step (1364) falls inside the human-annotated press-skill window [1162, 1434] from the BEHAVIOR Dataset annotations, which gives reasonable confidence that the env wiring (action space, observation cameras, success detection) matches the upstream evaluation.

Skill notes
.claude/skills/add-benchmark and .claude/skills/add-model-server gain a short note not to add tests/test_<name>_benchmark.py or tests/test_<name>_server.py with mocked sim / model libraries. tests/ is for harness mechanics, not per-sim integration; mocked omnigibson / sapien / mujoco modules drift from upstream and miss the real bugs (import paths, action encoding, physics determinism). Verification is done via the smoke-test commands above.

Checklist
Code changes:
- make check passes (ruff + ty)
- make test passes (pytest) — 296 passed, 1 skipped

Smoke test commands run: