
feat(behavior1k): add BEHAVIOR-1K benchmark integration #57

Merged
MilkClouds merged 6 commits into main from feat/behavior1k-integration
Apr 30, 2026

Conversation

MilkClouds (Collaborator) commented Apr 29, 2026

Summary

Adds the BEHAVIOR-1K (OmniGibson + NVIDIA Isaac Sim 4.5.0) benchmark, plus a zero-action baseline and a demo-replay model server used to verify env wiring against the released LeRobot v2.1 trajectories.

Replaces #56 (auto-closed when its stacked base feat/rlbench-license-guard was deleted on merge of #55). Rebased onto main.

What's in here

Benchmark module

  • src/vla_eval/benchmarks/behavior1k/benchmark.py — Behavior1KBenchmark(StepBenchmark): R1Pro robot, 23-D action space, RGB head + L/R wrist cameras. The async bridge is overridden so reset / step / cleanup run on a worker thread (Isaac Sim's SimulationApp.__init__ calls signal.signal, which assumes the main thread, so it has to be monkey-patched while the worker thread is set up).
  • Lazy imports for OmniGibson (heavy startup, can't be at module level — registry resolves the class without loading the sim).
  • gm.HEADLESS=True set before og.launch (required to avoid an XR-extension segfault on the cluster).
  • task_instance_id accepts int | list[int] | None. Scalar fixes the same instance for every episode (demo-replay use case); a list sweeps episode_idx % len(list) to cover the 50-task × 10-instance challenge protocol.
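
The signal-handling dance above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual bridge code: `patched_signal_init` and the bare callable it wraps are hypothetical stand-ins for the overridden reset/step/cleanup path.

```python
import signal
import threading

def patched_signal_init(init_fn):
    """Run init_fn on a worker thread with signal.signal stubbed out.

    Sketch only: signal.signal raises ValueError when called off the
    main thread, so it is temporarily rebound to a no-op while the
    (Isaac Sim-like) initialisation runs, then restored.
    """
    result = {}

    def worker():
        original = signal.signal
        setattr(signal, "signal", lambda *args, **kwargs: None)  # stub it
        try:
            result["value"] = init_fn()
        finally:
            setattr(signal, "signal", original)  # always restore the real API

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["value"]
```

Using `setattr` rather than plain assignment also matches the ty workaround described later in this PR.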

Model servers

  • behavior1k_baseline.py — zero-action 23-D baseline. Smoke-test sanity check.
  • behavior1k_demo_replay.py — plays back recorded actions from a LeRobot v2.1 parquet episode. The step cursor is keyed on (session_id, episode_id) and managed via on_episode_start / on_episode_end so concurrent benchmark sessions on one server don't share state.
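
The per-(session, episode) cursor can be sketched like this; the class and method names are illustrative, not the server's real API:

```python
class DemoReplayCursor:
    """Step cursors keyed on (session_id, episode_id), modelled on the
    demo-replay server described above (names are illustrative)."""

    def __init__(self, actions_by_episode):
        self._actions = actions_by_episode  # episode_id -> list of actions
        self._cursors = {}                  # (session_id, episode_id) -> next step

    def on_episode_start(self, session_id, episode_id):
        if episode_id not in self._actions:
            # fail loud rather than silently replaying nothing
            raise KeyError(f"no recorded actions for episode {episode_id}")
        self._cursors[(session_id, episode_id)] = 0

    def next_action(self, session_id, episode_id):
        key = (session_id, episode_id)
        idx = self._cursors[key]
        self._cursors[key] = idx + 1
        return self._actions[episode_id][idx]

    def on_episode_end(self, session_id, episode_id):
        # free the cursor so the dict stays bounded across many sessions
        self._cursors.pop((session_id, episode_id), None)
```

Because each (session, episode) pair owns its own counter, two concurrent sessions replaying the same episode advance independently instead of consuming a mixed action stream.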

Docker

  • Dockerfile.behavior1k — installs Isaac Sim 4.5.0 from pypi.nvidia.com (26 omni-* / isaacsim-* wheels), bddl3 + OmniGibson[eval] + joylo from BEHAVIOR-1K v3.7.2.
  • Build is gated behind ARG ACCEPT_NVIDIA_EULA=YES (NVIDIA Omniverse EULA — https://docs.omniverse.nvidia.com/eula/). Surfaced via docker/build.sh --accept-license behavior1k, dispatched through the EULA_GATED map added in #55 (docker: opt-in license gate for rlbench builds).
  • Listed under NO_REDIST in docker/push.sh, so the image is build-locally-only.
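
The opt-in gate can be illustrated with a small bash sketch. The map contents and the `build` function below are hypothetical; the real dispatch lives in docker/build.sh.

```shell
#!/usr/bin/env bash
# Illustrative sketch of an opt-in licence gate like the EULA_GATED map
# in docker/build.sh; names and structure are assumptions, not the real script.
set -euo pipefail

declare -A EULA_GATED=(
  [behavior1k]="https://docs.omniverse.nvidia.com/eula/"
)

build() {
  local target=$1 accepted=${2:-}
  if [[ -n "${EULA_GATED[$target]:-}" && "$accepted" != "$target" ]]; then
    echo "refusing to build $target: accept ${EULA_GATED[$target]} via --accept-license $target" >&2
    return 1
  fi
  # the real script would run docker build with ACCEPT_NVIDIA_EULA=YES here
  echo "building $target"
}

build behavior1k behavior1k   # → building behavior1k
```

An ungated target builds unconditionally; a gated one refuses unless the licence flag names it explicitly, which keeps acceptance an affirmative per-image act.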

Configs / docs

  • configs/behavior1k_eval.yaml — turning_on_radio task, instance 1, max 2000 steps.
  • configs/model_servers/behavior1k/baseline.yaml — zero-action server.
  • configs/model_servers/behavior1k/demo_replay.yaml — demo-replay server (parquet path placeholder).
  • docs/reproductions/behavior1k.md — full repro write-up. Result data archived under docs/reproductions/data/.
  • README.md — BEHAVIOR-1K badge in the support table promoted from planned to integrated; the Docker Images table marks the rlbench and behavior1k rows as build-locally with a 🔒 indicator and a one-line caption explaining the licence opt-in.
  • CONTRIBUTING.md — the "Project Structure" benchmark roster updated to match the actual contents of src/vla_eval/benchmarks/.

Verification

# Demo replay (task = turning_on_radio, instance 1)
vla-eval run -c configs/behavior1k_eval.yaml \
            -m configs/model_servers/behavior1k/demo_replay.yaml
# → success=True, finished at step 1364/2000, wall 2933.8s

# Zero-action baseline
vla-eval run -c configs/behavior1k_eval.yaml \
            -m configs/model_servers/behavior1k/baseline.yaml
# → success=False at max_steps (expected)

The demo-replay success step (1364) falls inside the human-annotated press-skill window [1162, 1434] from the BEHAVIOR Dataset annotations, which gives reasonable confidence that the env wiring (action space, observation cameras, success detection) matches the upstream evaluation.

Skill notes

.claude/skills/add-benchmark and .claude/skills/add-model-server gain a short note not to add tests/test_<name>_benchmark.py or tests/test_<name>_server.py with mocked sim / model libraries. tests/ is for harness mechanics, not per-sim integration; mocked omnigibson / sapien / mujoco modules drift from upstream and miss the real bugs (import paths, action encoding, physics determinism). Verification is done via the smoke-test commands above.

Checklist

Code changes:

  • make check passes (ruff + ty)
  • make test passes (pytest) — 296 passed, 1 skipped
  • Smoke-tested affected configs (demo replay + baseline runs above)

Smoke test commands run:

make check
make test
docker/build.sh behavior1k --accept-license behavior1k
vla-eval run -c configs/behavior1k_eval.yaml -m configs/model_servers/behavior1k/baseline.yaml
vla-eval run -c configs/behavior1k_eval.yaml -m configs/model_servers/behavior1k/demo_replay.yaml

Adds the BEHAVIOR-1K (OmniGibson + NVIDIA Isaac Sim 4.5.0) benchmark.
The integration covers the standard StepBenchmark surface plus a
demo-replay model server used to verify the dataloader against the
released LeRobot v2.1 trajectories.

What's added:

- ``src/vla_eval/benchmarks/behavior1k/benchmark.py``
  Behavior1KBenchmark with the required StepBenchmark methods.
  R1Pro robot, 23-D action, RGB head + L/R wrist cameras.  The async
  bridge is overridden so reset/step/cleanup run on a worker thread
  (Isaac Sim's SimulationApp ``signal.signal`` calls have to be
  monkey-patched while the worker thread is set up — they assume the
  main thread).

- ``src/vla_eval/model_servers/behavior1k_baseline.py``
  Zero-action baseline (smoke-test sanity check).

- ``src/vla_eval/model_servers/behavior1k_demo_replay.py``
  Plays back the recorded actions from a LeRobot v2.1 parquet
  episode.  Used to verify the env wiring matches the dataset.

- ``docker/Dockerfile.behavior1k``
  Isaac Sim 4.5.0 from pypi.nvidia.com (26 omni-* / isaacsim-*
  wheels), bddl3 + OmniGibson[eval] + joylo from BEHAVIOR-1K
  v3.7.2.  Gated behind ``ARG ACCEPT_NVIDIA_EULA=YES`` (NVIDIA
  Omniverse EULA, see https://docs.omniverse.nvidia.com/eula/).

- ``configs/behavior1k_eval.yaml`` — turning_on_radio task instance 1
- ``configs/model_servers/behavior1k/baseline.yaml`` — zero-action server
- ``docs/reproductions/behavior1k.md`` — repro write-up + data files

The behavior1k entry is registered in ``docker/build.sh`` (gated
via ``--accept-license behavior1k``) and listed under ``NO_REDIST``
in ``docker/push.sh`` so the image is built locally only.

Verification:

- demo-replay on turning_on_radio (task instance 1) → success=True
  at step 1364/2000 (within the human-annotated press-skill window
  [1162, 1434] from the BEHAVIOR Dataset annotations).
- zero-action baseline → success=False at max_steps (expected).

Skill notes (``.claude/skills/add-benchmark`` and ``add-model-server``)
gain a short reminder not to add ``tests/test_<name>_benchmark.py``
or ``tests/test_<name>_server.py`` with mocked sim/model libraries —
``tests/`` is for harness mechanics, not per-sim integration, and
mocked modules drift from upstream and miss real bugs.
Companion to the baseline.yaml — points the demo-replay server at a
LeRobot v2.1 parquet episode.  ``demo_path`` is a placeholder; users
swap in their own path before running.
CI surfaced three ty errors after the rebase:

- ``anyio.to_thread.run_sync(...)`` was unresolved through the module
  attribute path.  Use the same import-as-name style the rest of the
  codebase already uses (``predict.py``, ``serve.py``, ``rtc.py``).
- ``signal.signal = lambda ...`` triggered ``invalid-assignment``.
  Use ``setattr`` so the rebinding is opaque to the type checker
  (the runtime behaviour — restoring the handler in ``finally`` —
  is unchanged).
- Drop the leftover ``# type: ignore`` mypy-style pragmas that were
  carrying the old workarounds; ty doesn't honour them anyway.

While here, refresh the docs that mention benchmark coverage:

- ``README.md``: BEHAVIOR-1K badge promoted from ``planned`` to
  ``integrated``; rlbench dropped from the registry-pulled image
  table; new "Build-locally images" note covering rlbench and
  behavior1k; build-script example shows ``--accept-license``.
- ``CONTRIBUTING.md``: integrated-benchmark roster updated to match
  the actual contents of ``src/vla_eval/benchmarks/`` (was missing
  LIBERO-Plus/Mem, RoboMME, MolmoSpaces, Kinetix; now also adds
  BEHAVIOR-1K).
Build-locally images (rlbench, behavior1k) now appear in both the
top-of-readme support table and the Docker Images table, with a
🔒 marker indicating they're not pulled from ghcr.io and require an
explicit licence opt-in.

- Top support table: 🔒 appended after the rlbench and behavior1k
  badges.  Status legend gains a fourth entry explaining 🔒.
- Docker Images table: rlbench is restored (was dropped in the prior
  pass), behavior1k is added at its 23.6 GB position.  For both, the
  Image column shows the name without a ghcr.io link, and the row
  carries 🔒.
- Replaces the earlier "Build-locally images" paragraph with a single
  caption under the table that explains the marker.
Reverts the 🔒 markers added next to the RLBench / BEHAVIOR-1K
badges in the top support table, and the matching legend entry.
Build-mechanism details belong in the Docker Images table further
down — the support table just tracks integration / reproduction
status.
Three issues from review:

- ``Behavior1KBenchmark.task_instance_id`` was set once at
  construction and never varied, so ``episodes_per_task > 1`` runs
  reloaded the same TRO state every episode (and aggregate scores
  could not match the 50-task × 10-instance challenge protocol).
  Accept ``int | list[int] | None`` and index by
  ``task["episode_idx"]`` cyclically when a list is given; the
  scalar form preserves the demo-replay use case.

- ``Behavior1KDemoReplayModelServer`` kept a single
  ``_current_episode_id`` / ``_step_idx`` for the whole process,
  so two concurrent benchmark sessions on one server would race
  and consume a mixed action stream.  Key the cursor on
  ``(session_id, episode_id)``, initialise in
  ``on_episode_start`` and free in ``on_episode_end`` so the
  dict stays bounded.

- ``docs/reproductions/behavior1k.md`` build command did not pass
  ``--accept-license behavior1k``, so the new gated build skipped
  the image and the next step failed with "image not found".
  Updated the command and added the licence URL inline.
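
The first fix above boils down to a small selection rule. The helper below is illustrative only (the real logic lives inside Behavior1KBenchmark):

```python
def resolve_task_instance_id(task_instance_id, episode_idx):
    """Pick the task instance for a given episode.

    Mirrors the int | list[int] | None contract described above:
    None defers to the benchmark default, a scalar pins one instance
    (the demo-replay use case), and a list is swept cyclically so
    episodes_per_task > 1 can cover the 10-instance protocol.
    """
    if task_instance_id is None:
        return None
    if isinstance(task_instance_id, int):
        return task_instance_id
    return task_instance_id[episode_idx % len(task_instance_id)]
```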
MilkClouds added a commit that referenced this pull request Apr 30, 2026
…ctions

Re-orient PR #58 around a smaller infrastructure change.  The
``vla-eval data fetch`` subcommand + ``DataRequirement`` declarative
metadata layer were over-built for the actual lifecycle: the
license-acceptance handshake doesn't need to be a separate pre-flight
step; it can be runtime, prompted on first need, just like model-server
git clones already do.  Moving the licence confirmation to runtime
collapses the asymmetry between benchmark-asset fetch and model-server
clone fetch — both become lazy, both go through the same primitives.

This commit removes the abstraction.  The next commit adds the
runtime-licence flow and the unified host-cache resolver.

Removed:

- ``src/vla_eval/cli/cmd_data.py`` (the ``vla-eval data fetch``
  subcommand and its docker-side fetch dispatch).
- ``DataRequirement`` dataclass and ``Benchmark.data_requirements``
  classmethod on ``src/vla_eval/benchmarks/base.py``.
- ``Behavior1KBenchmark.data_requirements`` method.
- ``cmd_data.register(sub)`` wiring in ``cli/main.py``.

Reverted to the PR #57 baseline:

- ``configs/behavior1k_eval.yaml`` — the data-fetch comment block and
  the OmegaConf volume interpolation; the next commit puts the
  interpolation back in extended XDG-aware form.
- ``docs/reproductions/behavior1k.md`` step 2.
- ``.claude/skills/add-benchmark/SKILL.md`` ``data_requirements``
  section.

Kept (independent improvements that survive this rewrite):

- ``cli/_console.py``, ``cli/_docker.py`` (helper hoists).
- ``cli/config_loader.py`` always-on OmegaConf interpolation.
- ``Behavior1KBenchmark.task_instance_id`` per-episode sweep.
- Demo-replay per-(session, episode) cursor + ``on_episode_start``
  fail-loud hook.
- ``Behavior1KBenchmark.get_metadata`` declaring ``action_dim=23``.
- README "Build-locally images" caption + 🔒 marker on rlbench/
  behavior1k rows; CONTRIBUTING benchmark roster refresh.
@MilkClouds MilkClouds merged commit 7c8afa9 into main Apr 30, 2026
6 checks passed
@MilkClouds MilkClouds deleted the feat/behavior1k-integration branch April 30, 2026 04:57