11"""BEHAVIOR-1K benchmark implementation.
22
3- BEHAVIOR-1K is a long-horizon household-activity benchmark built on
4- OmniGibson (NVIDIA Isaac Sim). The 2025 BEHAVIOR Challenge defines a
5- 50-task evaluation suite (B10/B20/B30/B40/B50) using the R1Pro
6- mobile-manipulation robot.
3+ BEHAVIOR-1K is a long-horizon household-activity benchmark built on OmniGibson (NVIDIA Isaac Sim).
4+ The 2025 BEHAVIOR Challenge defines a 50-task evaluation suite (B10/B20/B30/B40/B50) using the
5+ R1Pro mobile-manipulation robot.
76
87References:
98 - https://behavior.stanford.edu
1615 base[0:3], torso[3:7], left_arm[7:14], left_gripper[14:15],
1716 right_arm[15:22], right_gripper[22:23].
1817 - Cameras: head 720x720, left_wrist 480x480, right_wrist 480x480.
19- - Success: ``info["done"]["success"]`` (binary); the challenge
20- separately reports a partial Q-score, but we only surface the
21- binary flag here — partial scoring lives in the official
18+ - Success: ``info["done"]["success"]`` (binary); the challenge separately reports a partial
19+ Q-score, but we only surface the binary flag here — partial scoring lives in the official
2220 ``score_utils.compute_final_q_score``.
2321 - Max steps default: 5000 (or 2× human demo length when known).
2422"""
@@ -126,54 +124,43 @@ class Behavior1KBenchmark(StepBenchmark):
126124 """BEHAVIOR-1K (OmniGibson) household-activity benchmark.
127125
128126 Non-obvious behaviors:
129- - **Heavy lazy imports**: ``omnigibson`` and Isaac Sim are imported
130- inside ``_init_og()`` rather than at module top. Importing
131- OmniGibson boots the Isaac Sim runtime and consumes several
132- gigabytes of VRAM, so we delay it until ``get_tasks()`` /
133- ``reset()`` actually need it. This also keeps
134- ``vla-eval test --validate`` (a pure import-string check) fast.
135- - **Action format**: ``env.step()`` expects a ``torch.Tensor``,
136- not numpy. We convert in ``step()``.
127+ - **Heavy lazy imports**: ``omnigibson`` and Isaac Sim are imported inside ``_init_og()``
128+ rather than at module top. Importing OmniGibson boots the Isaac Sim runtime and consumes
129+ several gigabytes of VRAM, so we delay until ``get_tasks()`` / ``reset()`` actually need
130+ it. Also keeps ``vla-eval test --validate`` (a pure import-string check) fast.
131+ - **Action format**: ``env.step()`` expects a ``torch.Tensor``, not numpy. Converted in
132+ ``step()``.
137133 - **Observation flattening**: OmniGibson's nested observation
138- (``obs["robot_r1"]["sensors"]["zed"]["rgb"]``) is flattened with
139- a ``::`` delimiter via the official ``flatten_obs_dict`` helper.
140- We then look up cameras by their canonical sensor key.
141- - **Task description**: BehaviorTask does not expose a natural
142- language instruction; we use the snake-case task name with
143- underscores replaced by spaces, matching common VLA practice.
144- - **Single robot supported**: R1Pro only (the BEHAVIOR Challenge
145- 2025 standard track). A1 is reachable through OmniGibson but
146- not exercised here.
134+ (``obs["robot_r1"]["sensors"]["zed"]["rgb"]``) is flattened with a ``::`` delimiter via
135+ the official ``flatten_obs_dict`` helper. We then look up cameras by their canonical
136+ sensor key.
137+ - **Task description**: BehaviorTask does not expose a natural language instruction; we use
138+ the snake-case task name with underscores replaced by spaces, matching common VLA practice.
139+ - **Single robot supported**: R1Pro only (the BEHAVIOR Challenge 2025 standard track). A1
140+ is reachable through OmniGibson but not exercised here.
147141
148142 Args:
149143 tasks: Subset of B50 task names to evaluate. ``None`` runs all 50.
150- partial_scene_load: Pass through to OmniGibson — load only rooms
151- relevant to the task to speed up scene construction.
152- max_steps: Per-episode step cap. ``None`` keeps OmniGibson's
153- default (5000 in ``generate_basic_environment_config``).
154- send_proprio: Include the R1Pro proprio vector
155- (``robot_r1::proprio``, 256-D) in observations.
156- camera_names: Which cameras to forward to the model server.
157- Defaults to all three (``head``, ``left_wrist``, ``right_wrist``).
158- env_wrapper_target: Hydra ``_target_`` for the env wrapper. By
159- default we use OmniGibson's ``EnvironmentWrapper`` no-op
160- wrapper; override to plug in challenge-specific behaviour.
161- task_instance_id: Per-instance TRO state(s) to load after
162- ``env.reset()``, mirroring the official
163- ``Evaluator.load_task_instance``. Without this the env
164- starts from BehaviorTask's default instance (idx 0); with
165- it set, the cached
166- ``<scene>_task_<activity>_instances/<...>-tro_state.json``
167- is applied so the initial object placement matches the
168- recorded demos. Required for demo-replay reproductions.
144+ partial_scene_load: Pass through to OmniGibson — load only rooms relevant to the task to
145+ speed up scene construction.
146+ max_steps: Per-episode step cap. ``None`` keeps OmniGibson's default (5000 in
147+ ``generate_basic_environment_config``).
148+ send_proprio: Include the R1Pro proprio vector (``robot_r1::proprio``, 256-D) in observations.
149+ camera_names: Which cameras to forward to the model server. Defaults to all three
150+ (``head``, ``left_wrist``, ``right_wrist``).
151+ env_wrapper_target: Hydra ``_target_`` for the env wrapper. By default we use OmniGibson's
152+ ``EnvironmentWrapper`` no-op wrapper; override to plug in challenge-specific behaviour.
153+ task_instance_id: Per-instance TRO state(s) to load after ``env.reset()``, mirroring the
154+ official ``Evaluator.load_task_instance``. Without this the env starts from
155+ BehaviorTask's default instance (idx 0); with it set, the cached
156+ ``<scene>_task_<activity>_instances/<...>-tro_state.json`` is applied so the initial
157+ object placement matches the recorded demos. Required for demo-replay reproductions.
169158
170159 Accepts:
171- - ``None`` — use BehaviorTask's default instance every
172- episode (no TRO state load).
160+ - ``None`` — use BehaviorTask's default instance every episode (no TRO state load).
173161 - ``int`` — fix the same instance for every episode.
174- - ``list[int]`` — sweep instances; episode ``i`` uses
175- ``ids[i % len(ids)]``. Use this to reproduce the
176- challenge protocol (50 tasks × 10 instances).
162+ - ``list[int]`` — sweep instances; episode ``i`` uses ``ids[i % len(ids)]``. Use
163+ this to reproduce the challenge protocol (50 tasks × 10 instances).
177164 """
178165
179166 def __init__ (
@@ -200,8 +187,7 @@ def __init__(
200187 if unknown_cams :
201188 raise ValueError (f"Unknown R1Pro cameras: { unknown_cams } . Valid: { list (R1PRO_CAMERAS )} " )
202189 self ._env_wrapper_target = env_wrapper_target
203- # Normalize int|list|None to list[int]|None so the reset() path
204- # can index by ``episode_idx`` uniformly.
190+ # Normalize int|list|None to list[int]|None so reset() can index by ``episode_idx`` uniformly.
205191 if task_instance_id is None :
206192 self ._task_instance_ids : list [int ] | None = None
207193 elif isinstance (task_instance_id , int ):
@@ -272,12 +258,10 @@ def _ensure_assets(self, data_path: Path) -> None:
272258
273259 def _make_env (self , task_name : str ) -> Any :
274260 """Build a fresh OmniGibson env for *task_name*."""
275- # Isaac Sim's SimulationApp.__init__ calls signal.signal(SIGINT, ...)
276- # which raises ValueError when invoked from a non-main thread —
277- # but we *must* off-load env construction to a worker so the
278- # orchestrator's asyncio loop survives. The handler installed
279- # at our main-thread import of omnigibson is already in place,
280- # so it's safe to no-op the additional registration here.
261+ # Isaac Sim's SimulationApp.__init__ calls signal.signal(SIGINT, ...) which raises ValueError
262+ # when invoked from a non-main thread — but we *must* off-load env construction to a worker
263+ # so the orchestrator's asyncio loop survives. The handler installed at our main-thread
264+ # import of omnigibson is already in place, so it's safe to no-op the additional registration.
281265 import signal as _signal
282266 import threading
283267
@@ -307,8 +291,7 @@ def _make_env_inner(self, task_name: str) -> Any:
307291 generate_basic_environment_config ,
308292 )
309293
310- # The official eval disables a curated set of transition rules to
311- # match the data-collection setup.
294+ # The official eval disables a curated set of transition rules to match the data-collection setup.
312295 for rule in DISABLED_TRANSITION_RULES :
313296 rule .ENABLED = False
314297
@@ -355,13 +338,11 @@ def reset(self, task: Task) -> Any:
355338 self ._env = self ._make_env (task_name )
356339 self ._current_task_name = task_name
357340 obs , _ = self ._env .reset ()
358- # Optional per-instance TRO state load (matches official
359- # ``Evaluator.load_task_instance``). When unset, BehaviorTask
360- # uses its default instance (idx 0) — the env still runs, but
361- # object placements may diverge from a particular demo.
362- # When a list is provided, sweep instances by ``episode_idx``
363- # so consecutive episodes hit different recorded states (the
364- # 50 task × 10 instance challenge protocol).
341+ # Optional per-instance TRO state load (matches official ``Evaluator.load_task_instance``).
342+ # When unset, BehaviorTask uses its default instance (idx 0) — the env still runs, but object
343+ # placements may diverge from a particular demo. When a list is provided, sweep instances by
344+ # ``episode_idx`` so consecutive episodes hit different recorded states (the 50-task ×
345+ # 10-instance challenge protocol).
365346 if self ._task_instance_ids is not None :
366347 episode_idx = int (task .get ("episode_idx" , 0 ))
367348 instance_id = self ._task_instance_ids [episode_idx % len (self ._task_instance_ids )]
@@ -371,13 +352,12 @@ def reset(self, task: Task) -> Any:
371352 def _load_task_instance (self , instance_id : int ) -> Any :
372353 """Apply per-instance object/robot state JSON, then re-fetch obs.
373354
374- Ports the v3.7.2 ``Evaluator.load_task_instance`` (public-test
375- branch). Reads
355+ Ports the v3.7.2 ``Evaluator.load_task_instance`` (public-test branch). Reads
376356 ``<get_task_instance_path(scene)>/json/<scene>_task_<activity>_instances/<...>-tro_state.json``
377357 and pushes the recorded object/robot state into the running env.
378358
379- Compatible only with the v3.7.2 OmniGibson API: uses
380- ``robot.model_name``, `` entity.is_system`` / ``entity.exists``.
359+ Compatible only with the v3.7.2 OmniGibson API: uses ``robot.model_name``,
360+ ``entity.is_system`` / ``entity.exists``.
381361 """
382362 import json
383363 import os
@@ -429,8 +409,8 @@ def _load_task_instance(self, instance_id: int) -> Any:
429409 env .scene .update_initial_file ()
430410 env .scene .reset ()
431411
432- # Re-fetch the observation after the state load so the model
433- # server sees the post-load images / proprio.
412+ # Re-fetch the observation after the state load so the model server sees the post-load
413+ # images / proprio.
434414 obs , _ = env .get_obs ()
435415 return obs
436416
@@ -505,31 +485,24 @@ def cleanup(self) -> None:
505485 except Exception :
506486 logger .exception ("BEHAVIOR-1K env close failed" )
507487 self ._env = None
508- # Intentionally NOT calling ``omnigibson.shutdown()`` here:
509- # Isaac Sim's shutdown path can hang for many minutes
510- # (waiting on hydra texture cleanup, render contexts, etc.)
511- # which prevents the orchestrator from writing the result JSON
512- # at the end of the run. Process exit reclaims everything;
513- # leaving Isaac Sim alone is the lesser evil.
488+ # Intentionally NOT calling ``omnigibson.shutdown()`` here: Isaac Sim's shutdown path can hang
489+ # for many minutes (waiting on hydra texture cleanup, render contexts, etc.) which prevents
490+ # the orchestrator from writing the result JSON at the end of the run. Process exit reclaims
491+ # everything; leaving Isaac Sim alone is the lesser evil.
514492
515- # ------------------------------------------------------------------
516- # Async bridge override: run sync reset()/step() on a worker thread.
517- # Booting Isaac Sim from the orchestrator's main thread tears down
518- # the running asyncio event loop (SimulationApp installs its own),
519- # which makes the next `await conn.act(...)` raise NoEventLoopError.
520- # Off-loading to ``anyio.to_thread.run_sync`` keeps the orchestrator
521- # loop intact while Isaac Sim does its synchronous work.
522- # ------------------------------------------------------------------
493+ # Async bridge override: run sync reset()/step() on a worker thread. Booting Isaac Sim from the
494+ # orchestrator's main thread tears down the running asyncio event loop (SimulationApp installs
495+ # its own), which makes the next ``await conn.act(...)`` raise NoEventLoopError. Off-loading
496+ # to ``anyio.to_thread.run_sync`` keeps the orchestrator loop intact while Isaac Sim does its
497+ # synchronous work.
523498
524499 async def start_episode (self , task : Task ) -> None :
525500 self ._t0 = time .monotonic ()
526501 self ._task = task
527- # Run imports + signal-handler registration on the main thread
528- # (Python's signal module forbids setting handlers from a worker
529- # thread, and OmniGibson registers SIGINT during its top-level
530- # ``__init__.py``). Only the env construction / reset itself is
531- # offloaded to the worker thread, which is what actually trashes
532- # the asyncio event loop.
502+ # Run imports + signal-handler registration on the main thread (Python's signal module forbids
503+ # setting handlers from a worker thread, and OmniGibson registers SIGINT during its top-level
504+ # ``__init__.py``). Only the env construction / reset itself is offloaded to the worker
505+ # thread, which is what actually trashes the asyncio event loop.
533506 self ._init_og ()
534507 raw_obs = await _run_in_thread (self .reset , task )
535508 self ._last_result = StepResult (obs = raw_obs , reward = 0.0 , done = False , info = {})
0 commit comments