Commit 21ba002
feat(robomme): subgoal, episode video, H100 lavapipe fallback (#83)
* feat(robomme): expose per-step subgoal text from env info dict
Adds opt-in `send_subgoal` (default False) and `subgoal_mode` ("grounded"
or "simple", default "grounded") so models that condition on subgoal text
during training can receive it at eval. Pulled directly from
`info['grounded_subgoal_online']` / `info['simple_subgoal_online']`,
which `DemonstrationWrapper` always populates upstream. Grounded falls
back to simple when the placeholder isn't filled yet.
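A minimal sketch of that fallback, assuming only the two info-dict keys named above (the function name and exact signature here are hypothetical):

```python
def extract_subgoal(info: dict, subgoal_mode: str = "grounded") -> str:
    """Prefer the grounded subgoal; fall back to simple when it is unset."""
    grounded = info.get("grounded_subgoal_online")
    simple = info.get("simple_subgoal_online")
    if subgoal_mode == "simple":
        return str(simple or "")
    if grounded:  # grounded mode and the grounded text is present
        return str(grounded)
    return str(simple or "")  # grounded missing/empty: fall back to simple
```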
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(robomme): switch SAPIEN to lavapipe Vulkan to avoid H100 hang
`mani_skill ... _setup_scene()` calls `sapien.render.RenderSystem("cuda:0")`,
which hangs on H100/A100 hosts running this build of mani_skill+sapien
against the NVIDIA Vulkan driver — observed as 100% GPU util with no
progress past env init. Direct probe inside the docker image confirmed
the hang is at scene init, before any benchmark code runs.
Restore the workaround used in the prior harness: when a lavapipe ICD
is present in the container (`/opt/lavapipe/...` or
`/usr/share/vulkan/icd.d/lvp_icd.x86_64.json`, which mesa-vulkan-drivers
provides), switch `VK_ICD_FILENAMES` to it, drop the `device` arg from
`RenderSystem`, and force mani_skill's render backend to `sapien_cpu`.
On hosts without lavapipe the function is a no-op.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(docker/base): make NVIDIA ICD writes idempotent
When the Docker daemon's default-runtime is `nvidia` (common on hosts
that primarily run GPU workloads), nvidia-container-runtime injects
the NVIDIA ICD JSONs into every container as read-only bind-mounts —
including the build container. Unconditional `printf > file` then fails
with "Read-only file system" and the base image build aborts at step 5.
Skip the write when a non-empty ICD JSON is already present at the
target path. Behavior on hosts with default-runtime=runc is unchanged
(file is absent → printf creates it).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(robomme): declare 'subgoal' key in observation spec
`make_obs()` attaches `obs["subgoal"]` when `send_subgoal=True`, but
`get_observation_spec()` didn't list it — the orchestrator's spec
validation would warn (and could in principle reject) the unexpected
key. Add it as `LANGUAGE`, gated on the same flag.
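Sketched gating (the base spec key and the string spec types here are illustrative stand-ins for the orchestrator's real spec constants):

```python
def get_observation_spec(send_subgoal: bool = False) -> dict:
    spec = {"agentview_image": "IMAGE"}  # illustrative base entry
    if send_subgoal:
        # Declared only when make_obs() will actually attach the key.
        spec["subgoal"] = "LANGUAGE"
    return spec
```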
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(robomme): optionally record per-episode mp4 videos
Adds opt-in `save_episode_video: bool = False` and `video_dir: str | None`
(default `/workspace/results/videos` so it lands inside the bench
container's standard results bind-mount). On each `step()`, the latest
agentview frame is appended to a per-episode buffer; on `get_step_result()`
the buffer is encoded to mp4 with imageio at 10 fps. Filename encodes the
task name, episode index, and success/fail status.
No-op when `save_episode_video=False`. When enabled and an exception
occurs during encoding, logs a warning and resets the buffer rather than
failing the episode.
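The buffer/encode flow can be sketched like this (class and helper names are hypothetical; imageio is imported lazily so the sketch runs without it, and the append is an explicit copy, which a later follow-up in this PR makes mandatory):

```python
import numpy as np

class EpisodeVideoBuffer:
    """Minimal sketch of the per-episode frame buffer described above."""

    def __init__(self, fps: int = 10):
        self.fps = fps
        self.frames = []

    def append(self, frame) -> None:
        # Explicit copy so a reused env observation buffer can't alias
        # the stored frames.
        self.frames.append(np.array(frame, copy=True))

    def encode(self, path: str) -> None:
        # imageio is the encoder named in the commit; imported lazily here.
        import imageio
        imageio.mimsave(path, self.frames, fps=self.fps)
        self.frames.clear()

def episode_video_name(task_name: str, episode_idx: int, success: bool) -> str:
    # Hypothetical filename scheme: task, episode index, success/fail status.
    status = "success" if success else "fail"
    return f"{task_name}_ep{episode_idx:04d}_{status}.mp4"
```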
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(robomme): record episode mp4 at 20 fps to match source data
The robomme env runs at 20 Hz; the prior 10 fps encoder slowed playback
to half speed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(robomme): tighten subgoal_mode type, log monkeypatch failure, gate per-step subgoal extraction
Three small follow-ups from a review pass:
- Type `subgoal_mode` as `Literal["grounded", "simple"]` (the runtime
ValueError stays for yaml-driven configs that bypass the type
checker).
- Replace the bare `except Exception: pass` around the mani_skill
render-backend patch with a `logger.warning` so the silent failure
mode is visible at eval time.
- Skip `_extract_subgoal()` on every `step()` when `send_subgoal=False`.
The extraction is cheap but pointless when the model isn't asking
for it; this avoids ~2M no-op calls per shard.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(robomme): copy frames, capture initial frame, detect unfilled subgoal placeholders
Three follow-ups from a Gemini review on PR #77:
- Use `np.array(..., copy=True)` instead of `np.asarray(...)` when
appending agentview frames to the episode buffer. ManiSkill may reuse
the same observation buffer across steps; without an explicit copy the
saved video would be every-frame-the-last-frame.
- Capture the initial frame in `reset()` after the env resets so the
saved video covers the entire episode duration, not just from the
first action onward.
- `_extract_subgoal()` now also falls back to simple when the grounded
text still contains an unfilled `<identifier>` placeholder (matched
by `_UNFILLED_PLACEHOLDER_RE`), not only when grounded is empty/None.
Filled coords (`<70, 84>`) start with a digit and are unaffected.
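The first and third points reproduce in a few lines. `np.asarray` on an existing array is a no-copy view, so a reused env buffer makes every saved frame the last frame; and a regex of the hypothetical form below distinguishes unfilled `<identifier>` placeholders from filled coords (the real `_UNFILLED_PLACEHOLDER_RE` pattern is not shown in the commit):

```python
import re
import numpy as np

env_buffer = np.zeros((2, 2), dtype=np.uint8)
no_copy, with_copy = [], []
for step_value in (1, 2, 3):
    env_buffer[:] = step_value  # env reuses the same buffer every step
    no_copy.append(np.asarray(env_buffer))          # aliases env_buffer
    with_copy.append(np.array(env_buffer, copy=True))  # independent frame

# Every aliased "frame" now shows the last step; the copies preserve each step.
assert all(f[0, 0] == 3 for f in no_copy)
assert [f[0, 0] for f in with_copy] == [1, 2, 3]

# Hypothetical placeholder check: "<" followed by a non-digit is unfilled;
# filled coords like "<70, 84>" start with a digit and don't match.
UNFILLED_PLACEHOLDER_RE = re.compile(r"<[^\d>][^>]*>")
assert UNFILLED_PLACEHOLDER_RE.search("move to <target_object>")
assert not UNFILLED_PLACEHOLDER_RE.search("move to <70, 84>")
```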
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(robomme): move _rendering_configured below docstring
The class attribute was placed before the triple-quoted string, which
demoted the docstring to a discarded expression and made
`RoboMMEBenchmark.__doc__` evaluate to `None`. Move the attribute
after the docstring so it stays a docstring.
Caught by MilkClouds in PR #77 review.
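The pitfall, reproduced in miniature (class names are illustrative):

```python
class Broken:
    _rendering_configured = False
    """NOT a docstring -- an attribute statement precedes it, so this
    string is just a discarded expression."""

class Fixed:
    """Docstring first; attributes after."""
    _rendering_configured = False

assert Broken.__doc__ is None
assert Fixed.__doc__ == "Docstring first; attributes after."
```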
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(robomme): add ROBOMME_DISABLE_LAVAPIPE opt-out env var
Hosts where NVIDIA Vulkan is healthy don't need the lavapipe workaround
and pay a ~5–10× perf cost when it's applied (~12 fps vs ~33–80 fps at
256×256). The hang is concentrated on hosts running the closed-source
NVIDIA kernel module; open-kernel hosts (`Dual MIT/GPL`) pass cleanly
with the native Vulkan path.
`ROBOMME_DISABLE_LAVAPIPE=1` on those hosts (or in launcher env) skips
the entire workaround and lets SAPIEN use NVIDIA Vulkan directly. On
H100-9-class hosts (closed kernel) leave the env var unset and the
existing lavapipe path activates as before.
Default behaviour unchanged: lavapipe still auto-engages when its ICD
is present, so existing eval scripts that don't set the env var keep
running.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(robomme): flip lavapipe path to opt-in (ROBOMME_USE_LAVAPIPE)
Cross-host fleet probing on our DGX-H100 cluster (3/7 nodes hang, 4/7
work; kernel module is not the discriminator — H100-2/8 hang on the
open-source kernel module while H100-6/10/12 don't, with byte-identical
packages) made it clear that the lavapipe workaround is the *minority*
case, not the default. Inverting the env-var matches that:
- Default: native NVIDIA Vulkan (~30–80 fps at 256×256, no patch).
- `ROBOMME_USE_LAVAPIPE=1`: engage the 3-piece monkey-patch on hosts
that hang in SAPIEN's `submitAndWait` for image upload.
Removes the ICD-presence auto-engage path — on the published
`worv-ai/...:latest` image lavapipe ICD isn't shipped, so this was a
no-op there; on the locally-rebuilt image it auto-engaged regardless
of host health. Now lavapipe runs only when explicitly requested,
which is the conservative default.
Renames the opt-out env var (`ROBOMME_DISABLE_LAVAPIPE`) accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(robomme): ROBOMME_USE_LAVAPIPE=auto + lavapipe perf tuning (#82)
* feat(robomme): add ROBOMME_USE_LAVAPIPE=auto + LP_NUM_THREADS speedup
- New `auto` mode for ROBOMME_USE_LAVAPIPE: probes the native NVIDIA
Vulkan path in a child process (RenderSystem → take_picture →
get_picture, watchdogged). If it completes, leaves SAPIEN alone for
~30–80 fps native rendering. If the child hangs, engages the
lavapipe fallback in this process. Lets a single launcher work
across heterogeneous H100 fleets (some hosts hang, some don't,
not predictable from software config).
- When the lavapipe path is engaged (whether auto-detected or
explicitly via =1), set LP_NUM_THREADS=4 OMP_NUM_THREADS=1
MKL_NUM_THREADS=1 if not already set. Empirical sweep on A100
(probe_lp.py) shows LP_NUM_THREADS=4 is the sweet spot for
RoboMME's 256×256 frames: ~4.8 fps vs ~3.3 fps at LP=1 vs
~3.7 fps unset. ~30% lavapipe-path speedup at zero cost.
- Refactor: extract _engage_lavapipe() so the auto path can call it
after the watchdog without code duplication.
Verified on:
- Local A100 (healthy): unset → 135 fps; auto → 96 fps (probe adds
~3 s, lavapipe NOT engaged); explicit =1 → ~5 fps via lavapipe.
- DGX-H100-2 (persistent hang): auto → probe times out at 30 s,
warns, engages lavapipe (LP_NUM_THREADS=4 set automatically),
reset() completes cleanly, 30 steps at ~10 fps.
Recommended launcher pattern: set ROBOMME_USE_LAVAPIPE=auto in the
benchmark container env. One config covers both fast-path and
slow-path hosts; no need to maintain a per-host hang allowlist.
* fix(robomme): address gemini review on PR #82
- Drop the SIGALRM watchdog inside the auto-detect probe. subprocess.run's
timeout already SIGKILLs the child on hang, so the inner alarm was
redundant — and Unix-specific (signal.SIGALRM doesn't exist on Windows).
Probe wall time is now ~timeout_s (was timeout_s + 10).
- _engage_lavapipe now bails with a loud error if sapien.render is already
in sys.modules. The env-var-based knobs (VK_ICD_FILENAMES, LP_NUM_THREADS)
only take effect on first Vulkan init, so engaging after sapien is
imported would silently no-op. Surface that explicitly instead.
Re-verified on local A100 (auto → native, 95 fps, lavapipe NOT engaged).
* refactor(robomme): simplify _setup_rendering control flow + tighten probe timeout
Per /simplify pass:
- Drop the `engage` boolean local. The two branches that fall through
to engage now do so directly; the early-return branches are still
early returns. One less name, same behaviour.
- Remove the redundant "default: leave SAPIEN on native NVIDIA path"
comment — the docstring already covers it.
- Lower _native_render_path_works default timeout from 30s → 15s.
Healthy probe completes in 3-5s (sapien import + Vulkan init);
15s leaves headroom for cold-cache hosts. Saves ~15s on the
fall-through path before lavapipe kicks in on hang hosts.
Re-verified on local A100: auto → native, 115 fps, lavapipe NOT engaged.
* refactor(robomme): drop dead Helpers/Benchmark ABC banner pair
* refactor(robomme): post-review polish for #83
- Static methods on _setup_rendering / _engage_lavapipe (neither uses self)
- Extract _resolve_lavapipe_icd helper; explicit ROBOMME_LAVAPIPE_ICD=missing
now logs an error and bails instead of silently picking the default path
- Public native_render_path_works (drop underscore) so launchers can health-
check a host before scheduling
- Drop _current_task_name / _current_episode_idx instance fields; pass task
to _save_episode_video instead
- Reject save_episode_video=True without task['episode_idx'] up front so
per-episode mp4s can never collide on filename
- Docstring: clarify video_dir is ignored without save_episode_video, and
that ROBOMME_USE_LAVAPIPE decisions are cached for the process lifetime
* fix(robomme): address review on #83
- Codex P1: _engage_lavapipe now returns bool; _setup_rendering raises
RuntimeError when engagement was requested (=1 / auto+hang) but failed.
Previously the failure was logged then silently masked by setting
_rendering_configured=True, leaving the run on the known-bad native
path that would hang on the next take_picture.
- Gemini: also patch parse_sim_and_render_backend in its source module
(mani_skill.envs.utils.system.backend), not just the already-imported
mani_skill.envs.sapien_env. Modules importing it later now also see
the patched version.
- Drop redundant `import sys` / `sys.exit(0)` from _NATIVE_PROBE.
- _extract_subgoal: skip simple-side dict lookup when grounded mode hits;
cache `str(grounded)` once.
- Defensive cleanup: clear _episode_frames at top of every reset() (handles
orchestrator crash mid-episode skipping get_step_result), and zero all
buffers in cleanup() so a benchmark instance reused across runs doesn't
leak frames.
---------
Co-authored-by: lastdefiance20 <lastdefiance20@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 5e329a0 · commit 21ba002
2 files changed: 348 additions & 15 deletions