
Commit 21ba002

MilkClouds, lastdefiance20, and claude committed
feat(robomme): subgoal, episode video, H100 lavapipe fallback (#83)
* feat(robomme): expose per-step subgoal text from env info dict

  Adds opt-in `send_subgoal` (default False) and `subgoal_mode` ("grounded" or "simple", default "grounded") so models that condition on subgoal text during training can receive it at eval. Pulled directly from `info['grounded_subgoal_online']` / `info['simple_subgoal_online']`, which `DemonstrationWrapper` always populates upstream. Grounded falls back to simple when the placeholder isn't filled yet.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(robomme): switch SAPIEN to lavapipe Vulkan to avoid H100 hang

  `mani_skill ... _setup_scene()` calls `sapien.render.RenderSystem("cuda:0")`, which hangs on H100/A100 hosts running this build of mani_skill+sapien against the NVIDIA Vulkan driver — observed as 100% GPU util with no progress past env init. A direct probe inside the docker image confirmed the hang is at scene init, before any benchmark code runs. Restore the workaround used in the prior harness: when a lavapipe ICD is present in the container (`/opt/lavapipe/...` or `/usr/share/vulkan/icd.d/lvp_icd.x86_64.json`, which mesa-vulkan-drivers provides), switch `VK_ICD_FILENAMES` to it, drop the `device` arg from `RenderSystem`, and force mani_skill's render backend to `sapien_cpu`. On hosts without lavapipe the function is a no-op.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(docker/base): make NVIDIA ICD writes idempotent

  When the Docker daemon's default-runtime is `nvidia` (common on hosts that primarily run GPU workloads), nvidia-container-runtime injects the NVIDIA ICD JSONs into every container as read-only bind-mounts — including the build container. An unconditional `printf > file` then fails with "Read-only file system" and the base image build aborts at step 5. Skip the write when a non-empty ICD JSON is already present at the target path. Behavior on hosts with default-runtime=runc is unchanged (the file is absent → printf creates it).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(robomme): declare 'subgoal' key in observation spec

  `make_obs()` attaches `obs["subgoal"]` when `send_subgoal=True`, but `get_observation_spec()` didn't list it — the orchestrator's spec validation would warn about (and could in principle reject) the unexpected key. Add it as `LANGUAGE`, gated on the same flag.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(robomme): optionally record per-episode mp4 videos

  Adds opt-in `save_episode_video: bool = False` and `video_dir: str | None` (default `/workspace/results/videos`, so it lands inside the bench container's standard results bind-mount). On each `step()`, the latest agentview frame is appended to a per-episode buffer; on `get_step_result()` the buffer is encoded to mp4 with imageio at 10 fps. The filename encodes the task name, episode index, and success/fail status. No-op when `save_episode_video=False`. When enabled and an exception occurs during encoding, logs a warning and resets the buffer rather than failing the episode.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(robomme): record episode mp4 at 20 fps to match source data

  The robomme env runs at 20 Hz; the prior 10 fps encoder slowed playback to half speed.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
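  The recording code itself is not part of the diff shown on this page (it lives in the robomme benchmark module), but a minimal sketch of the encode step described above could look like the following. The function name, argument layout, and filename pattern are illustrative assumptions; only the imageio encoding, the 20 fps rate, and the warn-and-drop behaviour come from the commit message.

  ```python
  import logging
  import os

  import imageio.v2 as imageio
  import numpy as np

  logger = logging.getLogger(__name__)


  def save_episode_video(frames, video_dir, task_name, episode_idx, success, fps=20):
      """Encode buffered agentview frames to mp4; the name encodes task, episode, outcome."""
      if not frames:
          return None
      os.makedirs(video_dir, exist_ok=True)
      status = "success" if success else "fail"
      path = os.path.join(video_dir, f"{task_name}_ep{episode_idx:04d}_{status}.mp4")
      try:
          with imageio.get_writer(path, fps=fps) as writer:
              for frame in frames:
                  # Frames are assumed to have been appended with np.array(..., copy=True)
                  # at step time, so the buffer holds snapshots rather than aliased views.
                  writer.append_data(np.asarray(frame, dtype=np.uint8))
      except Exception:
          # Mirror the commit's behaviour: warn and drop the buffer, don't fail the episode.
          logger.warning("episode video encoding failed for %s", path, exc_info=True)
          return None
      finally:
          frames.clear()
      return path
  ```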
* chore(robomme): tighten subgoal_mode type, log monkeypatch failure, gate per-step subgoal extraction

  Three small follow-ups from a review pass:

  - Type `subgoal_mode` as `Literal["grounded", "simple"]` (the runtime ValueError stays for yaml-driven configs that bypass the type checker).
  - Replace the bare `except Exception: pass` around the mani_skill render-backend patch with a `logger.warning` so the silent failure mode is visible at eval time.
  - Skip `_extract_subgoal()` on every `step()` when `send_subgoal=False`. The extraction is cheap but pointless when the model isn't asking for it; this avoids ~2M no-op calls per shard.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(robomme): copy frames, capture initial frame, detect unfilled subgoal placeholders

  Three follow-ups from a Gemini review on PR #77:

  - Use `np.array(..., copy=True)` instead of `np.asarray(...)` when appending agentview frames to the episode buffer. ManiSkill may reuse the same observation buffer across steps; without an explicit copy the saved video would be every-frame-the-last-frame.
  - Capture the initial frame in `reset()` after the env resets, so the saved video covers the entire episode duration, not just from the first action onward.
  - `_extract_subgoal()` now also falls back to simple when the grounded text still contains an unfilled `<identifier>` placeholder (matched by `_UNFILLED_PLACEHOLDER_RE`), not only when grounded is empty/None. Filled coords (`<70, 84>`) start with a digit and are unaffected.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
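  For orientation, here is a rough sketch of the grounded-to-simple fallback with unfilled-placeholder detection that the subgoal commits above describe. The info-dict keys come from the commit message; the function signature and the exact regex are assumptions, not the harness's actual code.

  ```python
  import re
  from typing import Literal

  # Assumed pattern: an unfilled "<identifier>" placeholder starts with a letter or
  # underscore, while filled coordinates such as "<70, 84>" start with a digit.
  _UNFILLED_PLACEHOLDER_RE = re.compile(r"<[A-Za-z_][^>]*>")


  def extract_subgoal(info: dict, mode: Literal["grounded", "simple"] = "grounded") -> str | None:
      """Pull per-step subgoal text from the env info dict, falling back from
      grounded to simple when the grounded text is missing or still templated."""
      if mode == "grounded":
          grounded = info.get("grounded_subgoal_online")
          if grounded:
              text = str(grounded)  # cache the str() once, reuse below
              if not _UNFILLED_PLACEHOLDER_RE.search(text):
                  return text  # grounded hit: no simple-side lookup needed
      simple = info.get("simple_subgoal_online")
      return str(simple) if simple else None
  ```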
* fix(robomme): move _rendering_configured below docstring

  The class attribute was placed before the triple-quoted string, which demoted the docstring to a discarded expression and made `RoboMMEBenchmark.__doc__` evaluate to `None`. Move the attribute after the docstring so it stays a docstring. Caught by MilkClouds in PR #77 review.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(robomme): add ROBOMME_DISABLE_LAVAPIPE opt-out env var

  Hosts where NVIDIA Vulkan is healthy don't need the lavapipe workaround and pay a ~5–10× perf cost when it's applied (~12 fps vs ~33–80 fps at 256×256). The hang is concentrated on hosts running the closed-source NVIDIA kernel module; open-kernel hosts (`Dual MIT/GPL`) pass cleanly with the native Vulkan path. Setting `ROBOMME_DISABLE_LAVAPIPE=1` on those hosts (or in the launcher env) skips the entire workaround and lets SAPIEN use NVIDIA Vulkan directly. On H100-9-class hosts (closed kernel), leave the env var unset and the existing lavapipe path activates as before. Default behaviour is unchanged: lavapipe still auto-engages when its ICD is present, so existing eval scripts that don't set the env var keep running.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(robomme): flip lavapipe path to opt-in (ROBOMME_USE_LAVAPIPE)

  Cross-host fleet probing on our DGX-H100 cluster (3/7 nodes hang, 4/7 work; the kernel module is not the discriminator — H100-2/8 hang on the open-source kernel module while H100-6/10/12 don't, with byte-identical packages) made it clear that the lavapipe workaround is the *minority* case, not the default. Inverting the env var matches that:

  - Default: native NVIDIA Vulkan (~30–80 fps at 256×256, no patch).
  - `ROBOMME_USE_LAVAPIPE=1`: engage the 3-piece monkey-patch on hosts that hang in SAPIEN's `submitAndWait` for image upload.

  Removes the ICD-presence auto-engage path — on the published `worv-ai/...:latest` image the lavapipe ICD isn't shipped, so this was a no-op there; on the locally rebuilt image it auto-engaged regardless of host health. Now lavapipe runs only when explicitly requested, which is the conservative default. Renames the opt-out env var (`ROBOMME_DISABLE_LAVAPIPE`) accordingly.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(robomme): ROBOMME_USE_LAVAPIPE=auto + lavapipe perf tuning (#82)

  * feat(robomme): add ROBOMME_USE_LAVAPIPE=auto + LP_NUM_THREADS speedup

    - New `auto` mode for ROBOMME_USE_LAVAPIPE: probes the native NVIDIA Vulkan path in a child process (RenderSystem → take_picture → get_picture, watchdogged). If it completes, leaves SAPIEN alone for ~30–80 fps native rendering. If the child hangs, engages the lavapipe fallback in this process. Lets a single launcher work across heterogeneous H100 fleets (some hosts hang, some don't, and it isn't predictable from software config).
    - When the lavapipe path is engaged (whether auto-detected or explicitly via =1), set LP_NUM_THREADS=4 OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 if not already set. An empirical sweep on A100 (probe_lp.py) shows LP_NUM_THREADS=4 is the sweet spot for RoboMME's 256×256 frames: ~4.8 fps vs ~3.3 fps at LP=1 vs ~3.7 fps unset. ~30% lavapipe-path speedup at zero cost.
    - Refactor: extract _engage_lavapipe() so the auto path can call it after the watchdog without code duplication.

    Verified on:
    - Local A100 (healthy): unset → 135 fps; auto → 96 fps (probe adds ~3 s, lavapipe NOT engaged); explicit =1 → ~5 fps via lavapipe.
    - DGX-H100-2 (persistent hang): auto → probe times out at 30 s, warns, engages lavapipe (LP_NUM_THREADS=4 set automatically), reset() completes cleanly, 30 steps at ~10 fps.

    Recommended launcher pattern: set ROBOMME_USE_LAVAPIPE=auto in the benchmark container env. One config covers both fast-path and slow-path hosts; no need to maintain a per-host hang allowlist.

  * fix(robomme): address gemini review on PR #82

    - Drop the SIGALRM watchdog inside the auto-detect probe. subprocess.run's timeout already SIGKILLs the child on hang, so the inner alarm was redundant — and Unix-specific (signal.SIGALRM doesn't exist on Windows). Probe wall time is now ~timeout_s (was timeout_s + 10).
    - _engage_lavapipe now bails with a loud error if sapien.render is already in sys.modules. The env-var-based knobs (VK_ICD_FILENAMES, LP_NUM_THREADS) only take effect on first Vulkan init, so engaging after sapien is imported would silently no-op. Surface that explicitly instead.

    Re-verified on local A100 (auto → native, 95 fps, lavapipe NOT engaged).

  * refactor(robomme): simplify _setup_rendering control flow + tighten probe timeout

    Per /simplify pass:
    - Drop the `engage` boolean local. The two branches that fall through to engage now do so directly; the early-return branches are still early returns. One less name, same behaviour.
    - Remove the redundant "default: leave SAPIEN on native NVIDIA path" comment — the docstring already covers it.
    - Lower the _native_render_path_works default timeout from 30 s to 15 s. A healthy probe completes in 3–5 s (sapien import + Vulkan init); 15 s leaves headroom for cold-cache hosts. Saves ~15 s on the fall-through path before lavapipe kicks in on hang hosts.

    Re-verified on local A100: auto → native, 115 fps, lavapipe NOT engaged.
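  A condensed sketch of the `auto` flow described in #82, under simplifying assumptions: the real `_NATIVE_PROBE` exercises RenderSystem → take_picture → get_picture and the real engage step also patches mani_skill's render backend; here only the child-process timeout detection and the env-var knobs named in the commit message are shown. Function and constant names are illustrative.

  ```python
  import os
  import subprocess
  import sys

  # Simplified stand-in for the probe: only checks that the NVIDIA Vulkan renderer
  # initialises, which is where the affected hosts hang in practice.
  NATIVE_PROBE = 'import sapien.render; sapien.render.RenderSystem("cuda:0")'


  def native_render_path_works(timeout_s: float = 15.0) -> bool:
      """Run the probe in a child process; a hang surfaces as a timeout, and
      subprocess.run kills the child before re-raising TimeoutExpired."""
      try:
          result = subprocess.run(
              [sys.executable, "-c", NATIVE_PROBE],
              timeout=timeout_s,
              capture_output=True,
          )
          return result.returncode == 0
      except subprocess.TimeoutExpired:
          return False


  def engage_lavapipe(icd_path: str) -> bool:
      """Point Vulkan at the lavapipe ICD and pin thread counts; must run before
      sapien's first Vulkan init (the knobs are ignored afterwards)."""
      if not os.path.isfile(icd_path):
          return False
      os.environ["VK_ICD_FILENAMES"] = icd_path
      os.environ.setdefault("LP_NUM_THREADS", "4")   # measured sweet spot for 256x256 frames
      os.environ.setdefault("OMP_NUM_THREADS", "1")
      os.environ.setdefault("MKL_NUM_THREADS", "1")
      return True
  ```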
* refactor(robomme): drop dead Helpers/Benchmark ABC banner pair

* refactor(robomme): post-review polish for #83

  - Static methods on _setup_rendering / _engage_lavapipe (neither uses self)
  - Extract a _resolve_lavapipe_icd helper; an explicit ROBOMME_LAVAPIPE_ICD pointing at a missing file now logs an error and bails instead of silently picking the default path
  - Public native_render_path_works (drop the underscore) so launchers can health-check a host before scheduling
  - Drop the _current_task_name / _current_episode_idx instance fields; pass the task to _save_episode_video instead
  - Reject save_episode_video=True without task['episode_idx'] up front so per-episode mp4s can never collide on filename
  - Docstring: clarify that video_dir is ignored without save_episode_video, and that ROBOMME_USE_LAVAPIPE decisions are cached for the process lifetime

* fix(robomme): address review on #83

  - Codex P1: _engage_lavapipe now returns bool; _setup_rendering raises RuntimeError when engagement was requested (=1 / auto+hang) but failed. Previously the failure was logged then silently masked by setting _rendering_configured=True, leaving the run on the known-bad native path that would hang on the next take_picture.
  - Gemini: also patch parse_sim_and_render_backend in its source module (mani_skill.envs.utils.system.backend), not just the already-imported mani_skill.envs.sapien_env. Modules importing it later now also see the patched version.
  - Drop the redundant `import sys` / `sys.exit(0)` from _NATIVE_PROBE.
  - _extract_subgoal: skip the simple-side dict lookup when grounded mode hits; cache `str(grounded)` once.
  - Defensive cleanup: clear _episode_frames at the top of every reset() (handles an orchestrator crash mid-episode that skips get_step_result), and zero all buffers in cleanup() so a benchmark instance reused across runs doesn't leak frames.

---------

Co-authored-by: lastdefiance20 <lastdefiance20@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
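Tying the pieces together, a speculative outline of the _setup_rendering decision path after the review fixes (fail loudly when lavapipe was requested but could not be engaged, and refuse to engage once `sapien.render` is imported). The helper callables are stand-ins for the module's real helpers, not actual signatures.

```python
import os
import sys


def setup_rendering(resolve_icd, engage_lavapipe, native_probe) -> None:
    """Mirror the ROBOMME_USE_LAVAPIPE contract: unset/0 = native NVIDIA Vulkan,
    1 = force the lavapipe fallback, auto = probe first and fall back on hang."""
    mode = os.environ.get("ROBOMME_USE_LAVAPIPE", "").strip().lower()
    if mode in ("", "0"):
        return  # default: leave SAPIEN on the native path
    if mode == "auto" and native_probe():
        return  # probe completed, native path is healthy on this host
    if "sapien.render" in sys.modules:
        # Env-var knobs only take effect before the first Vulkan init.
        raise RuntimeError("lavapipe requested after sapien.render was already imported")
    icd = resolve_icd()
    if icd is None or not engage_lavapipe(icd):
        # Engagement was requested (=1, or auto plus a hang) but failed: fail loudly
        # instead of continuing on a native path known to hang at take_picture.
        raise RuntimeError("lavapipe engagement requested but could not be applied")
```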
1 parent 5e329a0 commit 21ba002

2 files changed

Lines changed: 348 additions & 15 deletions

File tree

docker/Dockerfile.base

Lines changed: 10 additions & 4 deletions
@@ -28,11 +28,17 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
     && rm -rf /var/lib/apt/lists/*
 
 # ── Vulkan ICD for NVIDIA GPU ────────────────────────────────────────
+# Skip writes when the ICD file is already present — on hosts whose Docker
+# default-runtime is `nvidia`, nvidia-container-runtime bind-mounts these
+# JSONs read-only into every container (including the build container), so
+# an unconditional write would fail with "Read-only file system".
 RUN mkdir -p /usr/share/vulkan/icd.d /usr/share/glvnd/egl_vendor.d \
-    && printf '{"file_format_version":"1.0.0","ICD":{"library_path":"libGLX_nvidia.so.0","api_version":"1.2.155"}}\n' \
-       > /usr/share/vulkan/icd.d/nvidia_icd.json \
-    && printf '{"file_format_version":"1.0.0","ICD":{"library_path":"libEGL_nvidia.so.0"}}\n' \
-       > /usr/share/glvnd/egl_vendor.d/10_nvidia.json
+    && { [ -s /usr/share/vulkan/icd.d/nvidia_icd.json ] \
+         || printf '{"file_format_version":"1.0.0","ICD":{"library_path":"libGLX_nvidia.so.0","api_version":"1.2.155"}}\n' \
+            > /usr/share/vulkan/icd.d/nvidia_icd.json; } \
+    && { [ -s /usr/share/glvnd/egl_vendor.d/10_nvidia.json ] \
+         || printf '{"file_format_version":"1.0.0","ICD":{"library_path":"libEGL_nvidia.so.0"}}\n' \
+            > /usr/share/glvnd/egl_vendor.d/10_nvidia.json; }
 
 # ── Miniforge (lightweight conda, conda-forge defaults) ──────────────
 RUN wget -qO /tmp/miniforge.sh \
