feat(dirs): unified XDG asset cache + runtime licence confirmation #58
Merged
MilkClouds merged 14 commits into main on Apr 30, 2026
Conversation
Adds the BEHAVIOR-1K (OmniGibson + NVIDIA Isaac Sim 4.5.0) benchmark. The integration covers the standard StepBenchmark surface plus a demo-replay model server used to verify the dataloader against the released LeRobot v2.1 trajectories.

What's added:
- ``src/vla_eval/benchmarks/behavior1k/benchmark.py`` — Behavior1KBenchmark with the required StepBenchmark methods. R1Pro robot, 23-D action, RGB head + L/R wrist cameras. The async bridge is overridden so reset/step/cleanup run on a worker thread (Isaac Sim's SimulationApp ``signal.signal`` calls have to be monkey-patched while the worker thread is set up — they assume the main thread).
- ``src/vla_eval/model_servers/behavior1k_baseline.py`` — zero-action baseline (smoke-test sanity check).
- ``src/vla_eval/model_servers/behavior1k_demo_replay.py`` — plays back the recorded actions from a LeRobot v2.1 parquet episode. Used to verify the env wiring matches the dataset.
- ``docker/Dockerfile.behavior1k`` — Isaac Sim 4.5.0 from pypi.nvidia.com (26 omni-* / isaacsim-* wheels), bddl3 + OmniGibson[eval] + joylo from BEHAVIOR-1K v3.7.2. Gated behind ``ARG ACCEPT_NVIDIA_EULA=YES`` (NVIDIA Omniverse EULA, see https://docs.omniverse.nvidia.com/eula/).
- ``configs/behavior1k_eval.yaml`` — turning_on_radio task instance 1
- ``configs/model_servers/behavior1k/baseline.yaml`` — zero-action server
- ``docs/reproductions/behavior1k.md`` — repro write-up + data files

The behavior1k entry is registered in ``docker/build.sh`` (gated via ``--accept-license behavior1k``) and listed under ``NO_REDIST`` in ``docker/push.sh`` so the image is built locally only.

Verification:
- demo-replay on turning_on_radio (task instance 1) → success=True at step 1364/2000 (within the human-annotated press-skill window [1162, 1434] from the BEHAVIOR Dataset annotations).
- zero-action baseline → success=False at max_steps (expected).
Skill notes (``.claude/skills/add-benchmark`` and ``add-model-server``) gain a short reminder not to add ``tests/test_<name>_benchmark.py`` or ``tests/test_<name>_server.py`` with mocked sim/model libraries — ``tests/`` is for harness mechanics, not per-sim integration, and mocked modules drift from upstream and miss real bugs.
Companion to the baseline.yaml — points the demo-replay server at a LeRobot v2.1 parquet episode. ``demo_path`` is a placeholder; users swap in their own path before running.
CI surfaced three ty errors after the rebase:
- ``anyio.to_thread.run_sync(...)`` was unresolved through the module attribute path. Use the same import-as-name style the rest of the codebase already uses (``predict.py``, ``serve.py``, ``rtc.py``).
- ``signal.signal = lambda ...`` triggered ``invalid-assignment``. Use ``setattr`` so the rebinding is opaque to the type checker (the runtime behaviour — restoring the handler in ``finally`` — is unchanged).
- Drop the leftover ``# type: ignore`` mypy-style pragmas that were carrying the old workarounds; ty doesn't honour them anyway.

While here, refresh the docs that mention benchmark coverage:
- ``README.md``: BEHAVIOR-1K badge promoted from ``planned`` to ``integrated``; rlbench dropped from the registry-pulled image table; new "Build-locally images" note covering rlbench and behavior1k; build-script example shows ``--accept-license``.
- ``CONTRIBUTING.md``: integrated-benchmark roster updated to match the actual contents of ``src/vla_eval/benchmarks/`` (was missing LIBERO-Plus/Mem, RoboMME, MolmoSpaces, Kinetix; now also adds BEHAVIOR-1K).
Build-locally images (rlbench, behavior1k) now appear in both the top-of-readme support table and the Docker Images table, with a 🔒 marker indicating they're not pulled from ghcr.io and require an explicit licence opt-in.
- Top support table: 🔒 appended after the rlbench and behavior1k badges. Status legend gains a fourth entry explaining 🔒.
- Docker Images table: rlbench is restored (was dropped in the prior pass), behavior1k is added at its 23.6 GB position. For both, the Image column shows the name without a ghcr.io link, and the row carries 🔒.
- Replaces the earlier "Build-locally images" paragraph with a single caption under the table that explains the marker.
Reverts the 🔒 markers added next to the RLBench / BEHAVIOR-1K badges in the top support table, and the matching legend entry. Build-mechanism details belong in the Docker Images table further down — the support table just tracks integration / reproduction status.
Three issues from review:
- ``Behavior1KBenchmark.task_instance_id`` was set once at construction and never varied, so ``episodes_per_task > 1`` runs reloaded the same TRO state every episode (and aggregate scores could not match the 50-task × 10-instance challenge protocol). Accept ``int | list[int] | None`` and index by ``task["episode_idx"]`` cyclically when a list is given; the scalar form preserves the demo-replay use case.
- ``Behavior1KDemoReplayModelServer`` kept a single ``_current_episode_id`` / ``_step_idx`` for the whole process, so two concurrent benchmark sessions on one server would race and consume a mixed action stream. Key the cursor on ``(session_id, episode_id)``, initialise in ``on_episode_start`` and free in ``on_episode_end`` so the dict stays bounded.
- ``docs/reproductions/behavior1k.md`` build command did not pass ``--accept-license behavior1k``, so the new gated build skipped the image and the next step failed with "image not found". Updated the command and added the licence URL inline.
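The cyclic instance sweep in the first fix can be illustrated with a small sketch; the names are illustrative and the real logic lives inside ``Behavior1KBenchmark``:

```python
def resolve_task_instance(task_instance_id, episode_idx):
    """Pick the TRO task instance for a given episode.

    Sketch of the ``int | list[int] | None`` handling described above.
    """
    if task_instance_id is None:
        return 0  # assumed default when nothing is configured
    if isinstance(task_instance_id, int):
        return task_instance_id  # scalar form: same instance every episode
    # list form: sweep the instances cyclically across episodes
    return task_instance_id[episode_idx % len(task_instance_id)]
```

With a list of 10 instances and ``episodes_per_task=10``, every episode loads a distinct instance, which is what the 50-task × 10-instance protocol needs.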
Some benchmarks ship the simulator in the docker image but expect the
dataset to come from a separate, licence-restricted source (BEHAVIOR-1K
is the first concrete consumer — its dataset is governed by the
BEHAVIOR Dataset ToS and OmniGibson assets). Previously, users had to
manually run a docker invocation that wired ``download_*`` helpers
inside the image and then edit ``configs/behavior1k_eval.yaml`` so the
volume pointed at their host download directory. Two awkward steps
that don't appear anywhere a first-time reader would naturally look.
This change introduces a uniform mechanism the harness can use for any
benchmark with the same shape:
- ``Benchmark.data_requirements()`` (classmethod, base default returns
``None``) declares a :class:`DataRequirement`: licence id + URL,
the in-container data path, a marker file, and the argv to run
inside the image. ``Behavior1KBenchmark`` returns the BEHAVIOR
Dataset ToS opt-in plus the canonical ``download_*`` helper script.
- New ``vla-eval data fetch -c <config> --accept-license <id>`` CLI:
resolves the benchmark class, mounts a host cache directory
read-write at the container's data path, and runs the declared
download command. Idempotent (skips when the marker is already
present). Symmetric with ``docker/build.sh --accept-license``,
so opt-in surface is the same shape across build and fetch.
- Default cache path: ``${VLA_EVAL_DATA_DIR}/<benchmark>`` if the env
var is set, else ``${HOME}/.cache/vla-eval/<benchmark>``.
``--data-dir`` overrides explicitly. ``configs/behavior1k_eval.yaml``
resolves the same expression in its ``volumes:`` line via OmegaConf,
so a fresh clone + ``vla-eval data fetch`` + ``vla-eval run``
works without any per-host config edits.
- ``cli/config_loader.py`` now always runs ``OmegaConf.to_container``
with ``resolve=True`` (previously only on configs that used
``extends:``), so ``${oc.env:VAR,default}`` interpolations are
honoured uniformly. No-op for configs without interpolations.
- ``docs/reproductions/behavior1k.md`` replaces the manual docker
download incantation with the new ``vla-eval data fetch`` step
and notes the auto-resolved cache path.
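The cache-path precedence described above can be sketched as follows. This is an illustrative helper, not the actual CLI code; the ``env`` parameter exists only to make the sketch testable.

```python
import os
from pathlib import Path

def default_data_dir(benchmark: str, env=None) -> Path:
    """Resolve the host cache directory for a benchmark's dataset:
    ``$VLA_EVAL_DATA_DIR/<benchmark>`` when the env var is set,
    else ``~/.cache/vla-eval/<benchmark>``."""
    env = os.environ if env is None else env
    base = env.get("VLA_EVAL_DATA_DIR")
    if base:
        return Path(base) / benchmark
    return Path.home() / ".cache" / "vla-eval" / benchmark
```

The YAML side resolves the same expression via OmegaConf interpolation, so fetch and run agree on the path without per-host edits.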
Three concrete cleanups from the simplify pass:
- Use ``gpu_docker_flag(...)`` instead of building ``--gpus`` by hand. The previous form passed ``--gpus 0`` literally; ``docker run`` needs ``--gpus device=0`` for non-``all`` specs, so this was a real bug that only manifested when ``--gpus 0`` was passed at fetch time.
- Add ``cache_key`` to ``DataRequirement`` so the cache subdir is declared by the benchmark instead of derived from import-string parsing (the old form sliced ``parts[-2]`` from the module path, which silently couples the CLI to a package layout convention).
- De-duplicate ``_stderr_console`` — the new fetch handler had reimplemented it with a different rich Console signature. Hoisted to ``cli/_console.py``; both ``main.py`` and ``cmd_data.py`` now import it from one place.

While here:
- Drop the redundant ``image`` parameter from ``_build_docker_argv`` (it was always ``docker_cfg.image``).
- Drop a dead ``or []`` branch in ``set(args.accept_license or [])`` (argparse already defaults to ``[]``).
- Trim the ``DataRequirement`` and ``Behavior1KBenchmark.data_requirements`` docstrings to match the project's existing density.
- Refresh the ``config_loader.load_config`` docstring (it claimed no-extends configs went through ``yaml.safe_load`` only, which is no longer true since the prior commit always resolved interpolations).
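The ``--gpus`` fix amounts to the following shape; this is a sketch of the behaviour ``gpu_docker_flag`` guards, and the real helper's signature may differ:

```python
def gpu_docker_flag(gpus: str) -> list[str]:
    """Build the ``docker run`` GPU arguments.

    ``docker run`` accepts ``--gpus all`` verbatim, but a bare device index
    must be wrapped as ``device=<idx>`` (``--gpus 0`` is rejected).
    """
    if gpus == "all":
        return ["--gpus", "all"]
    return ["--gpus", f"device={gpus}"]
```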
Two issues from review:
- ``Behavior1KBenchmark.get_metadata()`` did not declare
``action_dim``, so realtime modes (``Orchestrator`` reads
``metadata.get("action_dim", 7)`` to size the action buffer)
would fall back to 7 and instantly crash on the first
``step()`` once it saw a 23-D R1Pro action. Add the entry.
- Drop ``--data-dir`` from ``vla-eval data fetch``. When users
passed ``--data-dir /custom`` for fetch, the dataset went there,
but ``vla-eval run`` mounts the path resolved by the YAML's
``${VLA_EVAL_DATA_DIR:-~/.cache/vla-eval}/<cache_key>`` —
fetch-target and run-target diverged silently. ``VLA_EVAL_DATA_DIR``
is the single source of truth for the cache base; both fetch and
run honour it. Users who need a different disk set the env var
once.
Six concrete improvements from the post-/simplify deep review:
- Hoist docker daemon/image helpers out of ``cli/main.py`` into ``cli/_docker.py`` so ``cmd_data`` can reuse them. ``cmd_data_fetch`` now ``check_docker_daemon``-s the host and ``image_exists_locally``-s the configured image; if the image isn't built yet it points the user at the right ``docker/build.sh ... --accept-license ...`` invocation instead of letting docker fail with a manifest-not-found error a few seconds later (``NO_REDIST`` images can't be pulled).
- ``cmd_data._resolve_benchmark_class`` now warns when a config has more than one benchmark entry — the fetcher only acts on the first, and the silent skip was a footgun for multi-benchmark configs.
- ``Behavior1KDemoReplayModelServer.predict`` raises if it's called without a prior ``on_episode_start`` for that ``(session, episode)`` key. The previous ``_step_idx.get(key, 0)`` silently restarted the cursor at 0, which would corrupt a replay if the harness ever called ``predict`` before sending ``EPISODE_START`` (e.g. on a mid-episode reconnect). Matches the explicit ``RuntimeError`` pattern ``predict.py`` uses elsewhere for protocol violations.
- ``docs/reproductions/behavior1k.md`` "How to Reproduce" header said ``100 steps``, but the config sets ``max_steps: 2000``; fix the header.
- ``.claude/skills/add-benchmark/SKILL.md`` gains a section on ``data_requirements()`` so a contributor adding a new benchmark with externally-licensed data discovers the fetch hook through the skill rather than by reading ``behavior1k`` source.
- ``docker/build.sh`` ``EULA_GATED`` and ``docker/push.sh`` ``NO_REDIST`` now cross-reference each other in their headers. These two lists must stay in sync (any image gated on build is also unpushable); the comments make the invariant visible to future contributors.
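The fail-loud cursor pattern can be sketched as an illustrative class (not the actual server code, which also loads and returns parquet actions):

```python
class ReplayCursor:
    """Per-(session_id, episode_id) step cursor that fails loud when
    predict arrives before EPISODE_START for that key."""

    def __init__(self):
        self._step_idx: dict[tuple[str, str], int] = {}

    def on_episode_start(self, session_id: str, episode_id: str) -> None:
        self._step_idx[(session_id, episode_id)] = 0

    def on_episode_end(self, session_id: str, episode_id: str) -> None:
        # free the entry so the dict stays bounded across long runs
        self._step_idx.pop((session_id, episode_id), None)

    def next_step(self, session_id: str, episode_id: str) -> int:
        key = (session_id, episode_id)
        if key not in self._step_idx:
            # a silent .get(key, 0) here would restart the replay at 0
            raise RuntimeError(f"predict() called before EPISODE_START for {key}")
        idx = self._step_idx[key]
        self._step_idx[key] = idx + 1
        return idx
```

Keying on the pair rather than a single process-wide cursor is what lets two concurrent sessions replay independent action streams.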
Re-orient PR #58 around a smaller infrastructure change. The ``vla-eval data fetch`` subcommand + ``DataRequirement`` declarative metadata layer were over-built for the actual lifecycle: the license-acceptance handshake doesn't need to be a separate pre-flight step; it can be runtime, prompted on first need, just like model-server git clones already do. Moving the licence confirmation to runtime collapses the asymmetry between benchmark-asset fetch and model-server clone fetch — both become lazy, both go through the same primitives. This commit removes the abstraction; the next commit adds the runtime-licence flow and the unified host-cache resolver.

Removed:
- ``src/vla_eval/cli/cmd_data.py`` (the ``vla-eval data fetch`` subcommand and its docker-side fetch dispatch).
- ``DataRequirement`` dataclass and ``Benchmark.data_requirements`` classmethod on ``src/vla_eval/benchmarks/base.py``.
- ``Behavior1KBenchmark.data_requirements`` method.
- ``cmd_data.register(sub)`` wiring in ``cli/main.py``.

Reverted to the PR #57 baseline:
- ``configs/behavior1k_eval.yaml`` — the data-fetch comment block and the OmegaConf volume interpolation; the next commit puts the interpolation back in extended XDG-aware form.
- ``docs/reproductions/behavior1k.md`` step 2.
- ``.claude/skills/add-benchmark/SKILL.md`` ``data_requirements`` section.

Kept (independent improvements that survive this rewrite):
- ``cli/_console.py``, ``cli/_docker.py`` (helper hoists).
- ``cli/config_loader.py`` always-on OmegaConf interpolation.
- ``Behavior1KBenchmark.task_instance_id`` per-episode sweep.
- Demo-replay per-(session, episode) cursor + ``on_episode_start`` fail-loud hook.
- ``Behavior1KBenchmark.get_metadata`` declaring ``action_dim=23``.
- README "Build-locally images" caption + 🔒 marker on rlbench/behavior1k rows; CONTRIBUTING benchmark roster refresh.
Single host-cache resolver, one git-clone helper, one licence-confirm helper. All three sites that previously rolled their own resolution (``cli/cmd_data``, ``model_servers/vlanext``, ``model_servers/mme_vla``) now go through them. License acceptance moves from the explicit pre-flight ``vla-eval data fetch`` step (reverted in the previous commit) to a runtime stdin prompt inside the eval container, with a ``--accept-license`` / ``$VLA_EVAL_ACCEPTED_LICENSES`` non-interactive escape hatch.

New module ``vla_eval/dirs.py`` (mirrors HuggingFace's ``HF_HOME`` / ``HF_ASSETS_CACHE`` precedence shape):
- ``home()`` — ``$VLA_EVAL_HOME > $XDG_CACHE_HOME/vla-eval > ~/.cache/vla-eval``.
- ``assets_cache(subdir)`` — ``$VLA_EVAL_ASSETS_CACHE > home()/assets``, with optional per-component subdir.
- ``ensure_git_clone(name, repo, rev, *, shallow=False)`` — lazy clone into ``assets_cache(name)`` at the pinned rev. Idempotent.
- ``ensure_license(license_id, *, url, description)`` — env-var bypass first (``$VLA_EVAL_ACCEPTED_LICENSES``), then interactive stdin prompt (``isatty``), else ``SystemExit`` with a hint pointing at ``--accept-license`` / the env var.

Tests in ``tests/test_dirs.py`` cover env-var precedence + every ``ensure_license`` branch.

Behavior1KBenchmark gains ``_ensure_assets(data_path)`` called from ``_init_og``. First run on a fresh host prompts for the BEHAVIOR Dataset ToS via ``ensure_license`` and runs the three OmniGibson ``download_*`` helpers. No more separate fetch step; the eval container itself populates its own cache.

Model servers ``vlanext`` and ``mme_vla`` lose their hand-rolled ``$XDG_CACHE_HOME/vla-eval/<name>`` resolution and the ``git clone or skip`` block; both now call ``assets_cache`` + ``ensure_git_clone``. ``VLANEXT_ROOT`` (vlanext-specific editable-mode override that bypasses the cache entirely) stays at the call site.
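The precedence chains above can be sketched as follows. These are illustrative versions of the ``dirs.py`` helpers; the real functions presumably read ``os.environ`` directly, and the ``env`` parameter here exists only to make the sketch testable.

```python
import os
from pathlib import Path

def home(env=None) -> Path:
    """$VLA_EVAL_HOME > $XDG_CACHE_HOME/vla-eval > ~/.cache/vla-eval."""
    env = os.environ if env is None else env
    if env.get("VLA_EVAL_HOME"):
        return Path(env["VLA_EVAL_HOME"])
    if env.get("XDG_CACHE_HOME"):
        return Path(env["XDG_CACHE_HOME"]) / "vla-eval"
    return Path.home() / ".cache" / "vla-eval"

def assets_cache(subdir=None, env=None) -> Path:
    """$VLA_EVAL_ASSETS_CACHE > home()/assets, with optional subdir."""
    env = os.environ if env is None else env
    base = env.get("VLA_EVAL_ASSETS_CACHE")
    root = Path(base) if base else home(env) / "assets"
    return root / subdir if subdir else root
```

The shape matches HuggingFace's ``HF_HOME`` / ``HF_ASSETS_CACHE`` split: one coarse override for everything, one fine override for the assets namespace.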
Orchestrator (``cli/main.py``) changes:
- Adds ``--accept-license <id>`` (repeatable) to ``vla-eval run``; forwarded into the eval container as ``VLA_EVAL_ACCEPTED_LICENSES=id1,id2,…``.
- Conditionally appends ``-i``/``-it`` to ``docker run`` so the in-container licence prompt can read the host's stdin in interactive contexts.

Config / docs:
- ``configs/behavior1k_eval.yaml`` volumes use the four-level OmegaConf fallback chain that mirrors ``assets_cache``'s Python precedence, and drops ``:ro`` so the in-container download can populate the host cache.
- ``docs/reproductions/behavior1k.md`` collapses the old two-step reproduction (manual ``docker run`` download + eval) into a single ``vla-eval run --accept-license behavior-dataset-tos`` invocation.
- ``.claude/skills/add-benchmark/SKILL.md`` gains a small "Optional: external dataset acquisition" section pointing at ``vla_eval.dirs.ensure_license`` + ``assets_cache`` and the ``_init_og``-style lazy pattern.
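The env-var forwarding can be sketched as follows; ``license_env_args`` is a hypothetical helper name, only the env var name comes from the PR:

```python
def license_env_args(accepted: list[str]) -> list[str]:
    """Build the ``docker run -e`` fragment forwarding --accept-license
    values into the container as a single comma-joined env var."""
    if not accepted:
        return []
    return ["-e", "VLA_EVAL_ACCEPTED_LICENSES=" + ",".join(accepted)]
```

Inside the container, ``ensure_license`` splits the variable on commas and skips the stdin prompt for any id found there.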
Five concrete cleanups from the post-feat review:
- ``ACCEPTED_LICENSES_ENV`` constant in ``dirs.py``; the env var name no longer repeats as a string literal across modules.
- ``tty_docker_flags()`` helper in ``docker_resources.py`` next to ``gpu_docker_flag`` / ``shard_docker_flags``. Replaces the inline ``isatty`` ladder in ``_run_via_docker``.
- Two unit tests for ``ensure_git_clone`` (idempotent short-circuit when ``.git`` exists; argv shape for ``shallow=True``). The network call still isn't unit-tested but the dispatch is now covered.
- ``ensure_license`` non-interactive message gets a banner box + ``[vla-eval]`` prefix + concrete two-line "re-run with one of" block, so the licence prompt isn't drowned in OmniGibson's initialisation logs.
- Docstring / comment widths re-flowed up to the project's 119 line-length (was wrapping ~70 chars by reflex).
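The ``isatty`` ladder behind ``tty_docker_flags`` can be sketched as below; the exact flag spelling is an assumption, and the stream parameters exist only so the sketch can be exercised without a real terminal:

```python
import sys

def tty_docker_flags(stdin=None, stdout=None) -> list[str]:
    """Choose docker interactivity flags from the host's terminal state:
    a full terminal gets ``-it`` (prompt visible), a TTY stdin with piped
    stdout gets ``-i`` only, and CI/sharded runs get nothing."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    if stdin.isatty() and stdout.isatty():
        return ["-it"]
    if stdin.isatty():
        return ["-i"]
    return []
```

In the non-interactive branch the licence prompt never appears, and ``ensure_license`` falls through to the ``VLA_EVAL_ACCEPTED_LICENSES`` bypass or exits with a hint.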
Sweep across the surface PR #58 touches: comments and docstrings that were wrapping at ~62-72 chars get reflowed to fit the project's 119 line-length, and a small set of pure-narration banner comments are removed.

Reflowed (no information loss, just wider lines):
- ``benchmarks/behavior1k/benchmark.py`` — module docstring, class docstring + ``task_instance_id`` Args block, ``_load_task_instance`` docstring, signal-handler / cleanup / async-bridge / start_episode comment blocks.
- ``model_servers/behavior1k_demo_replay.py`` — module docstring, parquet ``columns=`` rationale, per-(session, episode) cursor note.
- ``model_servers/behavior1k_baseline.py`` — module docstring.
- ``model_servers/vlanext.py`` — ``_ensure_vlanext`` docstring, shallow-clone rationale, class docstring.
- ``model_servers/mme_vla.py`` — module-level shallow-clone rationale, class docstring + Args block.
- ``docker_resources.py`` — ``tty_docker_flags`` summary, the OMP/MKL block.

Removed (clearly noise, not navigation aids):
- ``cli/main.py``: the `# CLI override for ...` / `# Validate shard args` / `# Resource allocation` / `# Decide whether to run via Docker` banners are kept (they aid navigation in long ``cmd_run``); the truly redundant ones — `# Extra volumes from config`, `# Extra env vars`, `# 1. CWD (running from repo root)` — go.
- ``cli/_docker.py`` module docstring's stale ``cmd_data`` reference.
Summary
Single host-cache resolver and licence-confirm primitive shared by every site in the harness that needs to put something on disk and (optionally) gate it behind a user opt-in. Replaces the per-site re-implementations of ``$XDG_CACHE_HOME/vla-eval/<name>`` and the per-server ``git clone or skip`` blocks; collapses the BEHAVIOR-1K reproduction from a two-step dance (manual docker download → eval) to one ``vla-eval run`` invocation.

What's in here
New:
- ``src/vla_eval/dirs.py`` (mirrors HuggingFace's ``HF_HOME`` / ``HF_ASSETS_CACHE`` precedence shape):
  - ``home()`` — ``$VLA_EVAL_HOME > $XDG_CACHE_HOME/vla-eval > ~/.cache/vla-eval``
  - ``assets_cache(subdir)`` — ``$VLA_EVAL_ASSETS_CACHE > home()/assets`` (+ optional subdir)
  - ``ensure_git_clone(name, repo, rev, *, shallow)``
  - ``ensure_license(license_id, *, url, description)`` — ``$VLA_EVAL_ACCEPTED_LICENSES`` env var → stdin prompt (``isatty``) → ``SystemExit`` with hint

Layout when no env vars are set: everything lands under ``~/.cache/vla-eval/assets/<name>``.
Future namespaces (e.g. ``datasets/`` for HF Hub datasets, ``tokens/`` for auth) sit alongside ``assets/`` under ``home()`` without adding new env vars.

Runtime licence-confirm flow
``Behavior1KBenchmark._ensure_assets(data_path)``, called from ``_init_og``, fetches BEHAVIOR-1K scene + task files on first run via the OmniGibson ``download_*`` helpers and gates the prompt through ``ensure_license``. ``vla-eval run`` adds ``--accept-license <id>`` (repeatable); the orchestrator forwards it into the eval container as ``VLA_EVAL_ACCEPTED_LICENSES=id1,id2,…`` so ``ensure_license`` skips the stdin prompt non-interactively. ``_run_via_docker`` conditionally adds ``-i``/``-it`` based on host ``isatty``, so an interactive user sees the prompt and a CI/sharded run goes straight through the env-var bypass.

Existing model-server clones migrate to the shared primitives
``vlanext.py``: ``_VLANEXT_CACHE`` → ``assets_cache("vlanext")``; the manual ``git clone`` lines → ``ensure_git_clone(name="vlanext", …)``. The ``VLANEXT_ROOT`` editable-mode override is unchanged. ``mme_vla.py``: same pattern, ``shallow=True`` for the branch clone.

Config & docs
``configs/behavior1k_eval.yaml`` volume uses the same four-level OmegaConf fallback chain that ``assets_cache`` resolves to in Python; drops ``:ro`` so the in-container fetch can populate the host cache. ``docs/reproductions/behavior1k.md`` collapses the old two-step "fetch then run" into a single ``vla-eval run --accept-license behavior-dataset-tos -c …`` invocation. ``.claude/skills/add-benchmark/SKILL.md`` gains a small "Optional: external dataset acquisition" section pointing at the new pattern.

Earlier-commit improvements still in this PR (carry-over)
- ``Behavior1KBenchmark.task_instance_id`` accepts ``int | list[int] | None`` and sweeps via ``episode_idx % len(list)``, so the 50-task × 10-instance challenge protocol is reachable.
- ``Behavior1KDemoReplayModelServer`` step cursor is keyed on ``(session_id, episode_id)`` and managed via ``on_episode_start`` / ``on_episode_end``; ``predict()`` raises if it's called before ``on_episode_start`` for that key.
- ``Behavior1KBenchmark.get_metadata()`` declares ``action_dim=23`` so realtime modes don't fall back to the orchestrator's 7-D default.
- ``README.md`` Docker Images table marks ``rlbench`` and ``behavior1k`` rows as build-locally with a 🔒 marker; ``CONTRIBUTING.md`` benchmark roster refreshed.
- ``cli/_console.py`` and ``cli/_docker.py`` host the docker / console helpers shared by ``cli/main.py``.
- ``cli/config_loader.py`` always runs ``OmegaConf.to_container(resolve=True)`` so ``${oc.env:…}`` interpolations work uniformly across configs.

Verification
End-to-end docker invocation hasn't been re-run on the cluster (would re-download ~35 GiB for no extra coverage); the existing reproduction artefact under ``docs/reproductions/data/`` still applies.

Checklist
Code changes:
- ``make check`` passes (ruff + ty)
- ``make test`` passes — 306 passed, 1 skipped

Smoke test commands run: