Commit 51bf04d

revert: drop DataRequirement / vla-eval data fetch / cache_key abstractions
Re-orient PR #58 around a smaller infrastructure change. The ``vla-eval data fetch`` subcommand + ``DataRequirement`` declarative metadata layer were over-built for the actual lifecycle: the license-acceptance handshake doesn't need to be a separate pre-flight step; it can be runtime, prompted on first need, just like model-server git clones already do. Moving the licence confirmation to runtime collapses the asymmetry between benchmark-asset fetch and model-server clone fetch — both become lazy, both go through the same primitives. This commit removes the abstraction. The next commit adds the runtime-licence flow and the unified host-cache resolver.

Removed:
- ``src/vla_eval/cli/cmd_data.py`` (the ``vla-eval data fetch`` subcommand and its docker-side fetch dispatch).
- ``DataRequirement`` dataclass and ``Benchmark.data_requirements`` classmethod in ``src/vla_eval/benchmarks/base.py``.
- ``Behavior1KBenchmark.data_requirements`` method.
- ``cmd_data.register(sub)`` wiring in ``cli/main.py``.

Reverted to the PR #57 baseline:
- ``configs/behavior1k_eval.yaml`` — the data-fetch comment block and the OmegaConf volume interpolation; the next commit puts the interpolation back in extended XDG-aware form.
- ``docs/reproductions/behavior1k.md`` step 2.
- ``.claude/skills/add-benchmark/SKILL.md`` ``data_requirements`` section.

Kept (independent improvements that survive this rewrite):
- ``cli/_console.py``, ``cli/_docker.py`` (helper hoists).
- ``cli/config_loader.py`` always-on OmegaConf interpolation.
- ``Behavior1KBenchmark.task_instance_id`` per-episode sweep.
- Demo-replay per-(session, episode) cursor + ``on_episode_start`` fail-loud hook.
- ``Behavior1KBenchmark.get_metadata`` declaring ``action_dim=23``.
- README "Build-locally images" caption + 🔒 marker on rlbench/behavior1k rows; CONTRIBUTING benchmark roster refresh.
1 parent 25eae2f commit 51bf04d

7 files changed

Lines changed: 30 additions & 309 deletions


.claude/skills/add-benchmark/SKILL.md

Lines changed: 0 additions & 29 deletions
@@ -121,35 +121,6 @@ class MyBenchmark(StepBenchmark):
 - **Image preprocessing**: Handle non-standard images (flipped, wrong resolution) in `make_obs()`.
 - **EGL headless rendering**: Add `os.environ.setdefault("PYOPENGL_PLATFORM", "egl")` at module top if the sim uses OpenGL.
 
-### Optional: external dataset declaration
-
-If the benchmark's dataset is licensed independently and shouldn't be baked into the docker image, override `data_requirements()` (classmethod) so the harness's uniform fetch path picks it up:
-
-```python
-from vla_eval.benchmarks.base import DataRequirement
-
-class MyBenchmark(StepBenchmark):
-    @classmethod
-    def data_requirements(cls) -> DataRequirement:
-        return DataRequirement(
-            license_id="my-dataset-tos",              # --accept-license <id>
-            license_url="https://example.com/license",
-            cache_key="my_bench",                     # host cache subdir name
-            container_data_path="/app/data",          # mount target inside the image
-            marker="dataset_ready_marker",            # file/dir whose presence skips refetch
-            download_command=("python", "-c", "<download script>"),
-        )
-```
-
-Users then run `vla-eval data fetch -c configs/<name>_eval.yaml --accept-license <license_id>` once. The fetcher mounts `${VLA_EVAL_DATA_DIR:-~/.cache/vla-eval}/<cache_key>` read-write at `container_data_path` and runs `download_command`. The eval config's `volumes:` entry should mount the same host path read-only via OmegaConf interpolation:
-
-```yaml
-volumes:
-  - "${oc.env:VLA_EVAL_DATA_DIR,${oc.env:HOME}/.cache/vla-eval}/<cache_key>:<container_data_path>:ro"
-```
-
-Reference: `Behavior1KBenchmark.data_requirements()` in `benchmarks/behavior1k/benchmark.py`.
-
 ## 3. Create config YAML
 
 Create `configs/<name>_eval.yaml`:
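For reference, the host-side path the removed fetcher resolved, `${VLA_EVAL_DATA_DIR:-~/.cache/vla-eval}/<cache_key>`, maps to Python roughly as follows (`resolve_cache_dir` is a hypothetical helper name, not a harness function):

```python
# Hypothetical helper mirroring the removed fetcher's path resolution:
# VLA_EVAL_DATA_DIR when set, otherwise ~/.cache/vla-eval.
import os
from pathlib import Path


def resolve_cache_dir(cache_key: str) -> Path:
    root = os.environ.get("VLA_EVAL_DATA_DIR")
    if not root:
        root = str(Path.home() / ".cache" / "vla-eval")
    return Path(root) / cache_key
```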

configs/behavior1k_eval.yaml

Lines changed: 9 additions & 12 deletions
@@ -1,11 +1,9 @@
 # BEHAVIOR-1K (OmniGibson / Isaac Sim) — 50-task household-activity suite.
 #
-# Run ``vla-eval data fetch -c configs/behavior1k_eval.yaml
-#   --accept-license behavior-dataset-tos`` once before evaluating to
-# populate the dataset cache. The default cache lives at
-# ``$VLA_EVAL_DATA_DIR/behavior1k`` (or ``~/.cache/vla-eval/behavior1k``
-# when the env var is unset); set ``VLA_EVAL_DATA_DIR`` to redirect to
-# a faster disk. An NVIDIA GPU with Vulkan + EGL is required.
+# Before running, edit the dataset volume below to point at your local
+# BEHAVIOR-1K data directory (the one populated by the three
+# ``download_*`` calls documented in docs/reproductions/behavior1k.md).
+# An NVIDIA GPU with Vulkan + EGL is required.
 server:
   url: "ws://localhost:8000"
 
@@ -22,12 +20,11 @@ docker:
   # GPU" error and a segfault deep in omni.kit.xr on first launch.
   - "VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json"
   volumes:
-    # OmniGibson reads ``gm.DATA_PATH=/app/BEHAVIOR-1K/datasets`` at
-    # import time. The host path resolves via OmegaConf:
-    # ``${VLA_EVAL_DATA_DIR}/behavior1k`` if set, else
-    # ``${HOME}/.cache/vla-eval/behavior1k``. This is the same layout
-    # ``vla-eval data fetch`` writes to.
-    - "${oc.env:VLA_EVAL_DATA_DIR,${oc.env:HOME}/.cache/vla-eval}/behavior1k:/app/BEHAVIOR-1K/datasets:ro"
+    # OmniGibson reads gm.DATA_PATH=/app/BEHAVIOR-1K/datasets at import time.
+    # Replace the host side with the directory holding
+    # ``omnigibson-robot-assets/``, ``behavior-1k-assets/``, and
+    # ``2025-challenge-task-instances/``.
+    - "/data/og_data:/app/BEHAVIOR-1K/datasets:ro"
 
 output_dir: "./results"
 
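The volume strings in this config follow docker's `host:container:mode` convention. An illustrative parser (not a harness function; it assumes no colons inside the paths themselves):

```python
# Illustrative parser for docker-style "host:container[:mode]" volume
# strings; mode defaults to "rw" as in docker itself.
def parse_volume(spec: str) -> tuple[str, str, str]:
    parts = spec.split(":")
    if len(parts) == 2:
        return parts[0], parts[1], "rw"
    host, container, mode = parts
    return host, container, mode
```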

docs/reproductions/behavior1k.md

Lines changed: 20 additions & 14 deletions
@@ -110,23 +110,32 @@ test set.
 - **Max steps:** 5000 default (or 2× human demo length when configured;
   see `learning/eval.py` for the dataset-driven path).
 
-## How to Reproduce (zero-action baseline, 1 task, 2000 step cap)
+## How to Reproduce (zero-action baseline, 1 task, 100 steps)
 
 ```bash
 # 1. Build the image (heavy: ~17 min, 23.5 GB).
 #    The behavior1k Dockerfile is gated behind a licence opt-in
 #    (NVIDIA Omniverse EULA — https://docs.omniverse.nvidia.com/eula/).
 docker/build.sh behavior1k --accept-license behavior1k
 
-# 2. Download the dataset (~35 GiB) into the harness cache. This drives
-#    the official ``download_omnigibson_robot_assets`` /
-#    ``download_behavior_1k_assets`` / ``download_2025_challenge_task_instances``
-#    helpers inside the image and accepts the BEHAVIOR Dataset ToS. The
-#    cache lives at ``$VLA_EVAL_DATA_DIR/behavior1k`` (defaults to
-#    ``~/.cache/vla-eval/behavior1k``) — set ``VLA_EVAL_DATA_DIR`` to
-#    redirect to a faster disk before running.
-uv run vla-eval data fetch -c configs/behavior1k_eval.yaml \
-  --accept-license behavior-dataset-tos
+# 2. Download the dataset (~35 GiB). Mount-target inside the image
+#    is /app/BEHAVIOR-1K/datasets — that's where gm.DATA_PATH points.
+mkdir -p /path/to/og_data
+docker run --rm --gpus all \
+  -e OMNI_KIT_ACCEPT_EULA=YES \
+  -v /path/to/og_data:/app/BEHAVIOR-1K/datasets \
+  --entrypoint conda \
+  ghcr.io/allenai/vla-evaluation-harness/behavior1k:latest \
+  run --no-capture-output -n behavior python -c "
+from omnigibson.utils.asset_utils import (
+    download_omnigibson_robot_assets,
+    download_behavior_1k_assets,
+    download_2025_challenge_task_instances,
+)
+download_omnigibson_robot_assets()
+download_behavior_1k_assets(accept_license=True)
+download_2025_challenge_task_instances()
+"
 
 # 3. Start the zero-action baseline server.
 uv run --script src/vla_eval/model_servers/behavior1k_baseline.py \
@@ -140,10 +149,7 @@ uv run vla-eval run -c configs/behavior1k_eval.yaml \
   --gpus 0 --yes
 ```
 
-The eval config picks up the cache directory automatically (the
-``volumes`` entry resolves
-``${VLA_EVAL_DATA_DIR}/behavior1k`` with a fallback to
-``${HOME}/.cache/vla-eval/behavior1k``); no per-host edits required.
+Edit `configs/behavior1k_eval.yaml` `volumes` to point at your dataset path.
 
 ## What Trained-VLA Reproduction Still Needs
 

src/vla_eval/benchmarks/base.py

Lines changed: 0 additions & 28 deletions
@@ -33,26 +33,6 @@ class StepResult:
     info: dict[str, Any]
 
 
-@dataclass(frozen=True)
-class DataRequirement:
-    """Declares a benchmark's externally-licensed dataset.
-
-    The CLI uses this to drive ``vla-eval data fetch``: it mounts
-    ``${VLA_EVAL_DATA_DIR:-~/.cache/vla-eval}/<cache_key>`` at
-    ``container_data_path`` (read-write) and runs ``download_command``.
-    ``marker`` is a host-relative path the download produces last; its
-    presence short-circuits re-fetches. ``license_id`` is the
-    user-facing kebab-case token compared against ``--accept-license``.
-    """
-
-    license_id: str
-    license_url: str
-    cache_key: str
-    container_data_path: str
-    marker: str
-    download_command: tuple[str, ...]
-
-
 # ---------------------------------------------------------------------------
 # Async Benchmark ABC (parent)
 # ---------------------------------------------------------------------------
@@ -152,14 +132,6 @@ def get_metadata(self) -> dict[str, Any]:
         """Return benchmark defaults and metadata. Optional override."""
         return {}
 
-    @classmethod
-    def data_requirements(cls) -> DataRequirement | None:
-        """Optional: declare an external dataset for ``vla-eval data fetch``.
-
-        Default ``None`` — most benchmarks bundle data in the docker image.
-        """
-        return None
-
     def cleanup(self) -> None:
         """Release benchmark resources (environments, renderers, etc.). Optional override."""

src/vla_eval/benchmarks/behavior1k/benchmark.py

Lines changed: 1 addition & 37 deletions
@@ -32,7 +32,7 @@
 import numpy as np
 from anyio.to_thread import run_sync as _run_in_thread
 
-from vla_eval.benchmarks.base import DataRequirement, StepBenchmark, StepResult
+from vla_eval.benchmarks.base import StepBenchmark, StepResult
 from vla_eval.specs import IMAGE_RGB, LANGUAGE, RAW, DimSpec
 from vla_eval.types import Action, EpisodeResult, Observation, Task
 
@@ -213,42 +213,6 @@ def __init__(
         self._current_task_name: str | None = None
         self._available_tasks: dict[str, Any] | None = None
 
-    # ------------------------------------------------------------------
-    # Data fetch
-    # ------------------------------------------------------------------
-
-    @classmethod
-    def data_requirements(cls) -> DataRequirement:
-        # The download_* helpers are idempotent (no-op when files exist);
-        # the 2025-challenge task instances are written last, so its
-        # presence implies the prior two completed.
-        download_script = (
-            "from omnigibson.utils.asset_utils import ("
-            "download_omnigibson_robot_assets, "
-            "download_behavior_1k_assets, "
-            "download_2025_challenge_task_instances); "
-            "download_omnigibson_robot_assets(); "
-            "download_behavior_1k_assets(accept_license=True); "
-            "download_2025_challenge_task_instances()"
-        )
-        return DataRequirement(
-            license_id="behavior-dataset-tos",
-            license_url="https://behavior.stanford.edu/dataset",
-            cache_key="behavior1k",
-            container_data_path="/app/BEHAVIOR-1K/datasets",
-            marker="2025-challenge-task-instances",
-            download_command=(
-                "conda",
-                "run",
-                "--no-capture-output",
-                "-n",
-                "behavior",
-                "python",
-                "-c",
-                download_script,
-            ),
-        )
-
     # ------------------------------------------------------------------
     # Lazy initialization
     # ------------------------------------------------------------------