Commit 29943b1
feat(cli): add vla-eval data fetch for external datasets
Some benchmarks ship the simulator in the docker image but expect the
dataset to come from a separate, licence-restricted source (BEHAVIOR-1K
is the first concrete consumer — its dataset is governed by the BEHAVIOR
Dataset ToS and OmniGibson assets). Previously, users had to manually
run a docker invocation that wired ``download_*`` helpers inside the
image and then edit ``configs/behavior1k_eval.yaml`` so the volume
pointed at their host download directory: two awkward steps that don't
appear anywhere a first-time reader would naturally look.

This change introduces a uniform mechanism the harness can use for any
benchmark with the same shape:

- ``Benchmark.data_requirements()`` (classmethod, base default returns
  ``None``) declares a :class:`DataRequirement`: licence id + URL, the
  in-container data path, a marker file, and the argv to run inside the
  image. ``Behavior1KBenchmark`` returns the BEHAVIOR Dataset ToS
  opt-in plus the canonical ``download_*`` helper script.
- New ``vla-eval data fetch -c <config> --accept-license <id>`` CLI:
  resolves the benchmark class, mounts a host cache directory
  read-write at the container's data path, and runs the declared
  download command. Idempotent (skips when the marker is already
  present). Symmetric with ``docker/build.sh --accept-license``, so the
  opt-in surface is the same shape across build and fetch.
- Default cache path: ``${VLA_EVAL_DATA_DIR}/<benchmark>`` if the env
  var is set, else ``${HOME}/.cache/vla-eval/<benchmark>``.
  ``--data-dir`` overrides explicitly. ``configs/behavior1k_eval.yaml``
  resolves the same expression in its ``volumes:`` line via OmegaConf,
  so a fresh clone + ``vla-eval data fetch`` + ``vla-eval run`` works
  without any per-host config edits.
- ``cli/config_loader.py`` now always runs ``OmegaConf.to_container``
  with ``resolve=True`` (previously only on configs that used
  ``extends:``), so ``${oc.env:VAR,default}`` interpolations are
  honoured uniformly. A no-op for configs without interpolations.
- ``docs/reproductions/behavior1k.md`` replaces the manual docker
  download incantation with the new ``vla-eval data fetch`` step and
  notes the auto-resolved cache path.
1 parent 407d4f8 commit 29943b1
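
The ``cli/config_loader.py`` diff is not among the files shown below,
but the behaviour it describes is small. A minimal sketch, with the
loader body assumed (only the ``load_config`` name and the
``to_container(..., resolve=True)`` call are attested by this commit):

```python
# A sketch, not the committed code: only the load_config name and the
# to_container(..., resolve=True) call are attested by this commit.
from omegaconf import OmegaConf


def load_config(path: str) -> dict:
    cfg = OmegaConf.load(path)  # parse YAML (any ``extends:`` merging omitted here)
    # resolve=True now runs unconditionally, so ${oc.env:VAR,default}
    # interpolations are expanded for every config, not just extended ones.
    return OmegaConf.to_container(cfg, resolve=True)
```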

7 files changed: 336 additions & 44 deletions

configs/behavior1k_eval.yaml

Lines changed: 12 additions & 9 deletions
@@ -1,9 +1,11 @@
 # BEHAVIOR-1K (OmniGibson / Isaac Sim) — 50-task household-activity suite.
 #
-# Before running, edit the dataset volume below to point at your local
-# BEHAVIOR-1K data directory (the one populated by the three
-# ``download_*`` calls documented in docs/reproductions/behavior1k.md).
-# An NVIDIA GPU with Vulkan + EGL is required.
+# Run ``vla-eval data fetch -c configs/behavior1k_eval.yaml
+# --accept-license behavior-dataset-tos`` once before evaluating to
+# populate the dataset cache. The default cache lives at
+# ``$VLA_EVAL_DATA_DIR/behavior1k`` (or ``~/.cache/vla-eval/behavior1k``
+# when the env var is unset); set ``VLA_EVAL_DATA_DIR`` to redirect to
+# a faster disk. An NVIDIA GPU with Vulkan + EGL is required.
 server:
   url: "ws://localhost:8000"

@@ -20,11 +22,12 @@ docker:
     # GPU" error and a segfault deep in omni.kit.xr on first launch.
     - "VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json"
   volumes:
-    # OmniGibson reads gm.DATA_PATH=/app/BEHAVIOR-1K/datasets at import time.
-    # Replace the host side with the directory holding
-    # ``omnigibson-robot-assets/``, ``behavior-1k-assets/``, and
-    # ``2025-challenge-task-instances/``.
-    - "/data/og_data:/app/BEHAVIOR-1K/datasets:ro"
+    # OmniGibson reads ``gm.DATA_PATH=/app/BEHAVIOR-1K/datasets`` at
+    # import time. The host path resolves via OmegaConf:
+    # ``${VLA_EVAL_DATA_DIR}/behavior1k`` if set, else
+    # ``${HOME}/.cache/vla-eval/behavior1k``. This is the same layout
+    # ``vla-eval data fetch`` writes to.
+    - "${oc.env:VLA_EVAL_DATA_DIR,${oc.env:HOME}/.cache/vla-eval}/behavior1k:/app/BEHAVIOR-1K/datasets:ro"

 output_dir: "./results"
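
For reference, a minimal sketch (assuming ``omegaconf>=2.1``, which
supports defaults and nested interpolations in ``oc.env``) of how that
``volumes:`` host path resolves when ``VLA_EVAL_DATA_DIR`` is unset:

```python
# Minimal sketch, not harness code: shows the nested oc.env fallback resolving.
import os

from omegaconf import OmegaConf

os.environ.pop("VLA_EVAL_DATA_DIR", None)  # simulate the override being unset
cfg = OmegaConf.create({
    "volumes": [
        "${oc.env:VLA_EVAL_DATA_DIR,${oc.env:HOME}/.cache/vla-eval}"
        "/behavior1k:/app/BEHAVIOR-1K/datasets:ro"
    ]
})
print(OmegaConf.to_container(cfg, resolve=True)["volumes"][0])
# e.g. /home/me/.cache/vla-eval/behavior1k:/app/BEHAVIOR-1K/datasets:ro
```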

docs/reproductions/behavior1k.md

Lines changed: 13 additions & 19 deletions
@@ -118,24 +118,15 @@ test set.
 # (NVIDIA Omniverse EULA — https://docs.omniverse.nvidia.com/eula/).
 docker/build.sh behavior1k --accept-license behavior1k

-# 2. Download the dataset (~35 GiB). Mount-target inside the image
-#    is /app/BEHAVIOR-1K/datasets — that's where gm.DATA_PATH points.
-mkdir -p /path/to/og_data
-docker run --rm --gpus all \
-  -e OMNI_KIT_ACCEPT_EULA=YES \
-  -v /path/to/og_data:/app/BEHAVIOR-1K/datasets \
-  --entrypoint conda \
-  ghcr.io/allenai/vla-evaluation-harness/behavior1k:latest \
-  run --no-capture-output -n behavior python -c "
-from omnigibson.utils.asset_utils import (
-    download_omnigibson_robot_assets,
-    download_behavior_1k_assets,
-    download_2025_challenge_task_instances,
-)
-download_omnigibson_robot_assets()
-download_behavior_1k_assets(accept_license=True)
-download_2025_challenge_task_instances()
-"
+# 2. Download the dataset (~35 GiB) into the harness cache. This drives
+#    the official ``download_omnigibson_robot_assets`` /
+#    ``download_behavior_1k_assets`` / ``download_2025_challenge_task_instances``
+#    helpers inside the image and accepts the BEHAVIOR Dataset ToS. The
+#    cache lives at ``$VLA_EVAL_DATA_DIR/behavior1k`` (defaults to
+#    ``~/.cache/vla-eval/behavior1k``) — set ``VLA_EVAL_DATA_DIR`` to
+#    redirect to a faster disk before running.
+uv run vla-eval data fetch -c configs/behavior1k_eval.yaml \
+  --accept-license behavior-dataset-tos

 # 3. Start the zero-action baseline server.
 uv run --script src/vla_eval/model_servers/behavior1k_baseline.py \
@@ -149,7 +140,10 @@ uv run vla-eval run -c configs/behavior1k_eval.yaml \
   --gpus 0 --yes
 ```

-Edit `configs/behavior1k_eval.yaml` `volumes` to point at your dataset path.
+The eval config picks up the cache directory automatically (the
+``volumes`` entry resolves
+``${VLA_EVAL_DATA_DIR}/behavior1k`` with a fallback to
+``${HOME}/.cache/vla-eval/behavior1k``); no per-host edits required.

 ## What Trained-VLA Reproduction Still Needs

src/vla_eval/benchmarks/base.py

Lines changed: 48 additions & 0 deletions
@@ -33,6 +33,41 @@ class StepResult:
     info: dict[str, Any]


+@dataclass(frozen=True)
+class DataRequirement:
+    """External-data requirement that can't be redistributed in the image.
+
+    Benchmarks whose dataset is licensed independently of the harness
+    (e.g. BEHAVIOR-1K's BEHAVIOR Dataset ToS) declare a
+    ``DataRequirement`` from their ``data_requirements()`` classmethod
+    so the CLI can drive a uniform fetch flow.
+
+    Fields:
+        license_id: Token a user passes to ``--accept-license`` to
+            opt in. Should be lower-kebab-case and stable
+            (e.g. ``"behavior-dataset-tos"``).
+        license_url: Where to read the licence terms.
+        container_data_path: Path inside the docker image where the
+            data must be mounted. Used as the mount target for both
+            ``vla-eval data fetch`` (read-write) and ``vla-eval run``
+            (read-only).
+        marker: Path relative to the *host* data directory that, once
+            present, signals the dataset is fetched. Used for the
+            "already-fetched" short-circuit. Pick something that the
+            download command produces last (a final asset directory,
+            for instance).
+        download_command: argv that the docker container will run, with
+            ``container_data_path`` mounted read-write, to populate
+            the dataset.
+    """
+
+    license_id: str
+    license_url: str
+    container_data_path: str
+    marker: str
+    download_command: tuple[str, ...]
+
+
 # ---------------------------------------------------------------------------
 # Async Benchmark ABC (parent)
 # ---------------------------------------------------------------------------
@@ -132,6 +167,19 @@ def get_metadata(self) -> dict[str, Any]:
         """Return benchmark defaults and metadata. Optional override."""
         return {}

+    @classmethod
+    def data_requirements(cls) -> DataRequirement | None:
+        """Declare an external-data dependency that the harness can fetch.
+
+        Most benchmarks bundle their data inside the docker image and
+        return ``None`` (the default). Benchmarks whose dataset is
+        licensed independently of the harness (e.g. BEHAVIOR-1K)
+        return a populated :class:`DataRequirement` so
+        ``vla-eval data fetch -c <config>`` can drive a uniform
+        download flow.
+        """
+        return None
+
     def cleanup(self) -> None:
         """Release benchmark resources (environments, renderers, etc.). Optional override."""
src/vla_eval/benchmarks/behavior1k/benchmark.py

Lines changed: 43 additions & 1 deletion
@@ -32,7 +32,7 @@
 import numpy as np
 from anyio.to_thread import run_sync as _run_in_thread

-from vla_eval.benchmarks.base import StepBenchmark, StepResult
+from vla_eval.benchmarks.base import DataRequirement, StepBenchmark, StepResult
 from vla_eval.specs import IMAGE_RGB, LANGUAGE, RAW, DimSpec
 from vla_eval.types import Action, EpisodeResult, Observation, Task

@@ -213,6 +213,48 @@ def __init__(
         self._current_task_name: str | None = None
         self._available_tasks: dict[str, Any] | None = None

+    # ------------------------------------------------------------------
+    # Data fetch
+    # ------------------------------------------------------------------
+
+    @classmethod
+    def data_requirements(cls) -> DataRequirement:
+        """Declare the BEHAVIOR Dataset / OmniGibson-asset download.
+
+        These are the three canonical helpers the upstream
+        ``OmniGibson`` README points users at; they're idempotent
+        (skip when files already exist) so re-running ``data fetch``
+        on a populated directory is cheap.
+        """
+        download_script = (
+            "from omnigibson.utils.asset_utils import ("
+            "download_omnigibson_robot_assets, "
+            "download_behavior_1k_assets, "
+            "download_2025_challenge_task_instances); "
+            "download_omnigibson_robot_assets(); "
+            "download_behavior_1k_assets(accept_license=True); "
+            "download_2025_challenge_task_instances()"
+        )
+        return DataRequirement(
+            license_id="behavior-dataset-tos",
+            license_url="https://behavior.stanford.edu/dataset",
+            container_data_path="/app/BEHAVIOR-1K/datasets",
+            # The 2025-challenge task instances are downloaded last,
+            # so the directory's presence implies the prior two
+            # download_* calls also completed.
+            marker="2025-challenge-task-instances",
+            download_command=(
+                "conda",
+                "run",
+                "--no-capture-output",
+                "-n",
+                "behavior",
+                "python",
+                "-c",
+                download_script,
+            ),
+        )
+
     # ------------------------------------------------------------------
     # Lazy initialization
     # ------------------------------------------------------------------
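
Given this declaration, the fetch command reduces to a plain
``docker run`` (see ``_build_docker_argv`` in ``cmd_data.py`` below). A
sketch of the resulting argv, with an illustrative host cache path and
assuming the config's ``docker.env`` carries ``OMNI_KIT_ACCEPT_EULA=YES``
as the old manual command did:

```python
# Illustrative argv only: the real list is assembled by _build_docker_argv in
# src/vla_eval/cli/cmd_data.py. Host path and env pair are examples, not
# read from this diff's config.
argv = [
    "docker", "run", "--rm",
    "--gpus", "all",                   # --gpus flag, else docker.gpus, else "all"
    "-e", "OMNI_KIT_ACCEPT_EULA=YES",  # forwarded from the config's docker.env
    "-v", "/home/me/.cache/vla-eval/behavior1k:/app/BEHAVIOR-1K/datasets",
    "ghcr.io/allenai/vla-evaluation-harness/behavior1k:latest",
    "conda", "run", "--no-capture-output", "-n", "behavior",
    "python", "-c", "<download_script above>",
]
```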

src/vla_eval/cli/cmd_data.py

Lines changed: 199 additions & 0 deletions
@@ -0,0 +1,199 @@
+"""``vla-eval data`` subcommand handlers.
+
+Provides a uniform fetch flow for benchmarks whose dataset is licensed
+independently of the harness (e.g. BEHAVIOR-1K's BEHAVIOR Dataset
+ToS). See :class:`vla_eval.benchmarks.base.DataRequirement` and
+:meth:`vla_eval.benchmarks.base.Benchmark.data_requirements`.
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import shutil
+import subprocess
+import sys
+from pathlib import Path
+
+from vla_eval.benchmarks.base import Benchmark, DataRequirement
+from vla_eval.cli.config_loader import load_config as _load_config
+from vla_eval.config import DockerConfig
+from vla_eval.registry import resolve_import_string
+
+
+def _stderr_console():  # pragma: no cover — same shim cmd_run uses
+    from rich.console import Console
+
+    return Console(stderr=True, soft_wrap=True)
+
+
+def _resolve_benchmark_class(config: dict) -> tuple[type[Benchmark], str]:
+    """Return ``(class, cache_subdir)`` for the first benchmark in config.
+
+    ``cache_subdir`` is the module-path's last package segment, e.g.
+    ``vla_eval.benchmarks.behavior1k.benchmark:X`` → ``behavior1k``.
+    """
+    benchmarks = config.get("benchmarks") or []
+    if not benchmarks:
+        raise ValueError("config has no 'benchmarks' entries")
+    import_string = benchmarks[0].get("benchmark")
+    if not import_string:
+        raise ValueError("first benchmark entry is missing 'benchmark' import string")
+    cls = resolve_import_string(import_string)
+    if not (isinstance(cls, type) and issubclass(cls, Benchmark)):
+        raise TypeError(f"resolved {import_string} to {cls!r}, which is not a Benchmark subclass")
+    module_path = import_string.split(":", 1)[0]
+    parts = module_path.split(".")
+    # Expect …benchmarks.<key>.benchmark — take the second-to-last part.
+    cache_subdir = parts[-2] if len(parts) >= 2 else parts[-1]
+    return cls, cache_subdir
+
+
+def _default_host_data_dir(cache_subdir: str) -> Path:
+    """Return ``${VLA_EVAL_DATA_DIR}/<cache_subdir>`` or the XDG-style default."""
+    base = os.environ.get("VLA_EVAL_DATA_DIR")
+    if base:
+        return Path(base).expanduser() / cache_subdir
+    return Path.home() / ".cache" / "vla-eval" / cache_subdir
+
+
+def _build_docker_argv(
+    image: str,
+    docker_cfg: DockerConfig,
+    host_dir: Path,
+    requirement: DataRequirement,
+    extra_gpus: str | None,
+) -> list[str]:
+    """Build the ``docker run`` argv that downloads the dataset."""
+    argv: list[str] = ["docker", "run", "--rm"]
+    gpus = extra_gpus or docker_cfg.gpus or "all"
+    argv.extend(["--gpus", gpus])
+    for env_pair in docker_cfg.env:
+        argv.extend(["-e", env_pair])
+    # Mounted read-write (docker's default) so the download can populate it.
+    argv.extend(["-v", f"{host_dir}:{requirement.container_data_path}"])
+    argv.append(image)
+    argv.extend(requirement.download_command)
+    return argv
+
+
+def cmd_data_fetch(args: argparse.Namespace) -> None:
+    """Fetch the external dataset for a benchmark, mounted at the
+    canonical host cache directory."""
+    con = _stderr_console()
+    config = _load_config(args.config)
+
+    try:
+        bench_cls, cache_subdir = _resolve_benchmark_class(config)
+    except (TypeError, ValueError) as exc:
+        con.print(f"[red]ERROR: {exc}[/red]")
+        sys.exit(1)
+
+    requirement = bench_cls.data_requirements()
+    if requirement is None:
+        con.print(f"[yellow]{bench_cls.__name__} declares no external data requirement; nothing to fetch.[/yellow]")
+        return
+
+    accepted = set(args.accept_license or [])
+    if requirement.license_id not in accepted:
+        con.print(
+            f"[red]ERROR: this dataset requires accepting licence '{requirement.license_id}'.[/red]\n"
+            f"  Read:   {requirement.license_url}\n"
+            f"  Re-run: vla-eval data fetch -c {args.config} --accept-license {requirement.license_id}"
+        )
+        sys.exit(1)
+
+    host_dir = Path(args.data_dir).expanduser().resolve() if args.data_dir else _default_host_data_dir(cache_subdir)
+    host_dir.mkdir(parents=True, exist_ok=True)
+
+    marker = host_dir / requirement.marker
+    if marker.exists() and not args.force:
+        con.print(
+            f"[green]Data already present at {host_dir} (marker: {requirement.marker}). "
+            "Use --force to refetch.[/green]"
+        )
+        return
+
+    docker_cfg = DockerConfig.from_dict(config.get("docker"))
+    if not docker_cfg.image:
+        con.print("[red]ERROR: 'docker.image' must be set in the config to fetch data[/red]")
+        sys.exit(1)
+    if shutil.which("docker") is None:
+        con.print("[red]ERROR: 'docker' not found on PATH[/red]")
+        sys.exit(1)
+
+    argv = _build_docker_argv(
+        docker_cfg.image,
+        docker_cfg,
+        host_dir,
+        requirement,
+        extra_gpus=getattr(args, "gpus", None),
+    )
+
+    con.print(f"[bold]Fetching data → {host_dir}[/bold]")
+    con.print(f"  image: {docker_cfg.image}")
+    con.print(f"  mount: {host_dir}:{requirement.container_data_path}")
+    if args.dry_run:
+        con.print("  [yellow]--dry-run[/yellow]: would run:")
+        con.print(f"    {' '.join(argv)}")
+        return
+
+    completed = subprocess.run(argv, check=False)
+    if completed.returncode != 0:
+        con.print(f"[red]ERROR: docker run exited with {completed.returncode}[/red]")
+        sys.exit(completed.returncode)
+    con.print(f"[green]Done. Dataset available at {host_dir}.[/green]")
+
+
+def register(subparsers: argparse._SubParsersAction) -> None:
+    """Wire ``data fetch`` into the top-level ``vla-eval`` parser."""
+    data_parser = subparsers.add_parser(
+        "data",
+        help="Manage external benchmark datasets",
+        description=(
+            "Fetch external datasets that aren't redistributable in the docker image. "
+            "Each benchmark's data requirements are declared in its Benchmark class via "
+            "data_requirements(); see vla_eval.benchmarks.base.DataRequirement."
+        ),
+    )
+    data_sub = data_parser.add_subparsers(dest="data_command", required=True)
+
+    fetch_parser = data_sub.add_parser(
+        "fetch",
+        help="Download a benchmark's external data into the local cache",
+        description=(
+            "Resolves the benchmark class from the config, then runs its download command "
+            "inside the benchmark's docker image with the host cache mounted "
+            "read-write at the container's data path. Idempotent: skips if the "
+            "marker file already exists."
+        ),
+    )
+    fetch_parser.add_argument("--config", "-c", required=True, help="Path to a benchmark eval config YAML.")
+    fetch_parser.add_argument(
+        "--accept-license",
+        action="append",
+        default=[],
+        metavar="ID",
+        help="License ID to opt into (e.g. 'behavior-dataset-tos'). Repeatable.",
+    )
+    fetch_parser.add_argument(
+        "--data-dir",
+        default=None,
+        help="Override host data directory. Defaults to "
+        "${VLA_EVAL_DATA_DIR}/<benchmark> or ~/.cache/vla-eval/<benchmark>.",
+    )
+    fetch_parser.add_argument(
+        "--gpus",
+        default=None,
+        help="GPU devices for the fetch container (e.g. '0,1'). Defaults to docker.gpus or 'all'.",
+    )
+    fetch_parser.add_argument(
+        "--force",
+        action="store_true",
+        help="Re-run the download even if the marker file is already present.",
+    )
+    fetch_parser.add_argument(
+        "--dry-run",
+        action="store_true",
+        help="Print the docker command that would run and exit.",
+    )
+    fetch_parser.set_defaults(func=cmd_data_fetch)
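
The top-level parser wiring is not part of the diff shown here; a
minimal sketch of how ``register()`` would plug into a standard
``args.func`` dispatch loop (entry-point shape assumed):

```python
# Hypothetical entry point: the real vla-eval main() is not in this diff.
import argparse

from vla_eval.cli import cmd_data  # the module added by this commit


def main() -> None:
    parser = argparse.ArgumentParser(prog="vla-eval")
    subparsers = parser.add_subparsers(dest="command", required=True)
    cmd_data.register(subparsers)  # wires `vla-eval data fetch ...`
    # ...other subcommands (run, etc.) would register here the same way...
    args = parser.parse_args()
    args.func(args)  # each subparser sets func via set_defaults(...)


if __name__ == "__main__":
    main()
```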
