
Commit 3da4dd8

Merge pull request #2 from worv-ai/pr/sync-upstream
feat: unified smoke test CLI, Docker dev mode, rich output, and bug fixes
2 parents 5ea6e51 + 1ae38bd commit 3da4dd8

36 files changed

Lines changed: 2245 additions & 1123 deletions
Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
# Skill: add-benchmark

Add a new simulation benchmark to the VLA evaluation harness.

## Trigger

User asks to add/create/integrate a new benchmark (e.g. "add ManiSkill3 benchmark", "integrate OmniGibson").

## Steps

### 1. Gather Requirements

Ask the user (if not already provided):
- **Benchmark name** (e.g. `maniskill3`)
- **Simulation framework** (e.g. MuJoCo, SAPIEN, PyBullet, Isaac Sim)
- **Key dependencies** (pip packages needed inside Docker)
- **Observation format** (which cameras, image resolution, whether to include proprioceptive state)
- **Action space** (dimension, format — e.g. 7-DoF delta EEF + gripper)
- **Success condition** (how to detect task completion)
- **Max steps per episode** (whether fixed or per-task)

### 2. Create Benchmark Module

Create `src/vla_eval/benchmarks/<name>/`:

```
src/vla_eval/benchmarks/<name>/
├── __init__.py     # empty
├── benchmark.py    # main implementation
└── utils.py        # optional helpers
```

**`benchmark.py`** must subclass `Benchmark` from `vla_eval.benchmarks.base` and implement **6 required methods**:

```python
from typing import Any

from vla_eval.benchmarks.base import Benchmark, StepResult


class MyBenchmark(Benchmark):
    def __init__(self, **kwargs):
        # Accept benchmark-specific params from config YAML `params:` section.
        # Lazily import heavy deps (MuJoCo, SAPIEN, etc.) — NOT at module level.
        ...

    def get_tasks(self) -> list[dict[str, Any]]:
        # Return list of task dicts. Each MUST have a "name" key.
        # May include "suite" for task filtering.
        ...

    def reset(self, task: dict[str, Any]) -> tuple[Any, dict[str, Any]]:
        # Reset env for task. Returns (env_handle, initial_obs_dict).
        # env_handle is opaque — passed back to step().
        # obs_dict should be the output of make_obs().
        # task dict has "episode_idx" (int) injected by orchestrator.
        ...

    def step(self, env: Any, action: dict[str, Any]) -> StepResult:
        # action dict has "actions" key (np.ndarray from model server).
        # Return StepResult(obs, reward, done, info).
        ...

    def make_obs(self, raw_obs: Any, task: dict[str, Any]) -> dict[str, Any]:
        # Convert raw env observation to dict for model server.
        # Convention: {"images": {"cam_name": np.ndarray HWC uint8},
        #              "task_description": str}
        # Optionally add "states": np.ndarray for proprioception.
        ...

    def is_done(self, step_result: StepResult) -> bool:
        # Return True to end the episode.
        ...

    def get_result(self, step_result: StepResult) -> dict[str, Any]:
        # Must return at least {"success": bool}.
        ...

    def get_metadata(self) -> dict[str, Any]:
        # Optional. Return {"max_steps": N} for benchmark default.
        ...
```

### Key Patterns (from existing implementations)

- **Lazy imports**: Put heavy sim imports (torch, robosuite, sapien) inside methods, not at module top. This allows the registry to resolve the class without loading the sim.
- **Env reuse**: LIBERO reuses env across episodes of the same task. SimplerEnv creates a fresh env per episode. Choose based on the sim's reset semantics.
- **Action processing**: Model servers output raw continuous actions. The benchmark must convert to sim-specific format (e.g. discretize gripper, convert euler→axis-angle).
- **Image preprocessing**: If the sim outputs non-standard images (flipped, wrong resolution), handle in `make_obs()`.
- **EGL headless rendering**: Set `os.environ.setdefault("PYOPENGL_PLATFORM", "egl")` at module top if the sim uses OpenGL.

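To make the action-processing and image-preprocessing patterns concrete, here is a minimal illustrative sketch (not code from the harness). It assumes a 7-DoF delta-EEF action laid out as xyz + euler RPY + gripper and uses `scipy` for the rotation conversion; adapt the names and thresholds to the actual sim.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def process_action(raw: np.ndarray) -> np.ndarray:
    """Convert xyz + euler RPY + continuous gripper into xyz + axis-angle + binary gripper."""
    delta_pos = raw[:3]
    axis_angle = Rotation.from_euler("xyz", raw[3:6]).as_rotvec()
    gripper = np.array([1.0 if raw[6] > 0.0 else -1.0])  # discretize the gripper command
    return np.concatenate([delta_pos, axis_angle, gripper]).astype(np.float32)


def preprocess_image(img: np.ndarray) -> np.ndarray:
    """Fix a sim that renders frames upside down before sending them to the model server."""
    return np.ascontiguousarray(img[::-1])  # vertical flip, still HWC uint8
```

Helpers like these would typically be called from `step()` and `make_obs()` respectively.
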
### 3. Create Config YAML

Create `configs/<name>_eval.yaml`:

```yaml
server:
  url: "ws://localhost:8000"

docker:
  image: <name>
  env: []      # e.g. ["NVIDIA_DRIVER_CAPABILITIES=all"] for Vulkan
  volumes: []  # e.g. ["/path/to/data:/data:ro"]

output_dir: "./results"

benchmarks:
  - benchmark: "vla_eval.benchmarks.<name>.benchmark:MyBenchmark"
    mode: sync
    episodes_per_task: 50
    params:
      # All keys here are passed as **kwargs to MyBenchmark.__init__()
      suite: default
      seed: 7
```

- `benchmark` field: full import string in `module.path:ClassName` format
- `params`: arbitrary dict passed to constructor — no schema enforcement
- `max_steps`: omit to use `get_metadata()["max_steps"]`, or set explicitly to override

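For illustration, the import string and `params` dict can be thought of as resolving like the sketch below (the harness's actual loader may differ):

```python
import importlib
from typing import Any


def load_benchmark_class(import_string: str) -> type:
    """Resolve a "module.path:ClassName" import string to the class it names."""
    module_path, class_name = import_string.split(":")
    return getattr(importlib.import_module(module_path), class_name)


def build_benchmark(entry: dict[str, Any]) -> Any:
    """Instantiate one `benchmarks:` entry; every `params` key becomes a keyword argument."""
    cls = load_benchmark_class(entry["benchmark"])
    return cls(**entry.get("params", {}))
```
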
### 4. Create Dockerfile

Create `docker/Dockerfile.<name>`:

```dockerfile
FROM <base_image>

# Install harness
WORKDIR /workspace
COPY pyproject.toml README.md ./
COPY src/ src/
ARG HARNESS_VERSION=0.0.0
ENV SETUPTOOLS_SCM_PRETEND_VERSION=${HARNESS_VERSION}
RUN pip install .

COPY configs/ configs/

ENTRYPOINT ["vla-eval"]
CMD ["run", "--config", "/workspace/configs/<name>_eval.yaml"]
```

### 5. Register in Build/Push Scripts

Add the new benchmark to the arrays in `docker/build.sh` and `docker/push.sh`:

- `BENCHMARKS=(... <name> ...)` in `docker/build.sh`
- `IMAGES=(... <name> ...)` in `docker/push.sh`

If the name contains underscores (e.g. `mikasa_robo`), the scripts automatically convert them to hyphens for the Docker image name (`mikasa-robo`).

### 6. Verify

1. Run `make check` — lint + format + type check
2. Run `make test` — ensure existing tests still pass
3. Run `vla-eval test --validate` — validate all config import strings (including the new one)
4. Run `vla-eval test -c configs/<name>_eval.yaml` — smoke-test the benchmark (requires Docker + the benchmark image; runs 1 episode with an EchoModelServer, no real model or GPU needed)

### Reference Implementations

- **LIBERO** (`benchmarks/libero/benchmark.py`): MuJoCo tabletop, env reuse, suite-specific max_steps, image flip preprocessing
- **SimplerEnv** (`benchmarks/simpler/benchmark.py`): SAPIEN+Vulkan, new env per episode, Euler→axis-angle action conversion
- **CALVIN** (`benchmarks/calvin/benchmark.py`): PyBullet, chained subtasks, delta actions, hardcoded normalization stats

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
# Skill: add-model-server

Add a new VLA model server to the evaluation harness.

## Trigger

User asks to add/integrate a new model (e.g. "add OpenVLA server", "integrate RT-2").

## Steps

### 1. Gather Requirements

Ask the user (if not already provided):
- **Model name** (e.g. `openvla`)
- **Framework/library** (e.g. HuggingFace Transformers, custom repo)
- **Python dependencies** (torch version, model-specific packages)
- **Checkpoint source** (HuggingFace Hub model ID or local path)
- **Action output format** (dimension, chunk_size, continuous vs discrete)
- **Input requirements** (single image vs multi-view, needs proprioceptive state?)

### 2. Create Model Server Script

Create `src/vla_eval/model_servers/<name>.py` as a **uv script** (standalone, inline deps).

The file MUST start with a PEP 723 inline script metadata block:

```python
# /// script
# requires-python = "~=3.11"
# dependencies = [
#     "vla-eval",
#     "<model-package>",
#     "torch>=2.0",
#     "transformers>=4.40,<5",
#     "pillow>=9.0",
#     "numpy>=1.24",
# ]
#
# [tool.uv.sources]
# vla-eval = { path = "../../.." }
# <model-package> = { git = "https://github.com/org/repo.git", branch = "main" }
# ///
```

Subclass `PredictModelServer` (most models) or `ModelServer` (advanced async):

```python
from typing import Any

import numpy as np

from vla_eval.model_servers.base import SessionContext
from vla_eval.model_servers.predict import PredictModelServer
from vla_eval.model_servers.serve import serve


class MyModelServer(PredictModelServer):
    def __init__(self, checkpoint: str, *, chunk_size: int = 1, action_ensemble: str = "newest", **kwargs):
        super().__init__(chunk_size=chunk_size, action_ensemble=action_ensemble, **kwargs)
        self.checkpoint = checkpoint
        self._model = None

    def _load_model(self) -> None:
        """Lazily load model on first predict() call."""
        if self._model is not None:
            return
        import torch
        # Load model here...
        self._model = ...

    def predict(self, obs: dict[str, Any], ctx: SessionContext) -> dict[str, Any]:
        """Single-observation inference. Blocking call.

        Args:
            obs: {"images": {"cam_name": np.ndarray HWC uint8},
                  "task_description": str,
                  "states": np.ndarray (optional)}
            ctx: Session context (session_id, episode_id, step, is_first)

        Returns:
            {"actions": np.ndarray} with shape:
            - (action_dim,) if chunk_size == 1
            - (chunk_size, action_dim) if chunk_size > 1
        """
        self._load_model()
        # Run inference...
        return {"actions": np.array(actions, dtype=np.float32)}
```

### Key Patterns (from existing implementations)

**PredictModelServer features (inherited automatically):**
- **Action chunking**: When `chunk_size > 1`, return a `(chunk_size, action_dim)` array. The framework auto-buffers and serves one action per step, re-inferring only when the buffer empties (see the sketch below).
- **Action ensemble**: `"newest"` (default), `"average"`, `"ema"` — blends overlapping chunks. Set via `action_ensemble=` in `__init__`.
- **Batched inference**: Override `predict_batch()` and set `max_batch_size > 1` for GPU-batched multi-shard eval.
- **Per-suite chunk_size**: Override `on_episode_start()` to set `self._session_chunk_sizes[ctx.session_id] = N` (see the CogACT example).
- **CI/LAAS**: Set `continuous_inference=True` for continuous inference mode (DRAFT).

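As a rough mental model of the auto-buffering behaviour, consider the conceptual sketch below (not the harness's implementation; ensemble modes such as `"average"`/`"ema"` additionally blend overlapping chunks, which is omitted here):

```python
from collections import deque
from typing import Callable

import numpy as np


class ChunkBuffer:
    """One inference call fills the buffer; each env step pops a single action."""

    def __init__(self) -> None:
        self._buffer: deque[np.ndarray] = deque()

    def next_action(self, infer: Callable[[], np.ndarray]) -> np.ndarray:
        if not self._buffer:                    # buffer empty -> re-run inference
            chunk = np.atleast_2d(infer())      # predict() returned (chunk_size, action_dim)
            self._buffer.extend(chunk)
        return self._buffer.popleft()           # serve one action per step
```
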
**Image handling:**
```python
from PIL import Image as PILImage

images = obs.get("images", {})
img_array = next(iter(images.values()))  # first camera
pil_image = PILImage.fromarray(img_array).convert("RGB")
```

**Task description:**
```python
text = obs.get("task_description", "")
```

**Lazy model loading**: Always use a `_load_model()` pattern. Do NOT load in `__init__`.

### 3. Add `if __name__ == "__main__"` Entry Point

The script must be runnable via `uv run`:

```python
# At the top of the script (alongside the other imports):
import argparse
import logging

logger = logging.getLogger(__name__)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="<Model> server (uv script)")
    parser.add_argument("--checkpoint", required=True, help="HF model ID or local path")
    parser.add_argument("--chunk_size", type=int, default=1)
    parser.add_argument("--action_ensemble", default="newest")
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--verbose", "-v", action="store_true")
    args = parser.parse_args()

    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(asctime)s %(levelname)-8s %(name)s: %(message)s",
    )

    server = MyModelServer(args.checkpoint)
    server.chunk_size = args.chunk_size
    server.action_ensemble = args.action_ensemble

    logger.info("Pre-loading model...")
    server._load_model()
    logger.info("Model ready, starting server on ws://%s:%d", args.host, args.port)
    serve(server, host=args.host, port=args.port)
```

### 4. Create Config YAML

Create `configs/model_servers/<name>.yaml`:

```yaml
# <Model Name> model server — <benchmark> checkpoint
# Weight: <HuggingFace model ID>
# Benchmark: <target benchmark>

script: "src/vla_eval/model_servers/<name>.py"
args:
  checkpoint: <org/model-id>
  chunk_size: 1
  port: 8000
```

The CLI runs this via `vla-eval serve --config configs/model_servers/<name>.yaml`, which translates to `uv run <script> --checkpoint <value> --chunk_size <value> --port <value>`.

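As an illustration of that mapping (using PyYAML here purely for demonstration; the real CLI may assemble the command differently):

```python
import yaml  # PyYAML, used here only for illustration

with open("configs/model_servers/<name>.yaml") as f:
    cfg = yaml.safe_load(f)

cmd = ["uv", "run", cfg["script"]]
for key, value in cfg.get("args", {}).items():
    cmd += [f"--{key}", str(value)]

print(" ".join(cmd))
# -> uv run src/vla_eval/model_servers/<name>.py --checkpoint <org/model-id> --chunk_size 1 --port 8000
```
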
### 5. Verify

1. Run `make check` — lint + format + type check
2. Run `make test` — ensure existing tests still pass
3. Suggest the user test with: `vla-eval test -c configs/model_servers/<name>.yaml`
   (starts the server, sends dummy observations from a StubBenchmark, and checks for a valid action response — requires `uv` + GPU + model weights)

### Reference Implementations

- **CogACT** (`model_servers/dexbotic/cogact.py`): Diffusion action head, chunk_size_map per suite, batched inference, text template option
- **starVLA** (`model_servers/starvla.py`): Auto-detecting framework, HuggingFace checkpoint download, monkey-patches for upstream compat

### Server Hierarchy

```
ModelServer (ABC)          ← Advanced: async on_observation()
└── PredictModelServer     ← Most models: blocking predict()
```

- Use `PredictModelServer` for standard request-response models (95% of cases)
- Use `ModelServer` only if you need async streaming or custom message handling

.dockerignore

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Ignore everything by default, then whitelist
*
!pyproject.toml
!README.md
!src/
!configs/
!docker/calvin_validation_data/
!docker/init_states/
!docker/*_entrypoint.sh
