
Commit 3da4dd8

Merge pull request #2 from worv-ai/pr/sync-upstream
feat: unified smoke test CLI, Docker dev mode, rich output, and bug fixes
2 parents 5ea6e51 + 1ae38bd commit 3da4dd8

36 files changed

Lines changed: 2245 additions & 1123 deletions
Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
# Skill: add-benchmark

Add a new simulation benchmark to the VLA evaluation harness.

## Trigger

User asks to add/create/integrate a new benchmark (e.g. "add ManiSkill3 benchmark", "integrate OmniGibson").

## Steps

### 1. Gather Requirements

Ask the user (if not already provided):
- **Benchmark name** (e.g. `maniskill3`)
- **Simulation framework** (e.g. MuJoCo, SAPIEN, PyBullet, Isaac Sim)
- **Key dependencies** (pip packages needed inside Docker)
- **Observation format** (which cameras, image resolution, whether to include proprioceptive state)
- **Action space** (dimension, format — e.g. 7-DoF delta EEF + gripper)
- **Success condition** (how to detect task completion)
- **Max steps per episode** (whether fixed or per-task)

### 2. Create Benchmark Module

Create `src/vla_eval/benchmarks/<name>/`:

```
src/vla_eval/benchmarks/<name>/
├── __init__.py     # empty
├── benchmark.py    # main implementation
└── utils.py        # optional helpers
```

**`benchmark.py`** must subclass `Benchmark` from `vla_eval.benchmarks.base` and implement **6 required methods**:

```python
from typing import Any

from vla_eval.benchmarks.base import Benchmark, StepResult


class MyBenchmark(Benchmark):
    def __init__(self, **kwargs):
        # Accept benchmark-specific params from config YAML `params:` section.
        # Lazily import heavy deps (MuJoCo, SAPIEN, etc.) — NOT at module level.
        ...

    def get_tasks(self) -> list[dict[str, Any]]:
        # Return list of task dicts. Each MUST have a "name" key.
        # May include "suite" for task filtering.
        ...

    def reset(self, task: dict[str, Any]) -> tuple[Any, dict[str, Any]]:
        # Reset env for task. Returns (env_handle, initial_obs_dict).
        # env_handle is opaque — passed back to step().
        # obs_dict should be the output of make_obs().
        # task dict has "episode_idx" (int) injected by orchestrator.
        ...

    def step(self, env: Any, action: dict[str, Any]) -> StepResult:
        # action dict has "actions" key (np.ndarray from model server).
        # Return StepResult(obs, reward, done, info).
        ...

    def make_obs(self, raw_obs: Any, task: dict[str, Any]) -> dict[str, Any]:
        # Convert raw env observation to dict for model server.
        # Convention: {"images": {"cam_name": np.ndarray HWC uint8},
        #              "task_description": str}
        # Optionally add "states": np.ndarray for proprioception.
        ...

    def is_done(self, step_result: StepResult) -> bool:
        # Return True to end the episode.
        ...

    def get_result(self, step_result: StepResult) -> dict[str, Any]:
        # Must return at least {"success": bool}.
        ...

    def get_metadata(self) -> dict[str, Any]:
        # Optional. Return {"max_steps": N} for benchmark default.
        ...
```

### Key Patterns (from existing implementations)

- **Lazy imports**: Put heavy sim imports (torch, robosuite, sapien) inside methods, not at module top. This allows the registry to resolve the class without loading the sim.
- **Env reuse**: LIBERO reuses env across episodes of the same task. SimplerEnv creates a fresh env per episode. Choose based on the sim's reset semantics.
- **Action processing**: Model servers output raw continuous actions. The benchmark must convert to sim-specific format (e.g. discretize gripper, convert euler→axis-angle).
- **Image preprocessing**: If the sim outputs non-standard images (flipped, wrong resolution), handle in `make_obs()`.
- **EGL headless rendering**: Set `os.environ.setdefault("PYOPENGL_PLATFORM", "egl")` at module top if the sim uses OpenGL.

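To make the action-processing and image-preprocessing patterns concrete, here is a minimal illustrative sketch (not code from the harness). It assumes a 7-DoF delta-EEF action laid out as xyz + euler RPY + gripper and uses `scipy` for the rotation conversion; adapt the names and thresholds to the actual sim.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def process_action(raw: np.ndarray) -> np.ndarray:
    """Convert xyz + euler RPY + continuous gripper into xyz + axis-angle + binary gripper."""
    delta_pos = raw[:3]
    axis_angle = Rotation.from_euler("xyz", raw[3:6]).as_rotvec()
    gripper = np.array([1.0 if raw[6] > 0.0 else -1.0])  # discretize the gripper command
    return np.concatenate([delta_pos, axis_angle, gripper]).astype(np.float32)


def preprocess_image(img: np.ndarray) -> np.ndarray:
    """Fix a sim that renders frames upside down before sending them to the model server."""
    return np.ascontiguousarray(img[::-1])  # vertical flip, still HWC uint8
```

Helpers like these would typically be called from `step()` and `make_obs()` respectively.
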
### 3. Create Config YAML

Create `configs/<name>_eval.yaml`:

```yaml
server:
  url: "ws://localhost:8000"

docker:
  image: <name>
  env: []      # e.g. ["NVIDIA_DRIVER_CAPABILITIES=all"] for Vulkan
  volumes: []  # e.g. ["/path/to/data:/data:ro"]

output_dir: "./results"

benchmarks:
  - benchmark: "vla_eval.benchmarks.<name>.benchmark:MyBenchmark"
    mode: sync
    episodes_per_task: 50
    params:
      # All keys here are passed as **kwargs to MyBenchmark.__init__()
      suite: default
      seed: 7
```

- `benchmark` field: full import string in `module.path:ClassName` format
- `params`: arbitrary dict passed to constructor — no schema enforcement
- `max_steps`: omit to use `get_metadata()["max_steps"]`, or set explicitly to override

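For illustration, the import string and `params` dict can be thought of as resolving like the sketch below (the harness's actual loader may differ):

```python
import importlib
from typing import Any


def load_benchmark_class(import_string: str) -> type:
    """Resolve a "module.path:ClassName" import string to the class it names."""
    module_path, class_name = import_string.split(":")
    return getattr(importlib.import_module(module_path), class_name)


def build_benchmark(entry: dict[str, Any]) -> Any:
    """Instantiate one `benchmarks:` entry; every `params` key becomes a keyword argument."""
    cls = load_benchmark_class(entry["benchmark"])
    return cls(**entry.get("params", {}))
```
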
### 4. Create Dockerfile

Create `docker/Dockerfile.<name>`:

```dockerfile
FROM <base_image>

# Install harness
WORKDIR /workspace
COPY pyproject.toml README.md ./
COPY src/ src/
ARG HARNESS_VERSION=0.0.0
ENV SETUPTOOLS_SCM_PRETEND_VERSION=${HARNESS_VERSION}
RUN pip install .

COPY configs/ configs/

ENTRYPOINT ["vla-eval"]
CMD ["run", "--config", "/workspace/configs/<name>_eval.yaml"]
```

### 5. Register in Build/Push Scripts

Add the new benchmark to the arrays in `docker/build.sh` and `docker/push.sh`:

- `BENCHMARKS=(... <name> ...)` in `docker/build.sh`
- `IMAGES=(... <name> ...)` in `docker/push.sh`

If the name contains underscores (e.g. `mikasa_robo`), the scripts automatically convert them to hyphens for the Docker image name (`mikasa-robo`).

### 6. Verify

1. Run `make check` — lint + format + type check
2. Run `make test` — ensure existing tests still pass
3. Run `vla-eval test --validate` — validate all config import strings (including the new one)
4. Run `vla-eval test -c configs/<name>_eval.yaml` — smoke-test the benchmark (requires Docker + the benchmark image; runs 1 episode with an EchoModelServer, no real model or GPU needed)

### Reference Implementations

- **LIBERO** (`benchmarks/libero/benchmark.py`): MuJoCo tabletop, env reuse, suite-specific max_steps, image flip preprocessing
- **SimplerEnv** (`benchmarks/simpler/benchmark.py`): SAPIEN+Vulkan, new env per episode, Euler→axis-angle action conversion
- **CALVIN** (`benchmarks/calvin/benchmark.py`): PyBullet, chained subtasks, delta actions, hardcoded normalization stats

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
# Skill: add-model-server

Add a new VLA model server to the evaluation harness.

## Trigger

User asks to add/integrate a new model (e.g. "add OpenVLA server", "integrate RT-2").

## Steps

### 1. Gather Requirements

Ask the user (if not already provided):
- **Model name** (e.g. `openvla`)
- **Framework/library** (e.g. HuggingFace Transformers, custom repo)
- **Python dependencies** (torch version, model-specific packages)
- **Checkpoint source** (HuggingFace Hub model ID or local path)
- **Action output format** (dimension, chunk_size, continuous vs discrete)
- **Input requirements** (single image vs multi-view, needs proprioceptive state?)

### 2. Create Model Server Script

Create `src/vla_eval/model_servers/<name>.py` as a **uv script** (standalone, inline deps).

The file MUST start with a PEP 723 inline script metadata block:

```python
# /// script
# requires-python = "~=3.11"
# dependencies = [
#     "vla-eval",
#     "<model-package>",
#     "torch>=2.0",
#     "transformers>=4.40,<5",
#     "pillow>=9.0",
#     "numpy>=1.24",
# ]
#
# [tool.uv.sources]
# vla-eval = { path = "../../.." }
# <model-package> = { git = "https://github.com/org/repo.git", branch = "main" }
# ///
```

Subclass `PredictModelServer` (most models) or `ModelServer` (advanced async):

```python
from typing import Any

import numpy as np

from vla_eval.model_servers.base import SessionContext
from vla_eval.model_servers.predict import PredictModelServer
from vla_eval.model_servers.serve import serve


class MyModelServer(PredictModelServer):
    def __init__(self, checkpoint: str, *, chunk_size: int = 1, action_ensemble: str = "newest", **kwargs):
        super().__init__(chunk_size=chunk_size, action_ensemble=action_ensemble, **kwargs)
        self.checkpoint = checkpoint
        self._model = None

    def _load_model(self) -> None:
        """Lazily load model on first predict() call."""
        if self._model is not None:
            return
        import torch
        # Load model here...
        self._model = ...

    def predict(self, obs: dict[str, Any], ctx: SessionContext) -> dict[str, Any]:
        """Single-observation inference. Blocking call.

        Args:
            obs: {"images": {"cam_name": np.ndarray HWC uint8},
                  "task_description": str,
                  "states": np.ndarray (optional)}
            ctx: Session context (session_id, episode_id, step, is_first)

        Returns:
            {"actions": np.ndarray} with shape:
            - (action_dim,) if chunk_size == 1
            - (chunk_size, action_dim) if chunk_size > 1
        """
        self._load_model()
        # Run inference...
        return {"actions": np.array(actions, dtype=np.float32)}
```

### Key Patterns (from existing implementations)

**PredictModelServer features (inherited automatically):**
- **Action chunking**: When `chunk_size > 1`, return a `(chunk_size, action_dim)` array. The framework auto-buffers and serves one action per step, re-inferring only when the buffer empties (see the sketch below).
- **Action ensemble**: `"newest"` (default), `"average"`, `"ema"` — blends overlapping chunks. Set via `action_ensemble=` in `__init__`.
- **Batched inference**: Override `predict_batch()` and set `max_batch_size > 1` for GPU-batched multi-shard eval.
- **Per-suite chunk_size**: Override `on_episode_start()` to set `self._session_chunk_sizes[ctx.session_id] = N` (see the CogACT example).
- **CI/LAAS**: Set `continuous_inference=True` for continuous inference mode (DRAFT).

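As a rough mental model of the auto-buffering behaviour, consider the conceptual sketch below (not the harness's implementation; ensemble modes such as `"average"`/`"ema"` additionally blend overlapping chunks, which is omitted here):

```python
from collections import deque
from typing import Callable

import numpy as np


class ChunkBuffer:
    """One inference call fills the buffer; each env step pops a single action."""

    def __init__(self) -> None:
        self._buffer: deque[np.ndarray] = deque()

    def next_action(self, infer: Callable[[], np.ndarray]) -> np.ndarray:
        if not self._buffer:                    # buffer empty -> re-run inference
            chunk = np.atleast_2d(infer())      # predict() returned (chunk_size, action_dim)
            self._buffer.extend(chunk)
        return self._buffer.popleft()           # serve one action per step
```
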
**Image handling:**
```python
from PIL import Image as PILImage

images = obs.get("images", {})
img_array = next(iter(images.values()))  # first camera
pil_image = PILImage.fromarray(img_array).convert("RGB")
```

**Task description:**
```python
text = obs.get("task_description", "")
```

**Lazy model loading**: Always use a `_load_model()` pattern. Do NOT load in `__init__`.

### 3. Add `if __name__ == "__main__"` Entry Point

The script must be runnable via `uv run`:

```python
# At the top of the script (alongside the other imports):
import argparse
import logging

logger = logging.getLogger(__name__)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="<Model> server (uv script)")
    parser.add_argument("--checkpoint", required=True, help="HF model ID or local path")
    parser.add_argument("--chunk_size", type=int, default=1)
    parser.add_argument("--action_ensemble", default="newest")
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--verbose", "-v", action="store_true")
    args = parser.parse_args()

    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(asctime)s %(levelname)-8s %(name)s: %(message)s",
    )

    server = MyModelServer(args.checkpoint)
    server.chunk_size = args.chunk_size
    server.action_ensemble = args.action_ensemble

    logger.info("Pre-loading model...")
    server._load_model()
    logger.info("Model ready, starting server on ws://%s:%d", args.host, args.port)
    serve(server, host=args.host, port=args.port)
```

### 4. Create Config YAML

Create `configs/model_servers/<name>.yaml`:

```yaml
# <Model Name> model server — <benchmark> checkpoint
# Weight: <HuggingFace model ID>
# Benchmark: <target benchmark>

script: "src/vla_eval/model_servers/<name>.py"
args:
  checkpoint: <org/model-id>
  chunk_size: 1
  port: 8000
```

The CLI runs this via `vla-eval serve --config configs/model_servers/<name>.yaml`, which translates to `uv run <script> --checkpoint <value> --chunk_size <value> --port <value>`.

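As an illustration of that mapping (using PyYAML here purely for demonstration; the real CLI may assemble the command differently):

```python
import yaml  # PyYAML, used here only for illustration

with open("configs/model_servers/<name>.yaml") as f:
    cfg = yaml.safe_load(f)

cmd = ["uv", "run", cfg["script"]]
for key, value in cfg.get("args", {}).items():
    cmd += [f"--{key}", str(value)]

print(" ".join(cmd))
# -> uv run src/vla_eval/model_servers/<name>.py --checkpoint <org/model-id> --chunk_size 1 --port 8000
```
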
### 5. Verify

1. Run `make check` — lint + format + type check
2. Run `make test` — ensure existing tests still pass
3. Suggest the user test with: `vla-eval test -c configs/model_servers/<name>.yaml`
   (starts the server, sends dummy observations from a StubBenchmark, and checks for a valid action response — requires `uv` + GPU + model weights)

### Reference Implementations

- **CogACT** (`model_servers/dexbotic/cogact.py`): Diffusion action head, chunk_size_map per suite, batched inference, text template option
- **starVLA** (`model_servers/starvla.py`): Auto-detecting framework, HuggingFace checkpoint download, monkey-patches for upstream compat

### Server Hierarchy

```
ModelServer (ABC)          ← Advanced: async on_observation()
└── PredictModelServer     ← Most models: blocking predict()
```

- Use `PredictModelServer` for standard request-response models (95% of cases)
- Use `ModelServer` only if you need async streaming or custom message handling

.dockerignore

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Ignore everything by default, then whitelist
*
!pyproject.toml
!README.md
!src/
!configs/
!docker/calvin_validation_data/
!docker/init_states/
!docker/*_entrypoint.sh
