Commit a3dc44d

mb benchmarks markdown (#27)

* run_benchmarks script
* pass missing max_episodes in eval_main
* mb-bench markdown replacing run_benchmarks script
* Missing bench dir prefix
* Fix wrist camera name
* pnp proper bench dir
* Fix body to geoms
* bump bench-v2 version
* Option A. Mention eval_to_csv usage from each benchmark instructions
* More detailed description of the effect of max_episodes
* Making some more sense for max_episodes
* Added Leaderboard docs naming to ms-bench and mb-bench mds

1 parent bd6c1db, commit a3dc44d

7 files changed: 168 additions & 17 deletions

molmo_spaces/env/object_manager.py

Lines changed: 3 additions & 3 deletions

```diff
@@ -1614,9 +1614,9 @@ def clear(self):
     def get_body_to_geoms(self):
         body_to_geom_ids = defaultdict(set)
         for geom_id in range(0, self.model.ngeom):
-            body_id = self.model.geom(geom_id).bodyid
-            root_id = self.model.body(body_id).rootid
-            body_to_geom_ids[int(root_id)].add(int(geom_id))
+            body_id = int(self.model.geom(geom_id).bodyid.item())
+            root_id = int(self.model.body(body_id).rootid.item())
+            body_to_geom_ids[root_id].add(geom_id)
         return {
             key: sorted(values)
             for key, values in body_to_geom_ids.items()
```
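The `get_body_to_geoms` fix above converts MuJoCo's numpy scalar fields (`bodyid`, `rootid`) to plain Python ints before they are used as dict keys and set members. A minimal sketch of the grouping logic, using stub numpy data in place of a real `mujoco.MjModel` (the stub lists are illustrative; only the conversion pattern mirrors the diff):

```python
from collections import defaultdict
import numpy as np

# Stubs standing in for MuJoCo lookups; the real code reads
# model.geom(geom_id).bodyid and model.body(body_id).rootid,
# both of which come back as numpy arrays, not plain ints.
geom_body = [np.array([0]), np.array([1]), np.array([1])]  # bodyid per geom
body_root = [np.array([0]), np.array([0])]                 # rootid per body

body_to_geom_ids = defaultdict(set)
for geom_id in range(len(geom_body)):
    body_id = int(geom_body[geom_id].item())  # plain int, not np.ndarray
    root_id = int(body_root[body_id].item())  # hashable, stable dict key
    body_to_geom_ids[root_id].add(geom_id)

result = {key: sorted(values) for key, values in body_to_geom_ids.items()}
print(result)  # {0: [0, 1, 2]}
```

Converting up front keeps the keys as ordinary ints, which also makes the mapping straightforward to serialize later.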

molmo_spaces/evaluation/README.md

Lines changed: 8 additions & 1 deletion

````diff
@@ -12,7 +12,6 @@ This README focuses on benchmark installation and running.
 - Submitting results, see the GitHub issue in the repository [here](https://github.com/allenai/molmospaces/issues/8).
 - Theoretical notes on policy comparison can be found [here](https://docs.google.com/document/d/1FcMxJgAQ_2Ojd2uu8HE2MBfD6RE53zcXa55_r8EfPts/export?format=pdf)
 
-
 ## Concepts
 
 The MolmoSpaces **leaderboard** shows the results of various polices on benchmarks.
@@ -75,6 +74,10 @@ uv run scripts/serve_policy.py --port=8080 policy:checkpoint \
 
 #### 2. Run the benchmark
 
+Please look at the concrete commands for each task type in our [leaderboard](https://molmospaces.allen.ai/leaderboard):
+- MolmoSpaces tasks (`MS-` prefix): [ms-bench](ms-bench.md)
+- MolmoBot tasks (`MB-` prefix): [mb-bench](mb-bench.md)
+
 If using OpenPI models: `pip install openpi_client`.
 
 For this we chose the easy `MS-Pick` benchmark, which is located here `assets/benchmarks/molmospaces-bench-v1/procthor-10k/FrankaPickDroidMiniBench/FrankaPickDroidMiniBench_json_benchmark_20251231/`.
@@ -197,6 +200,10 @@ class MyEvalConfig(JsonBenchmarkEvalConfig):
 
 ### 4. Run Evaluation
 
+Please look at the concrete commands for each task type in our [leaderboard](https://molmospaces.allen.ai/leaderboard):
+- MolmoSpaces tasks (`MS-` prefix): [ms-bench](ms-bench.md)
+- MolmoBot tasks (`MB-` prefix): [mb-bench](mb-bench.md)
+
 Command line:
 
 ```bash
````

molmo_spaces/evaluation/eval_main.py

Lines changed: 24 additions & 2 deletions

```diff
@@ -231,7 +231,14 @@ def get_args():
         "--max_episodes",
         type=int,
         default=None,
-        help="Maximum number of episodes to evaluate from benchmark. If None, evaluates all episodes.",
+        help="Limit number of episodes to evaluate from benchmark. If None, evaluates all episodes, else, evaluates only the episodes for the houses used in the first `max_episodes`. Note that the final number of episodes can differ from `max_episodes` if more than one episode is sampled for any of the houses among the first `max_episodes` episodes.",
+    )
+    parser.add_argument(
+        "--camera_names",
+        type=str,
+        nargs="+",
+        default=None,
+        help="Override policy_config.camera_names (e.g. --camera_names randomized_zed2_analogue_1 wrist_camera).",
     )
 
     # Eval camera randomization flags (shared across all JSON eval entry points)
@@ -354,6 +361,7 @@ class EvalRuntimeParams:
     """
 
     episode_idx: int | None = None
+    max_episodes: int | None = None
     add_custom_object: bool = False
     custom_object_path: str | Path | None = None
     custom_object_name: str | None = None
@@ -442,6 +450,8 @@ def run_evaluation(
     preloaded_policy: BasePolicy | None = None,
     max_episodes: int | None = None,
     camera_config_override: Any | None = None,
+    camera_names_override: list[str] | None = None,
+    use_filament: bool = False,
     environment_light_intensity: float | None = None,
     episode_idx: int | None = None,
     add_custom_object: bool = False,
@@ -469,6 +479,8 @@ def run_evaluation(
         max_episodes: Maximum number of episodes to evaluate from benchmark. If None, evaluates all episodes.
         camera_config_override: Optional camera system config (e.g. FrankaEvalCameraSystem) to
             replace the default camera_config on the experiment config.
+        camera_names_override: Optional list of camera names to override
+            policy_config.camera_names (e.g. ["randomized_zed2_analogue_1", "wrist_camera"]).
         episode_idx: Index of a specific episode to evaluate. If None, evaluates all episodes.
         add_custom_object: Whether to replace the target object with a custom object.
         custom_object_path: Path to the custom object XML file. Required if add_custom_object is True.
@@ -604,15 +616,22 @@ def run_evaluation(
         camera_config_override=camera_config_override,
     )
 
-    # Custom filmanet settings to overwrite by the user
+    # Custom filament settings to overwrite by the user
+    exp_config.use_filament |= use_filament
    exp_config.environment_light_intensity = (
        environment_light_intensity or exp_config.environment_light_intensity
    )
 
+    # Override policy camera names if requested
+    if camera_names_override is not None:
+        log.info(f"Overriding policy_config.camera_names: {camera_names_override}")
+        exp_config.policy_config.camera_names = camera_names_override
+
     # Patch config with evaluation-specific runtime parameters
     exp_config = JsonEvalRunner.patch_config(
         exp_config=exp_config,
         episode_idx=episode_idx,
+        max_episodes=max_episodes,
         add_custom_object=add_custom_object,
         custom_object_path=custom_object_path,
         custom_object_name=custom_object_name,
@@ -729,8 +748,11 @@ def main() -> None:
         num_workers=args.num_workers,
         use_wandb=not args.no_wandb,
         wandb_project=args.wandb_project,
+        max_episodes=args.max_episodes,
+        use_filament=args.use_filament,
         environment_light_intensity=args.environment_light_intensity,
         camera_config_override=eval_camera_config,
+        camera_names_override=args.camera_names,
         episode_idx=args.idx,
         add_custom_object=args.add_custom_object,
         custom_object_path=args.custom_object_path,
```
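The new `--camera_names` flag relies on argparse's `nargs="+"`, so multiple space-separated names are collected into a single list. A standalone sketch of just that flag (not the project's full parser):

```python
import argparse

# Minimal parser exposing only the flag added in the diff above.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--camera_names",
    type=str,
    nargs="+",
    default=None,
    help="Override policy_config.camera_names.",
)

# nargs="+" turns the space-separated names into a list of strings.
args = parser.parse_args(
    ["--camera_names", "randomized_zed2_analogue_1", "wrist_camera"]
)
print(args.camera_names)  # ['randomized_zed2_analogue_1', 'wrist_camera']

# Omitting the flag keeps the default, so the policy config is untouched.
print(parser.parse_args([]).camera_names)  # None
```

The `default=None` sentinel is what lets `run_evaluation` treat "flag not given" and "override requested" as distinct cases.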

molmo_spaces/evaluation/json_eval_runner.py

Lines changed: 20 additions & 3 deletions

```diff
@@ -55,6 +55,7 @@ class JsonEvalRunner(ParallelRolloutRunner):
     def patch_config(
         exp_config: MlSpacesExpConfig,
         episode_idx: int | None = None,
+        max_episodes: int | None = None,
         add_custom_object: bool = False,
         custom_object_path: str | Path | None = None,
         custom_object_name: str | None = None,
@@ -69,6 +70,11 @@ def patch_config(
             exp_config: The experiment config to patch
             episode_idx: Optional index of a specific episode to evaluate. If provided,
                 only that episode will be evaluated and the process will stop after it.
+            max_episodes: Optional maximum number of episodes to evaluate. If provided,
+                only the episodes for the houses used in the first N episodes will be
+                evaluated. Note that the final number of episodes can differ from N
+                if more than one episode is sampled for any of the houses among the
+                first N episodes.
             add_custom_object: Whether to replace the target object with a custom object.
             custom_object_path: Path to the custom object XML file. Required if
                 add_custom_object is True.
@@ -89,6 +95,7 @@ def patch_config(
         # eval_runtime_params is now a proper field in MlSpacesExpConfig, so normal assignment works
         exp_config.eval_runtime_params = EvalRuntimeParams(
             episode_idx=episode_idx,
+            max_episodes=max_episodes,
             add_custom_object=add_custom_object,
             custom_object_path=custom_object_path,
             custom_object_name=custom_object_name,
@@ -127,13 +134,19 @@ def __init__(
                 f"Expected benchmark.json file with list of episode specs."
             )
 
+        eval_params = exp_config.eval_runtime_params
+        if eval_params.max_episodes is not None and len(all_episodes) > eval_params.max_episodes:
+            log.info(
+                f"Limiting to first {eval_params.max_episodes} of {len(all_episodes)} episodes"
+            )
+            all_episodes = all_episodes[: eval_params.max_episodes]
+
         self._episodes_by_house: dict[int, list[EpisodeSpec]] = defaultdict(list)
         for ep in all_episodes:
             self._episodes_by_house[ep.house_index].append(ep)
         self._episodes_by_house = dict(self._episodes_by_house)
 
         # If episode_idx is specified, only process the house containing that episode
-        eval_params = exp_config.eval_runtime_params
         episode_idx = eval_params.episode_idx
         if episode_idx is not None:
             if episode_idx < 0 or episode_idx >= len(all_episodes):
@@ -190,8 +203,13 @@ def load_episodes_for_house(
             )
             return [], None
 
-        # Filter by episode index if specified
         eval_params = exp_config.eval_runtime_params
+
+        # Truncate to max_episodes before any filtering
+        if eval_params.max_episodes is not None and len(all_episodes) > eval_params.max_episodes:
+            all_episodes = all_episodes[: eval_params.max_episodes]
+
+        # Filter by episode index if specified
         episode_idx = eval_params.episode_idx
         if episode_idx is not None:
             if episode_idx < 0 or episode_idx >= len(all_episodes):
@@ -217,7 +235,6 @@ def load_episodes_for_house(
             return [], None
 
         # Apply custom object replacement if requested
-        eval_params = exp_config.eval_runtime_params
         add_custom_object = eval_params.add_custom_object
         custom_object_path = eval_params.custom_object_path
         custom_object_name = eval_params.custom_object_name
```
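The truncate-then-group ordering added above (limit to the first N episodes, then bucket the survivors by house) can be sketched with bare tuples standing in for `EpisodeSpec`; the sample data here is hypothetical:

```python
from collections import defaultdict

# Hypothetical flat benchmark list of (episode_id, house_index) pairs.
all_episodes = [(0, 7), (1, 7), (2, 3), (3, 9), (4, 3)]
max_episodes = 3

# Truncate to the first N episodes before any per-house grouping,
# mirroring the order of operations in the diff.
if max_episodes is not None and len(all_episodes) > max_episodes:
    all_episodes = all_episodes[:max_episodes]

# Group the surviving episodes by house; house 9 drops out entirely
# because none of its episodes fall in the first N.
episodes_by_house = defaultdict(list)
for ep_id, house in all_episodes:
    episodes_by_house[house].append(ep_id)

print(dict(episodes_by_house))  # {7: [0, 1], 3: [2]}
```

This is also why the final episode count can differ from N: evaluation proceeds house by house, over whichever houses the first N episodes touch.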
molmo_spaces/evaluation/mb-bench.md

Lines changed: 93 additions & 0 deletions

````diff
@@ -0,0 +1,93 @@
+# MolmoBot Benchmarks
+
+## Usage
+
+We first run an evaluation like
+```bash
+python molmo_spaces/evaluation/eval_main.py \
+    <YOUR_POLICY_CONFIG> \
+    [OPTIONS] \
+    --benchmark_dir <BENCHMARK_DIR> \
+    --output_dir <eval_output_dir>
+```
+Please see detailed commands for each task type below, and replace `<YOUR_POLICY_CONFIG>` with your evaluation config (e.g. `molmo_spaces.evaluation.configs.evaluation_configs:PiPolicyEvalConfig`).
+
+Finally, run the evaluation output script that aggregates results as csv files:
+```bash
+python scripts/benchmarks/eval_to_csv.py \
+    <eval_output_dir>/<date_str> \
+    <policy_name> \
+    --success-condition both \
+    --output-csv /eg/path/to/<task_type>/<policy_name>.csv
+```
+
+## Benchmarks with classic renderer
+
+For benchmarks using the classic renderer we need to install the `mujoco` version from [our dependencies](../../pyproject.toml), e.g., by calling
+```bash
+pip install -e ".[mujoco]"
+```
+from the project root directory.
+
+### Pick-MSProc (Pick-v1.5)
+
+```bash
+python molmo_spaces/evaluation/eval_main.py <YOUR_POLICY_CONFIG> \
+    --benchmark_dir $MLSPACES_ASSETS_DIR/benchmarks/molmospaces-bench-v2/procthor-10k/FrankaPickDroidMiniBench/FrankaPickDroidMiniBench_json_benchmark_20251231
+```
+
+### Pick-Classic (Pick-v2-classic)
+
+```bash
+python molmo_spaces/evaluation/eval_main.py <YOUR_POLICY_CONFIG> \
+    --benchmark_dir $MLSPACES_ASSETS_DIR/benchmarks/molmospaces-bench-v2/procthor-objaverse/FrankaPickHardBench/FrankaPickHardBench_20260206_json_benchmark
+```
+
+## Benchmarks with filament renderer
+
+For benchmarks using filament we should install `mujoco-filament` from [our dependencies](../../pyproject.toml), e.g., by calling
+```bash
+pip install -e ".[mujoco-filament]"
+```
+from the project root directory and pass the `--use-filament` option to the evaluation script.
+
+### Pick-Filament (Pick-v2-filament)
+
+```bash
+python molmo_spaces/evaluation/eval_main.py <YOUR_POLICY_CONFIG> \
+    --use-filament \
+    --benchmark_dir $MLSPACES_ASSETS_DIR/benchmarks/molmospaces-bench-v2/procthor-objaverse/FrankaPickHardBench/FrankaPickHardBench_20260206_json_benchmark
+```
+
+### Pick-RandCam (Pick-v2-rand-cam)
+
+```bash
+python molmo_spaces/evaluation/eval_main.py <YOUR_POLICY_CONFIG> \
+    --use-filament \
+    --camera_names randomized_zed2_analogue_1 wrist_camera_zed_mini \
+    --benchmark_dir $MLSPACES_ASSETS_DIR/benchmarks/molmospaces-bench-v2/procthor-objaverse/FrankaPickHardBench/FrankaPickHardBench_20260206_json_benchmark
+```
+
+### Pick & Place (PnP-v2)
+
+```bash
+python molmo_spaces/evaluation/eval_main.py <YOUR_POLICY_CONFIG> \
+    --use-filament \
+    --benchmark_dir $MLSPACES_ASSETS_DIR/benchmarks/molmospaces-bench-v2/procthor-objaverse/FrankaPickandPlaceHardBench/FrankaPickandPlaceHardBench_20260206_json_benchmark
+```
+
+### Pick & Place-NextTo (PnP-next-to-v2)
+
+```bash
+python molmo_spaces/evaluation/eval_main.py <YOUR_POLICY_CONFIG> \
+    --use-filament \
+    --benchmark_dir $MLSPACES_ASSETS_DIR/benchmarks/molmospaces-bench-v2/procthor-objaverse/FrankaPickandPlaceNextToHardBench/FrankaPickandPlaceNextToHardBench_20260305_json_benchmark
+```
+
+### Pick & Place-Color (PnP-color-v2)
+
+```bash
+python molmo_spaces/evaluation/eval_main.py <YOUR_POLICY_CONFIG> \
+    --use-filament \
+    --benchmark_dir $MLSPACES_ASSETS_DIR/benchmarks/molmospaces-bench-v2/procthor-objaverse/FrankaPickandPlaceColorHardBench/FrankaPickandPlaceColorHardBench_20260304_json_benchmark
+```
````

molmo_spaces/evaluation/ms-bench.md

Lines changed: 19 additions & 7 deletions

````diff
@@ -2,36 +2,48 @@
 
 ## Usage
 
+We first run an evaluation like
 ```bash
-python molmo_spaces/evaluation/eval_main.py <YOUR_POLICY_CONFIG> --benchmark_dir <BENCHMARK_DIR>
+python molmo_spaces/evaluation/eval_main.py \
+    <YOUR_POLICY_CONFIG> \
+    --benchmark_dir <BENCHMARK_DIR> \
+    --output_dir <eval_output_dir>
+```
+Please see detailed commands for each task type below, and replace `<YOUR_POLICY_CONFIG>` with your evaluation config (e.g. `molmo_spaces.evaluation.configs.evaluation_configs:PiPolicyEvalConfig`).
+
+Finally, run the evaluation output script that aggregates results as csv files:
+```bash
+python scripts/benchmarks/eval_to_csv.py \
+    <eval_output_dir>/<date_str> \
+    <policy_name> \
+    --success-condition both \
+    --output-csv /eg/path/to/<task_type>/<policy_name>.csv
 ```
-
-Replace `<YOUR_POLICY_CONFIG>` with your evaluation config (e.g. `molmo_spaces.evaluation.configs.evaluation_configs:PiPolicyEvalConfig`).
 
 ## Benchmarks
 
-### Close
+### Close (Close-v1)
 
 ```bash
 python molmo_spaces/evaluation/eval_main.py YOUR_POLICY_CONFIG \
     --benchmark_dir assets/benchmarks/molmospaces-bench-v1/ithor/FrankaCloseDataGenConfig/FrankaCloseDataGenConfig_20260123_json_benchmark
 ```
 
-### Open
+### Open (Open-v1)
 
 ```bash
 python molmo_spaces/evaluation/eval_main.py YOUR_POLICY_CONFIG \
     --benchmark_dir assets/benchmarks/molmospaces-bench-v1/ithor/FrankaOpenDataGenConfig/FrankaOpenDataGenConfig_20260123_json_benchmark
 ```
 
-### Pick
+### Pick (Pick-v1.1)
 
 ```bash
 python molmo_spaces/evaluation/eval_main.py YOUR_POLICY_CONFIG \
     --benchmark_dir assets/benchmarks/molmospaces-bench-v1/procthor-10k/FrankaPickDroidMiniBench/FrankaPickDroidMiniBench_json_benchmark_20251231
 ```
 
-### Pick and Place
+### Pick and Place (PnP-v1)
 
 ```bash
 python molmo_spaces/evaluation/eval_main.py YOUR_POLICY_CONFIG \
````

molmo_spaces/molmo_spaces_constants.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -125,7 +125,7 @@ def resource_manager_log_level(log_level=logging.DEBUG):
     },
     benchmarks={
         "molmospaces-bench-v1": "20260408",
-        "molmospaces-bench-v2": "20240407",
+        "molmospaces-bench-v2": "20260415",
     },
 )
 
```
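The one-line change above corrects a mistyped benchmark snapshot date ("20240407" pointed at the wrong year). A sketch of a guard that would catch such typos, assuming the pins are `YYYYMMDD` strings; the dict literal mirrors the diff, but `validate_snapshot` and its sanity window are hypothetical:

```python
from datetime import datetime

# Pinned benchmark snapshots, as in molmo_spaces_constants.py after the fix.
benchmarks = {
    "molmospaces-bench-v1": "20260408",
    "molmospaces-bench-v2": "20260415",  # was "20240407" (mistyped)
}

def validate_snapshot(tag: str) -> bool:
    """Return True if tag parses as YYYYMMDD and lands in a plausible window."""
    try:
        dt = datetime.strptime(tag, "%Y%m%d")
    except ValueError:
        return False
    # Hypothetical sanity window around the project's release dates.
    return 2025 <= dt.year <= 2027

print(all(validate_snapshot(v) for v in benchmarks.values()))  # True
print(validate_snapshot("20240407"))  # False (the old, mistyped pin)
```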
