* run_benchmarks script
* pass missing max_episodes in eval_main
* mb-bench markdown replacing run_benchmarks script
* Missing bench dir prefix
* Fix wrist camera name
* pnp proper bench dir
* Fix body to geoms
* bump bench-v2 version
* Option A. Mention eval_to_csv usage from each benchmark instructions
* More detailed description of the effect of max_episodes
* Making some more sense for max_episodes
* Added Leaderboard docs naming to ms-bench and mb-bench mds
**`molmo_spaces/evaluation/README.md`** — 8 additions & 1 deletion
```diff
@@ -12,7 +12,6 @@ This README focuses on benchmark installation and running.
 - Submitting results, see the GitHub issue in the repository [here](https://github.com/allenai/molmospaces/issues/8).
 - Theoretical notes on policy comparison can be found [here](https://docs.google.com/document/d/1FcMxJgAQ_2Ojd2uu8HE2MBfD6RE53zcXa55_r8EfPts/export?format=pdf)
 
-
 ## Concepts
 
 The MolmoSpaces **leaderboard** shows the results of various policies on benchmarks.
```
```diff
@@ -75,6 +74,10 @@ uv run scripts/serve_policy.py --port=8080 policy:checkpoint \
 
 #### 2. Run the benchmark
 
+Please look at the concrete commands for each task type in our [leaderboard](https://molmospaces.allen.ai/leaderboard):
 If using OpenPI models: `pip install openpi_client`.
 
 For this we chose the easy `MS-Pick` benchmark, which is located here `assets/benchmarks/molmospaces-bench-v1/procthor-10k/FrankaPickDroidMiniBench/FrankaPickDroidMiniBench_json_benchmark_20251231/`.
```
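As a hedged illustration of how a client might query the policy server started in step 1, here is a minimal sketch using `openpi_client`. The observation keys and prompt below are placeholders, not the keys any particular policy expects:

```python
# Minimal sketch, assuming the websocket policy server from step 1 is
# listening on localhost:8080. Observation keys are placeholders; each
# policy defines its own expected keys.
import numpy as np
from openpi_client import websocket_client_policy

client = websocket_client_policy.WebsocketClientPolicy(host="localhost", port=8080)
dummy_image = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in for a camera frame
result = client.infer({"observation/image": dummy_image, "prompt": "pick up the object"})
actions = result["actions"]  # an action chunk; shape depends on the policy
```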
```diff
@@ -197,6 +200,10 @@ class MyEvalConfig(JsonBenchmarkEvalConfig):
 
 ### 4. Run Evaluation
 
+Please look at the concrete commands for each task type in our [leaderboard](https://molmospaces.allen.ai/leaderboard):
```
**`molmo_spaces/evaluation/eval_main.py`** — 24 additions & 2 deletions
```diff
@@ -231,7 +231,14 @@ def get_args():
         "--max_episodes",
         type=int,
         default=None,
-        help="Maximum number of episodes to evaluate from benchmark. If None, evaluates all episodes.",
+        help="Limit the number of episodes evaluated from the benchmark. If None, all episodes are evaluated; otherwise, only episodes belonging to the houses that appear among the first `max_episodes` episodes are evaluated. Note that the final episode count can exceed `max_episodes` if more than one episode is sampled for any of those houses.",
```
```diff
+Please see detailed commands for each task type below, and replace `<YOUR_POLICY_CONFIG>` with your evaluation config (e.g. `molmo_spaces.evaluation.configs.evaluation_configs:PiPolicyEvalConfig`).
+
+Finally, run the evaluation output script that aggregates results as csv files:
```
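As a hedged aside on the `<YOUR_POLICY_CONFIG>` placeholder: specs of the form `module.path:ClassName` are typically resolved with a dynamic import, along these lines. This is a sketch of the general pattern, not necessarily how `eval_main.py` implements it:

```python
# Sketch of resolving a "module.path:ClassName" config spec such as
# molmo_spaces.evaluation.configs.evaluation_configs:PiPolicyEvalConfig.
# Hypothetical helper; the real entry point may differ.
import importlib

def resolve_config(spec: str):
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# Usage (hypothetical):
# config_cls = resolve_config(
#     "molmo_spaces.evaluation.configs.evaluation_configs:PiPolicyEvalConfig"
# )
```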