Commit 6383474

Add scripts/submit_eval_jobs_new.py (#1638)
* Add scripts/submit_eval_jobs_new.py to submit olmo-eval-internal jobs via Beaker.
* Cleaned up script.
* Address PR review: dedup CHANGELOG, sanitize names, gate Weka mounts, use safe_dump.
* Address PR review: rename submit_eval_jobs scripts; add --olmo_eval_ref; deprecate old script.
* Point auto-launched evals at submit_eval_jobs_old.py to keep existing flag set working.
* Add PR link to CHANGELOG entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 5e93387 commit 6383474

7 files changed

Lines changed: 961 additions & 717 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -50,6 +50,7 @@ All notable changes to this project will be documented in this file.
 - Add `--no_auto_dataset_cache` to GRPO and SFT integration test scripts to avoid HuggingFace 504 timeouts on CI runner (https://github.com/allenai/open-instruct/pull/1571).
 
 ### Added
+- Replace `scripts/submit_eval_jobs.py` with a new olmo-eval-internal launcher (Beaker v2, no gantry); the previous script is preserved as `scripts/submit_eval_jobs_old.py` and emits a `DeprecationWarning` (https://github.com/allenai/open-instruct/pull/1638).
 - Add OLMo-core SFT implementation (https://github.com/allenai/open-instruct/pull/1579).
 - Add DR-TULU replication script for Qwen 3.5 4B with evolving rubrics, per-tool pool size overrides, `vllm_qwen3_xml` parser, and `<answer>` tag extraction in rubric scoring (https://github.com/allenai/open-instruct/pull/1609).
 - Add MiniMax provider support: register `minimax-m2.7` and `minimax-m2.7-highspeed` models in `PRICE_PER_TOKEN` for cost tracking and add cl100k_base encoding support in `context_window_checker` (https://github.com/allenai/open-instruct/pull/1602).
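The changelog entry says the preserved old script emits a `DeprecationWarning`. A minimal sketch of that pattern (the message text and helper name here are illustrative, not the actual script's code):

```python
import warnings


def warn_deprecated() -> None:
    # stacklevel=2 attributes the warning to the caller rather than this helper.
    warnings.warn(
        "scripts/submit_eval_jobs_old.py is deprecated; use the new "
        "olmo-eval-internal launcher instead.",
        DeprecationWarning,
        stacklevel=2,
    )


# DeprecationWarning is ignored by default outside __main__, so force it
# to be recorded here to demonstrate that it fires.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_deprecated()
print(caught[0].category.__name__)  # DeprecationWarning
```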

open_instruct/launch_utils.py

Lines changed: 15 additions & 0 deletions
@@ -3,6 +3,8 @@
 
 from transformers.utils import hub as transformers_hub
 
+AUTO_CREATED_BEAKER_CONFIG_DIR = "configs/beaker_configs/auto_created"
+
 WEKA_CLUSTERS = [
     "ai2/jupiter",
     "ai2/saturn",
@@ -96,3 +98,16 @@ def upload_to_gs_bucket(src_path: str, dest_path: str) -> None:
     cmd = ["gsutil", "-o", "GSUtil:parallel_composite_upload_threshold=150M", "cp", "-r", src_path, dest_path]
     print(f"Copying model to GS bucket with command: {cmd}")
     live_subprocess_output(cmd)
+
+
+def validate_beaker_workspace(workspace: str) -> None:
+    parts = workspace.split("/")
+    if len(parts) != 2 or not all(parts):
+        raise ValueError(
+            f"--workspace must be fully qualified as '<org>/<workspace>' (e.g., 'ai2/oe-adapt-general'). Received: '{workspace}'"
+        )
+
+
+def auto_created_spec_path(experiment_name: str) -> str:
+    os.makedirs(AUTO_CREATED_BEAKER_CONFIG_DIR, exist_ok=True)
+    return os.path.join(AUTO_CREATED_BEAKER_CONFIG_DIR, f"{experiment_name}.yaml")
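The two helpers added to launch_utils.py are small enough to exercise standalone. A self-contained sketch reproducing them from the diff (assuming `os` is imported at module top, as the `os.makedirs` call implies):

```python
import os

AUTO_CREATED_BEAKER_CONFIG_DIR = "configs/beaker_configs/auto_created"


def validate_beaker_workspace(workspace: str) -> None:
    # A valid workspace is exactly '<org>/<workspace>' with both parts non-empty.
    parts = workspace.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(
            f"--workspace must be fully qualified as '<org>/<workspace>' "
            f"(e.g., 'ai2/oe-adapt-general'). Received: '{workspace}'"
        )


def auto_created_spec_path(experiment_name: str) -> str:
    # Ensure the auto-created config directory exists, then build the spec path.
    os.makedirs(AUTO_CREATED_BEAKER_CONFIG_DIR, exist_ok=True)
    return os.path.join(AUTO_CREATED_BEAKER_CONFIG_DIR, f"{experiment_name}.yaml")


validate_beaker_workspace("ai2/oe-adapt-general")  # passes silently
try:
    validate_beaker_workspace("oe-adapt-general")  # missing org part
except ValueError as e:
    print("rejected:", e)
print(auto_created_spec_path("my-eval"))
```

Note the validation rejects both a bare workspace name and anything with extra slashes or empty parts (e.g. `ai2/`), since `all(parts)` requires every segment to be non-empty.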

open_instruct/utils.py

Lines changed: 1 addition & 1 deletion
@@ -1237,7 +1237,7 @@ def launch_ai2_evals_on_weka(
     oe_eval_gpu_multiplier: int | None = None,
 ) -> None:
     command = f"""\
-python scripts/submit_eval_jobs.py \
+python scripts/submit_eval_jobs_old.py \
     --model_name {leaderboard_name} \
     --location {path} \
     --is_tuned \
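This change only swaps the script name inside the f-string command template; the trailing backslashes inside the triple-quoted string are escape sequences that splice the lines into one shell command. A minimal sketch of the pattern (illustrative function, not the actual `launch_ai2_evals_on_weka` body):

```python
def build_eval_command(leaderboard_name: str, path: str) -> str:
    # Each backslash-newline inside the triple-quoted f-string removes the
    # newline, so the result is a single-line shell command.
    return f"""\
python scripts/submit_eval_jobs_old.py \
--model_name {leaderboard_name} \
--location {path} \
--is_tuned"""


print(build_eval_command("my-model", "/weka/checkpoints/my-model"))
# python scripts/submit_eval_jobs_old.py --model_name my-model --location /weka/checkpoints/my-model --is_tuned
```

Pointing this template at submit_eval_jobs_old.py keeps the auto-launched evals working with the flag set the old script understands, per the commit message.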

scripts/collect_eval_results.py

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ def make_parser():
     ]
 
     parser = argparse.ArgumentParser(
-        description="""Point this script at a Beaker job created by `submit_eval_jobs.py`.
+        description="""Point this script at a Beaker job created by `submit_eval_jobs_old.py`.
         It will collect all evaluation metrics and dump them in a json
         file. It will also collect summary metrics for each task.""",
         epilog="""Usage example:
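The context lines show the `description`/`epilog` pattern used by collect_eval_results.py's parser. A simplified sketch of that pattern (the `--job_id` argument is illustrative; the real script's arguments differ):

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    # description appears above the options in --help; epilog appears below.
    return_parser = argparse.ArgumentParser(
        description="""Point this script at a Beaker job created by `submit_eval_jobs_old.py`.
It will collect all evaluation metrics and dump them in a json file.""",
        epilog="""Usage example:
python collect_eval_results.py --job_id <id>""",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    return_parser.add_argument("--job_id", help="Beaker job id to collect metrics from.")
    return return_parser


args = make_parser().parse_args(["--job_id", "abc123"])
print(args.job_id)  # abc123
```

`RawDescriptionHelpFormatter` preserves the newlines in the triple-quoted strings; the default formatter would reflow them into one paragraph.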
