Fix submit_eval_jobs.py for olmo-eval-internal runs #1644

Open

finbarrtimbers wants to merge 8 commits into main from finbarr/fix-eval-script

Conversation

@finbarrtimbers (Collaborator)

Summary

Three fixes to `scripts/submit_eval_jobs.py` so the script can submit working olmo-eval-internal jobs:

  • Drop the `vllm[runai]>=0.19.0` upgrade: it was pulling cu13 wheels on top of the image's cu12.8 install, breaking the CUDA driver/runtime match (`CUDA driver too old, version 12080 found`). The image's bundled vLLM is sufficient.
  • Pin `numpy<2.3` alongside the `transformers>=5.4.0` upgrade: transformers 5.4 pulls NumPy 2.4, which is incompatible with numba (Numba needs NumPy 2.2 or lower). The transformers upgrade is needed for newer-model tokenizer support.
  • Fix the `--sampling_max_tokens` override path: the script was emitting `-o max_tokens=N`, which was rejected because `max_tokens` is not a TaskConfig field. It now emits `-o sampling_params.max_tokens=N` to match the nested field (see the sketch after this list).
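
To make the third fix concrete, here is a minimal sketch of how the override might be emitted; the function and argument names are assumptions for illustration, not the script's actual code:

```python
# Hypothetical sketch of the -o override construction in
# scripts/submit_eval_jobs.py; names are illustrative only.
def build_sampling_overrides(sampling_max_tokens: int | None) -> list[str]:
    overrides: list[str] = []
    if sampling_max_tokens is not None:
        # Before: ["-o", f"max_tokens={sampling_max_tokens}"] was rejected
        # because max_tokens is not a TaskConfig field; the value lives on
        # the nested SamplingParams object.
        overrides += ["-o", f"sampling_params.max_tokens={sampling_max_tokens}"]
    return overrides
```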

Note: the `sampling_params.max_tokens` override only deep-merges correctly with the patch on the allenai/olmo-eval-internal branch `finbarr/fix-sampling-params` (`prepare_task_items` was clobbering the entire SamplingParams dataclass with a dict). Until that lands on main, pass `--olmo_eval_ref finbarr/fix-sampling-params`.
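
For context on the clobbering bug, here is a minimal reproduction, assuming a dataclass-based SamplingParams; the field names are illustrative:

```python
# Minimal sketch of the deep-merge bug described above, assuming a
# dataclass-based SamplingParams; field names are illustrative.
from dataclasses import dataclass, replace

@dataclass
class SamplingParams:
    max_tokens: int = 8192
    temperature: float = 1.0

params = SamplingParams()

# Buggy merge: replacing the dataclass with the raw override dict
# silently drops every field the override does not mention.
# params = {"max_tokens": 16384}  # temperature would be lost

# Deep merge: apply the override onto the existing instance so that
# unspecified fields keep their values.
params = replace(params, max_tokens=16384)
assert params.temperature == 1.0
```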

How to re-run the validation eval

```bash
uv run python scripts/submit_eval_jobs.py \
    --model_name qwen3_4b_base_dapo_20260422_083224 \
    --location 01KPTSPMHGEZVYCDNR0XBVJCGZ \
    --tasks aime_2025:pass_at_32 \
    --max_length 32768 \
    --sampling_max_tokens 16384 \
    --cluster ai2/jupiter ai2/saturn ai2/ceres ai2/neptune \
    --priority urgent \
    --preemptible \
    --workspace ai2/open-instruct-dev \
    --olmo_eval_ref finbarr/fix-sampling-params
```

Runs:

  1. qwen3_4b_base_dapo, aime_2025:pass_at_32 (max_tokens=16384) → pass_at_1:minerva_math_flex = 0.1104 (Beaker)

Test plan

  • End-to-end submission completes successfully (exit 0, ~9 min).
  • The vLLM server starts (no CUDA-driver mismatch, no numba/NumPy conflict; see the import probe after this list).
  • The tokenizer loads (`transformers>=5.4.0` supports the model's tokenizer config).
  • The `sampling_params.max_tokens` override is accepted by the CLI.
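
As a quick sanity check for the second bullet, an import probe can be run inside the job image; this is illustrative and not part of the PR:

```python
# Illustrative probe for the numpy/numba pin; not part of the PR.
import numpy
import numba

print(numpy.__version__)  # expect a version below 2.3
print(numba.__version__)  # a clean import means the pin resolved
```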

GPU_TESTS=bypass

🤖 Generated with Claude Code

@gemini-code-assist (bot) left a comment

Code Review

This pull request updates `scripts/submit_eval_jobs.py` to resolve environment issues by removing a problematic vllm upgrade, pinning `numpy<2.3` for compatibility, and correcting the `max_tokens` override path. Review feedback identified a placeholder `REPLACE_ME` in the CHANGELOG.md entry that needs to be updated with the correct pull request number before merging.
