Fix submit_eval_jobs.py for olmo-eval-internal runs #1644

Open

finbarrtimbers wants to merge 8 commits into main from finbarr/fix-eval-script

Conversation

@finbarrtimbers (Collaborator)

Summary

Three fixes to `scripts/submit_eval_jobs.py` so the script can submit working olmo-eval-internal jobs:

  • Drop the `vllm[runai]>=0.19.0` upgrade: it was pulling cu13 wheels on top of the image's cu12.8 install, breaking the CUDA driver/runtime match (`CUDA driver too old, version 12080 found`). The image's bundled vLLM is sufficient.
  • Pin `numpy<2.3` alongside the `transformers>=5.4.0` upgrade: transformers 5.4 pulls NumPy 2.4, which is incompatible with numba (Numba needs NumPy 2.2 or lower). The transformers upgrade is needed for newer-model tokenizer support.
  • Fix the `--sampling_max_tokens` override path: the script was emitting `-o max_tokens=N`, which was rejected because `max_tokens` is not a TaskConfig field. It now emits `-o sampling_params.max_tokens=N` to match the nested field (see the sketch after this list).
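
To make the third fix concrete, here is a minimal sketch of how the override might be emitted; the function and argument names are assumptions for illustration, not the script's actual code:

```python
# Hypothetical sketch of the -o override construction in
# scripts/submit_eval_jobs.py; names are illustrative only.
def build_sampling_overrides(sampling_max_tokens: int | None) -> list[str]:
    overrides: list[str] = []
    if sampling_max_tokens is not None:
        # Before: ["-o", f"max_tokens={sampling_max_tokens}"] was rejected
        # because max_tokens is not a TaskConfig field; the value lives on
        # the nested SamplingParams object.
        overrides += ["-o", f"sampling_params.max_tokens={sampling_max_tokens}"]
    return overrides
```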

Note: the `sampling_params.max_tokens` override only deep-merges correctly with the patch on the allenai/olmo-eval-internal branch `finbarr/fix-sampling-params` (`prepare_task_items` was clobbering the entire SamplingParams dataclass with a dict). Until that lands on main, pass `--olmo_eval_ref finbarr/fix-sampling-params`.
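
For context on the clobbering bug, here is a minimal reproduction, assuming a dataclass-based SamplingParams; the field names are illustrative:

```python
# Minimal sketch of the deep-merge bug described above, assuming a
# dataclass-based SamplingParams; field names are illustrative.
from dataclasses import dataclass, replace

@dataclass
class SamplingParams:
    max_tokens: int = 8192
    temperature: float = 1.0

params = SamplingParams()

# Buggy merge: replacing the dataclass with the raw override dict
# silently drops every field the override does not mention.
# params = {"max_tokens": 16384}  # temperature would be lost

# Deep merge: apply the override onto the existing instance so that
# unspecified fields keep their values.
params = replace(params, max_tokens=16384)
assert params.temperature == 1.0
```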

How to re-run the validation eval

```bash
uv run python scripts/submit_eval_jobs.py \
    --model_name qwen3_4b_base_dapo_20260422_083224 \
    --location 01KPTSPMHGEZVYCDNR0XBVJCGZ \
    --tasks aime_2025:pass_at_32 \
    --max_length 32768 \
    --sampling_max_tokens 16384 \
    --cluster ai2/jupiter ai2/saturn ai2/ceres ai2/neptune \
    --priority urgent \
    --preemptible \
    --workspace ai2/open-instruct-dev \
    --olmo_eval_ref finbarr/fix-sampling-params
```

Runs:

  1. qwen3_4b_base_dapo, aime_2025:pass_at_32 (max_tokens=16384) → pass_at_1:minerva_math_flex = 0.1104 (Beaker)

Test plan

  • End-to-end submission completes successfully (exit 0, ~9 min).
  • The vLLM server starts (no CUDA-driver mismatch, no numba/NumPy conflict; see the import probe after this list).
  • The tokenizer loads (`transformers>=5.4.0` supports the model's tokenizer config).
  • The `sampling_params.max_tokens` override is accepted by the CLI.
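
As a quick sanity check for the second bullet, an import probe can be run inside the job image; this is illustrative and not part of the PR:

```python
# Illustrative probe for the numpy/numba pin; not part of the PR.
import numpy
import numba

print(numpy.__version__)  # expect a version below 2.3
print(numba.__version__)  # a clean import means the pin resolved
```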

GPU_TESTS=bypass

🤖 Generated with Claude Code

@gemini-code-assist (bot) left a comment

Code Review

This pull request updates `scripts/submit_eval_jobs.py` to resolve environment issues by removing a problematic vllm upgrade, pinning `numpy<2.3` for compatibility, and correcting the `max_tokens` override path. Review feedback identified a placeholder `REPLACE_ME` in the CHANGELOG.md entry that needs to be updated with the correct pull request number before merging.
