refactor(gpqa): drop structured runner; ship configs/gpqa/run.sh by ishandhanani · Pull Request #96 · NVIDIA/srt-slurm

ishandhanani · 2026-04-27T15:53:20Z

Summary

Ports GPQA to the configs/<bench>/ script-based pattern that AIME moved to in #91. Closes step 1 of follow-up issue #92.

Same nemo-run unquoting hazard motivates the move: any benchmark passing Hydra ++overrides through ns eval is one backslash-bearing extract_regex away from silently-broken evaluation. Script-based path is immune.
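
As a generic illustration of that hazard class, here is what one extra round of shell word-parsing does to a backslash-bearing pattern. This is plain shell behaviour, not nemo-run's actual code path (which this PR does not show), and the pattern is a made-up stand-in for an extract_regex value.

```shell
#!/bin/sh
# Hypothetical pattern standing in for an extract_regex value.
pattern='\d+\.\d+'

# Hazard: the value is spliced into text that the shell parses again.
# Unquoted backslashes are consumed as escape characters during that parse.
eval "reparsed=$pattern"

# Safer: the re-parsed text only names the variable, so the value itself
# is expanded once and never re-tokenized.
eval 'preserved=$pattern'

printf 'original : %s\n' "$pattern"
printf 'reparsed : %s\n' "$reparsed"
printf 'preserved: %s\n' "$preserved"
```

The first eval yields d+.d+, which is still a valid regex, so nothing errors out; it simply matches different text. That is the "silently-broken" part.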

What ships

Removed

src/srtctl/benchmarks/gpqa.py (62 LOC, sglang.test.run_eval-based runner)
src/srtctl/benchmarks/scripts/gpqa/bench.sh (53 LOC)
BenchmarkType.GPQA enum
gpqa registry assertion in tests/test_benchmarks.py
GPQA row + section in docs/config-reference.md
gpqa mention in examples/example.yaml and docs/architecture.md tree

Added

configs/gpqa/run.sh: orchestrates ns prepare_data gpqa followed by ns eval --benchmarks=gpqa:$REPEAT against localhost:8000/v1 (the in-job dynamo frontend). The tuning knobs (MAX_TOKENS, REPEAT, NUM_THREADS, TEMPERATURE, TOP_P, SEED, DATASET, MODEL) are overridable via environment variables; defaults match the upstream reasoning-eval reference (max_tokens=400000, repeat=32, temperature=1.0, num_threads=512). The script warns at startup if HF_TOKEN is unset, because GPQA Diamond is HF-gated.
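
The env-var override pattern described above can be sketched with POSIX default expansion. This is not the shipped run.sh, whose contents are not shown here. The helper names are hypothetical; only the defaults and the --benchmarks=gpqa:$REPEAT argument come from this description.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not the contents of configs/gpqa/run.sh.

# Print the resolved knobs, falling back to the defaults listed in this PR.
gpqa_resolved_knobs() {
  printf 'MAX_TOKENS=%s\n'  "${MAX_TOKENS:-400000}"
  printf 'REPEAT=%s\n'      "${REPEAT:-32}"
  printf 'TEMPERATURE=%s\n' "${TEMPERATURE:-1.0}"
  printf 'NUM_THREADS=%s\n' "${NUM_THREADS:-512}"
}

# Build the benchmark selector that is passed to ns eval.
gpqa_benchmarks_arg() {
  printf '%s\n' "--benchmarks=gpqa:${REPEAT:-32}"
}

# Warn, but do not abort, when the gated-dataset token is missing.
gpqa_warn_if_no_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "WARNING: HF_TOKEN is unset; GPQA Diamond is HF-gated" >&2
  fi
}

gpqa_warn_if_no_token
gpqa_resolved_knobs
gpqa_benchmarks_arg
```

With defaults resolved this way, a recipe's env: block or a one-off REPEAT=4 bash /configs/gpqa/run.sh changes a knob without editing the script.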

Updated

docs/accuracy.md GPQA section → script-based runbook with HF gating note + pointer back to AIME's reasoning-mode env vars
src/srtctl/benchmarks/__init__.py — drop gpqa import + __all__ entry
6 existing type: "gpqa" recipes migrated 1:1 to type: custom + bash /configs/gpqa/run.sh, preserving their MAX_TOKENS / REPEAT / NUM_THREADS as env-var overrides
2 h200 recipes' commented-out type: "gpqa" block replaced with a one-line pointer to the new docs

Recipe shape (from new docs section)

benchmark:
  type: custom
  container_image: nemo-skills    # alias from srtslurm.yaml `containers:`
  env:
    OPENAI_API_KEY: "EMPTY"
    HF_TOKEN: "${HF_TOKEN}"       # REQUIRED: GPQA Diamond is HF-gated
    # MAX_TOKENS / REPEAT / TEMPERATURE / SEED / etc. all overridable
  command: |
    bash /configs/gpqa/run.sh

Backward compatibility

None. Recipes with type: gpqa will fail schema validation. Migration is 1:1 to the block above. The 6 in-tree GPQA recipes are migrated in this PR.
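
For recipes kept outside this repository, a quick scan can list the files that still declare the removed type. This is a minimal sketch, assuming recipes are plain YAML files under a directory you pass in; the function name and the recipes/ path in the usage comment are hypothetical.

```shell
#!/bin/sh
# List files under "$1" that contain an active (uncommented) gpqa type line.
# Matches both the bare form (type: gpqa) and the quoted form (type: "gpqa").
find_legacy_gpqa() {
  grep -rlE '^[[:space:]]*type:[[:space:]]*"?gpqa"?[[:space:]]*(#.*)?$' "$1"
}

# Usage (hypothetical path):
#   find_legacy_gpqa recipes/
```

Commented-out blocks are skipped because the line must begin with type: after optional indentation.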

Test plan

`make check` — 611 passed, 2 skipped (matches pre-change baseline; drops the one gpqa registry assertion)
(post-merge) End-to-end on a GB200 cluster: validate `ns prepare_data gpqa` succeeds with `HF_TOKEN` plumbed through `benchmark.env`, eval lands real `metrics.json`

Follow-up

MMLU / longbenchv2 / gsm8k still to come per #92 — separate focused PRs in the order the issue lays out.

🤖 Generated with Claude Code

Ports GPQA to the same `configs/<bench>/` script-based pattern that AIME moved to in PR #91, per follow-up issue #92. Same nemo-run unquoting hazard motivates the move: any benchmark passing Hydra `++overrides` through `ns eval` is one backslash-bearing extract_regex away from silently-broken evaluation.

Replaces `src/srtctl/benchmarks/gpqa.py` (sglang.test.run_eval-based) with a NeMo-Skills-driven `configs/gpqa/run.sh` that recipes wire up via `type: custom`. Defaults match the upstream reasoning-eval reference (--benchmarks=gpqa:32, max_tokens=400000, temperature=1.0); all knobs overridable via env. Script warns at startup if HF_TOKEN is unset since GPQA Diamond is HF-gated.

Removed:
- `src/srtctl/benchmarks/gpqa.py` (62 LOC)
- `src/srtctl/benchmarks/scripts/gpqa/bench.sh` (53 LOC)
- `BenchmarkType.GPQA` enum
- `gpqa` registry assertion in `tests/test_benchmarks.py`
- GPQA row + section in `docs/config-reference.md`
- `gpqa` mention in `examples/example.yaml` and `docs/architecture.md` tree

Updated:
- `docs/accuracy.md` GPQA section → script-based runbook (recipe shape, HF gating note, reasoning-mode env vars pointer back to AIME)
- `src/srtctl/benchmarks/__init__.py`: drop gpqa import + __all__ entry
- 6 existing `type: "gpqa"` recipes migrated 1:1 to `type: custom` + `bash /configs/gpqa/run.sh`, preserving their MAX_TOKENS / REPEAT / NUM_THREADS as env-var overrides
- 2 h200 recipes' commented-out `type: "gpqa"` block replaced with a one-line pointer to the new docs

Backward compat: none. Recipes with `type: gpqa` will fail schema validation; migration is a 1:1 swap to the `type: custom` block above.

Test plan: `make check`: 611 passed, 2 skipped (matches pre-change baseline).

MMLU / longbenchv2 / gsm8k untouched per the issue's focused-PR-per-benchmark plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>

qiching

LGTM

ishandhanani requested review from alec-flowers, csahithi, hjjq, kedarpotdar-nv, kyleliang-nv, nlevin-ui and qiching as code owners April 27, 2026 15:53

qiching approved these changes Apr 29, 2026

ishandhanani wants to merge 1 commit into main from refactor/gpqa-script-runner
