refactor(gpqa): drop structured runner; ship configs/gpqa/run.sh #96

Open
ishandhanani wants to merge 1 commit into main from refactor/gpqa-script-runner

@ishandhanani
Collaborator

Summary

Ports GPQA to the configs/<bench>/ script-based pattern that AIME moved to in #91. Closes step 1 of follow-up issue #92.

Same nemo-run unquoting hazard motivates the move: any benchmark passing Hydra ++overrides through ns eval is one backslash-bearing extract_regex away from silently-broken evaluation. Script-based path is immune.
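
To make that class of hazard concrete, here is a minimal sketch of how a backslash can be lost when a value is re-parsed by a shell instead of being passed through quoted. This is a generic illustration, not nemo-run's actual code path; the `pattern` value and both helper functions are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: a generic shell re-parse, NOT nemo-run's actual code path.
pattern='answer=\d+'   # hypothetical extract_regex value

# Quoted pass-through: the argument arrives byte-for-byte.
pass_quoted() { printf '%s\n' "$1"; }

# Re-parsed pass-through: the value is spliced into a command string and
# evaluated again, so the shell applies quote removal to it a second time.
pass_reparsed() { eval "printf '%s\n' $1"; }

pass_quoted "$pattern"     # prints: answer=\d+
pass_reparsed "$pattern"   # prints: answer=d+
```

The second call exits successfully, so nothing flags that the regex now matches a literal `d` rather than a digit. A script that builds its own command line avoids the extra parse.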

What ships

Removed

  • src/srtctl/benchmarks/gpqa.py (62 LOC, sglang.test.run_eval-based runner)
  • src/srtctl/benchmarks/scripts/gpqa/bench.sh (53 LOC)
  • BenchmarkType.GPQA enum
  • gpqa registry assertion in tests/test_benchmarks.py
  • GPQA row + section in docs/config-reference.md
  • gpqa mention in examples/example.yaml and docs/architecture.md tree

Added

  • configs/gpqa/run.sh: orchestrates ns prepare_data gpqa + ns eval --benchmarks=gpqa:$REPEAT against localhost:8000/v1 (the in-job dynamo frontend). Tuning knobs (MAX_TOKENS, REPEAT, NUM_THREADS, TEMPERATURE, TOP_P, SEED, DATASET, MODEL) are env-var overridable; defaults match the upstream reasoning-eval reference (max_tokens=400000, repeat=32, temperature=1.0, num_threads=512). Warns at startup if HF_TOKEN is unset (GPQA Diamond is HF-gated).
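
The env-var override pattern described above might look roughly like the sketch below. It is not the contents of `configs/gpqa/run.sh`: the `gpqa_settings` function and its summary line are hypothetical. It covers only the four defaults quoted in this description and prints the resolved settings rather than invoking `ns`.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the real configs/gpqa/run.sh. gpqa_settings is a
# hypothetical helper that resolves knobs and reports them.
gpqa_settings() {
  # Defaults quoted in the PR description; each can be overridden via env.
  local max_tokens="${MAX_TOKENS:-400000}"
  local repeat="${REPEAT:-32}"
  local temperature="${TEMPERATURE:-1.0}"
  local num_threads="${NUM_THREADS:-512}"

  # Warn (without aborting) when the Hugging Face token is missing.
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "WARNING: HF_TOKEN is unset; GPQA Diamond is HF-gated" >&2
  fi

  echo "benchmarks=gpqa:${repeat} max_tokens=${max_tokens} temperature=${temperature} num_threads=${num_threads}"
}

gpqa_settings
```

With this shape, a recipe sets `REPEAT` or `MAX_TOKENS` under `benchmark.env` and the script picks the value up without any extra argument parsing.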

Updated

  • docs/accuracy.md GPQA section → script-based runbook with HF gating note + pointer back to AIME's reasoning-mode env vars
  • src/srtctl/benchmarks/__init__.py — drop gpqa import + __all__ entry
  • 6 existing type: "gpqa" recipes migrated 1:1 to type: custom + bash /configs/gpqa/run.sh, preserving their MAX_TOKENS / REPEAT / NUM_THREADS as env-var overrides
  • 2 h200 recipes' commented-out type: "gpqa" block replaced with a one-line pointer to the new docs

Recipe shape (from new docs section)

benchmark:
  type: custom
  container_image: nemo-skills    # alias from srtslurm.yaml `containers:`
  env:
    OPENAI_API_KEY: "EMPTY"
    HF_TOKEN: "${HF_TOKEN}"       # REQUIRED: GPQA Diamond is HF-gated
    # MAX_TOKENS / REPEAT / TEMPERATURE / SEED / etc. all overridable
  command: |
    bash /configs/gpqa/run.sh

Backward compatibility

None. Recipes with type: gpqa will fail schema validation. Migration is 1:1 to the block above. The 6 in-tree GPQA recipes are migrated in this PR.

Test plan

  • `make check` — 611 passed, 2 skipped (matches pre-change baseline; drops the one gpqa registry assertion)
  • (post-merge) End-to-end on a GB200 cluster: validate `ns prepare_data gpqa` succeeds with `HF_TOKEN` plumbed through `benchmark.env`, eval lands real `metrics.json`

Follow-up

MMLU / longbenchv2 / gsm8k still to come per #92 — separate focused PRs in the order the issue lays out.

🤖 Generated with Claude Code

Ports GPQA to the same `configs/<bench>/` script-based pattern that AIME
moved to in PR #91, per follow-up issue #92. Same nemo-run unquoting
hazard motivates the move: any benchmark passing Hydra `++overrides`
through `ns eval` is one backslash-bearing extract_regex away from
silently-broken evaluation.

Replaces `src/srtctl/benchmarks/gpqa.py` (sglang.test.run_eval-based) with
a NeMo-Skills-driven `configs/gpqa/run.sh` that recipes wire up via
`type: custom`. Defaults match the upstream reasoning-eval reference
(--benchmarks=gpqa:32, max_tokens=400000, temperature=1.0); all knobs
overridable via env. Script warns at startup if HF_TOKEN is unset since
GPQA Diamond is HF-gated.

Removed:
- `src/srtctl/benchmarks/gpqa.py` (62 LOC)
- `src/srtctl/benchmarks/scripts/gpqa/bench.sh` (53 LOC)
- `BenchmarkType.GPQA` enum
- `gpqa` registry assertion in `tests/test_benchmarks.py`
- GPQA row + section in `docs/config-reference.md`
- `gpqa` mention in `examples/example.yaml` and `docs/architecture.md` tree

Updated:
- `docs/accuracy.md` GPQA section → script-based runbook (recipe shape,
  HF gating note, reasoning-mode env vars pointer back to AIME)
- `src/srtctl/benchmarks/__init__.py` — drop gpqa import + __all__ entry
- 6 existing `type: "gpqa"` recipes migrated 1:1 to `type: custom` +
  `bash /configs/gpqa/run.sh`, preserving their MAX_TOKENS / REPEAT /
  NUM_THREADS as env-var overrides
- 2 h200 recipes' commented-out `type: "gpqa"` block replaced with a
  one-line pointer to the new docs

Backward compat: none. Recipes with `type: gpqa` will fail schema
validation — migration is a 1:1 swap to the `type: custom` block above.

Test plan: `make check` — 611 passed, 2 skipped (matches pre-change
baseline). MMLU / longbenchv2 / gsm8k untouched per the issue's
focused-PR-per-benchmark plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Collaborator

@qiching qiching left a comment

LGTM
