fix(benchmarks): reasoning-safe token budgets for MMLU/GSM8K (#10199) #10670

Merged
lalalune merged 1 commit into develop from fix/10199-mmlu-reasoning-model-truncation on Jul 1, 2026

@lalalune lalalune commented Jul 1, 2026

Member

Relates to #10199 (validate the benchmark harnesses themselves, not just the score) and #9943.

The bug (found by running the real target model)

#10199's stated target is gpt-oss-120b — a reasoning model that spends completion tokens on hidden reasoning before the visible answer. Running it live against the standard harnesses surfaced a real scorer defect:

  • MMLU (max_tokens=256) and GSM8K (max_tokens=384) truncated the model mid-reasoning, so the visible answer was empty (MMLU) or missing the #### <int> line (GSM8K), scored as wrong — silently depressing a real score.
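
To make the failure mode concrete, here is a minimal sketch of how a truncated response turns into a wrong score. The extractor names (`extract_mcq_letter`, `extract_gsm8k_answer`) and their regexes are hypothetical illustrations, not the parsing code in mmlu.py or gsm8k.py, which may differ.

```python
import re
from typing import Optional


def extract_mcq_letter(visible: str) -> Optional[str]:
    """Hypothetical MMLU parser: first standalone A-D letter, or None."""
    match = re.search(r"\b([ABCD])\b", visible.strip())
    return match.group(1) if match else None


def extract_gsm8k_answer(visible: str) -> Optional[int]:
    """Hypothetical GSM8K parser: integer after the last '####' marker, or None."""
    matches = re.findall(r"####\s*(-?\d[\d,]*)", visible)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


# A reasoning model that exhausts max_tokens during hidden reasoning
# returns no visible text at all, so nothing can be extracted.
truncated_mmlu = ""
# A GSM8K response cut off mid-solution never reaches the '####' line.
truncated_gsm8k = "She sells 16 - 3"
complete_gsm8k = "She sells 16 - 3 - 4 = 9 eggs at $2 each.\n#### 18"

print(extract_mcq_letter(truncated_mmlu))     # None -> scored as wrong
print(extract_gsm8k_answer(truncated_gsm8k))  # None -> scored as wrong
print(extract_gsm8k_answer(complete_gsm8k))   # 18
```

In both truncated cases the scorer sees "no answer" rather than "wrong answer", which is why the result is indistinguishable from a genuine miss unless empties are tracked separately.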

Same items, live on Cerebras (api.cerebras.ai/v1):

| Benchmark | Before | After | Delta cause |
| --- | --- | --- | --- |
| MMLU (25, abstract_algebra) | 0.48, 11/25 empty visible | 0.92, 0 empty | max_tokens 256→2048 |
| GSM8K (25) | 0.72 | 1.00 | max_tokens 384→2048 |
| HumanEval (15) | 1.00 (already ok at 2048) | 1.00 | unaffected |
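
As a quick sanity check on the figures above, the arithmetic can be worked through in a few lines. This is only a consistency check on the reported numbers; the variable names are illustrative, and it does not show which specific items changed.

```python
from fractions import Fraction

N = 25  # items per benchmark subset, as reported above

# MMLU: accuracy 0.48 -> 0.92, with 11 empty visible outputs before the fix.
mmlu_correct_before = Fraction(48, 100) * N  # 12 items
mmlu_correct_after = Fraction(92, 100) * N   # 23 items
mmlu_empty_before = 11

# The gain in correct items equals the number of previously empty outputs,
# which is consistent with (though not proof of) truncation explaining the gap.
mmlu_gain = mmlu_correct_after - mmlu_correct_before

# GSM8K: accuracy 0.72 -> 1.00.
gsm8k_correct_before = Fraction(72, 100) * N  # 18 items
gsm8k_gain = N - gsm8k_correct_before         # 7 items recovered

print(mmlu_gain, gsm8k_gain)
```

The MMLU gain of 11 items matches the 11 empty outputs reported before the fix.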

Confirmed by inspecting the failures (all "predicted": "<empty>", "empty_visible_output": true) and re-running the identical set with a larger budget.
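
The inspection step can be sketched roughly as follows. The two field names (`predicted`, `empty_visible_output`) come from the failures quoted above; the surrounding record layout and the sample data are made up for illustration and are not the harness's actual result schema.

```python
import json

# Illustrative records only; real result files are generated and not committed.
raw = """[
  {"predicted": "<empty>", "empty_visible_output": true},
  {"predicted": "B", "empty_visible_output": false},
  {"predicted": "<empty>", "empty_visible_output": true}
]"""

records = json.loads(raw)

# Select the failures that were caused by an empty visible answer.
empty_failures = [
    r for r in records
    if r["empty_visible_output"] and r["predicted"] == "<empty>"
]

print(f"{len(empty_failures)} of {len(records)} records had no visible answer")
```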

Fix

  • mmlu.py: DEFAULT_MAX_TOKENS 256 → 2048; emit a loud truncation warning + expose empty_output_rate in raw_json so a partially-empty run is never mistaken for a real low score. A single MCQ letter costs ~1 token — non-reasoning models stop early and are unaffected; only reasoning headroom changes.
  • gsm8k.py: runner + CLI --max-tokens default 384 → 2048.
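
The warning and rate described above could look something like the sketch below. It is an assumption-laden illustration: `empty_output_rate` is the key named in this PR, but the helper names, logger name, warning text, and summary layout are hypothetical and not copied from mmlu.py.

```python
import logging
from typing import Dict, Sequence

logger = logging.getLogger("benchmarks.mmlu_sketch")  # hypothetical logger name


def empty_output_rate(visible_outputs: Sequence[str]) -> float:
    """Fraction of responses whose visible text is empty or whitespace."""
    if not visible_outputs:
        return 0.0
    empty = sum(1 for text in visible_outputs if not text.strip())
    return empty / len(visible_outputs)


def summarize(visible_outputs: Sequence[str], max_tokens: int) -> Dict[str, float]:
    """Build a summary and warn loudly when any output came back empty."""
    rate = empty_output_rate(visible_outputs)
    if rate > 0:
        logger.warning(
            "%.0f%% of outputs were empty at max_tokens=%d; "
            "the score is likely depressed by truncation",
            rate * 100,
            max_tokens,
        )
    return {"empty_output_rate": rate, "max_tokens": max_tokens}


# 14 answered items and 11 empty ones, mirroring the 11/25 case reported above.
summary = summarize(["B"] * 14 + [""] * 11, max_tokens=256)
print(summary)
```

Reporting the rate alongside the score lets a reader tell a partially empty run apart from a genuinely low score without opening the per-item results.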

Verification

$ python -m pytest benchmarks/standard/tests/test_mmlu.py benchmarks/standard/tests/test_gsm8k.py
26 passed   # incl. a new partial-empty test asserting the truncation warning + rate

# live, NEW defaults (no flags), real gpt-oss-120b:
MMLU  25 → 0.92, empty_outputs 0     GSM8K 25 → 1.00     HumanEval 15 → 1.00

Reviewed scorecard: .github/issue-evidence/10199-standard-benchmarks-reasoning-fix/SCORECARD.md. Raw result JSON/trajectories are generated + not committed (per the benchmarks convention + the verify-artifacts guard from #10664).

Scope: fixes the harness truncation for reasoning models + delivers the graded gpt-oss-120b standard-subset rerun. The full 43-benchmark registry rerun + HITL multi-Codex harness remain the larger #10199 scope.

🤖 Generated with Claude Code

Validating the standard harnesses against the stated target model gpt-oss-120b
(a reasoning model) surfaced a real scorer defect: MMLU (max_tokens=256) and
GSM8K (max_tokens=384) truncated the model mid-reasoning, so the visible
answer was empty / missing the '#### <int>' line and scored as WRONG —
silently depressing the score. Same items, live on Cerebras:
  MMLU  25 abstract_algebra: 0.48 (11/25 empty) -> 0.92 (0 empty)
  GSM8K 25:                   0.72 -> 1.00
HumanEval (default 2048) was already fine (1.00).

- mmlu.py: DEFAULT_MAX_TOKENS 256 -> 2048; emit a loud truncation warning +
  expose empty_output_rate in raw_json so a partially-empty run is never
  mistaken for a real low score. A single MCQ letter costs ~1 token, so
  non-reasoning models stop early and are unaffected — only reasoning headroom
  changes.
- gsm8k.py: runner + CLI --max-tokens default 384 -> 2048.
- test_mmlu.py: assert the reasoning-safe default + a new partial-empty test
  proving the truncation warning + empty_output_rate fire (26 standard tests
  green).

Evidence: .github/issue-evidence/10199-standard-benchmarks-reasoning-fix/SCORECARD.md
(reviewed scorecard; raw result JSON is generated + uncommitted per the
benchmarks convention).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@lalalune lalalune merged commit 45ca374 into develop Jul 1, 2026
10 of 35 checks passed
@lalalune lalalune deleted the fix/10199-mmlu-reasoning-model-truncation branch July 1, 2026 05:07