fix(benchmarks): reasoning-safe token budgets for MMLU/GSM8K (#10199) #10670
Validating the standard harnesses against the stated target model gpt-oss-120b (a reasoning model) surfaced a real scorer defect: MMLU (max_tokens=256) and GSM8K (max_tokens=384) truncated the model mid-reasoning, so the visible answer was empty or missing the '#### <int>' line and was scored as WRONG, silently depressing the score.

Same items, live on Cerebras:
  MMLU 25 abstract_algebra: 0.48 (11/25 empty) -> 0.92 (0 empty)
  GSM8K 25: 0.72 -> 1.00
  HumanEval (default 2048) was already fine (1.00).

- mmlu.py: DEFAULT_MAX_TOKENS 256 -> 2048; emit a loud truncation warning + expose empty_output_rate in raw_json so a partially-empty run is never mistaken for a real low score. A single MCQ letter costs ~1 token, so non-reasoning models stop early and are unaffected; only reasoning headroom changes.
- gsm8k.py: runner + CLI --max-tokens default 384 -> 2048.
- test_mmlu.py: assert the reasoning-safe default + a new partial-empty test proving the truncation warning + empty_output_rate fire (26 standard tests green).

Evidence: .github/issue-evidence/10199-standard-benchmarks-reasoning-fix/SCORECARD.md (reviewed scorecard; raw result JSON is generated + uncommitted per the benchmarks convention).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Relates to #10199 (validate the benchmark harnesses themselves, not just the score) and #9943.
The bug (found by running the real target model)
#10199's stated target is gpt-oss-120b — a reasoning model that spends completion tokens on hidden reasoning before the visible answer. Running it live against the standard harnesses surfaced a real scorer defect:
MMLU (max_tokens=256) and GSM8K (max_tokens=384) truncated the model mid-reasoning, so the visible answer was empty (MMLU) or missing the #### <int> line (GSM8K) and was scored as wrong, silently depressing a real score.

Same items, live on Cerebras (api.cerebras.ai/v1):
- MMLU (25 abstract_algebra items), max_tokens 256 → 2048: 0.48 (11/25 empty) → 0.92 (0 empty)
- GSM8K (25 items), max_tokens 384 → 2048: 0.72 → 1.00

Confirmed by inspecting the failures (all "predicted": "<empty>", "empty_visible_output": true) and re-running the identical set with a larger budget.

Fix
- mmlu.py: DEFAULT_MAX_TOKENS 256 → 2048; emit a loud truncation warning + expose empty_output_rate in raw_json so a partially-empty run is never mistaken for a real low score. A single MCQ letter costs ~1 token, so non-reasoning models stop early and are unaffected; only reasoning headroom changes.
- gsm8k.py: runner + CLI --max-tokens default 384 → 2048.

Verification
Reviewed scorecard: .github/issue-evidence/10199-standard-benchmarks-reasoning-fix/SCORECARD.md. Raw result JSON/trajectories are generated and not committed (per the benchmarks convention and the verify-artifacts guard from #10664).

Scope: fixes the harness truncation for reasoning models and delivers the graded gpt-oss-120b standard-subset rerun. The full 43-benchmark registry rerun and the HITL multi-Codex harness remain the larger #10199 scope.
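For reviewers who want the shape of the empty_output_rate check without opening the diff, here is a minimal sketch. It is not the code in mmlu.py: the names ItemResult and summarize_run, the use of an empty string for a missing visible answer, and the warning text are all assumptions made for illustration. It only shows the idea described above, which is to count items with no visible answer, report that rate alongside accuracy, and warn loudly when it is non-zero.

```python
import warnings
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ItemResult:
    """One scored item (hypothetical shape, not the harness's real record)."""
    predicted: str  # visible answer text; "" when nothing visible came back
    correct: bool


def summarize_run(results: List[ItemResult], max_tokens: int = 2048) -> Dict[str, float]:
    """Return accuracy plus the share of items with an empty visible answer.

    A non-zero empty rate usually means the model ran out of completion
    tokens while still reasoning, so the accuracy is a lower bound rather
    than a real score.
    """
    total = len(results)
    if total == 0:
        return {"n": 0, "accuracy": 0.0, "empty_output_rate": 0.0}

    empty = sum(1 for r in results if not r.predicted.strip())
    correct = sum(1 for r in results if r.correct)
    empty_rate = empty / total

    if empty > 0:
        # Loud signal: a partially-empty run must not pass as a low score.
        warnings.warn(
            f"{empty}/{total} items returned no visible answer at "
            f"max_tokens={max_tokens}; the score is likely depressed by "
            "truncation. Re-run with a larger token budget.",
            RuntimeWarning,
            stacklevel=2,
        )

    return {
        "n": total,
        "accuracy": correct / total,
        "empty_output_rate": empty_rate,
    }
```

Feeding it a run shaped like the 256-token MMLU result above (25 items, 11 empty, 12 correct) would report an accuracy of 0.48 together with an empty_output_rate of 0.44 and a RuntimeWarning, which is the signal that the 0.48 should not be read as the model's real score.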
🤖 Generated with Claude Code