fix(benchmarks): reasoning-safe token budgets for MMLU/GSM8K (#10199) by lalalune · Pull Request #10670 · elizaOS/eliza

lalalune · 2026-07-01T05:06:47Z

Relates to #10199 (validate the benchmark harnesses themselves, not just the score) and #9943.

The bug (found by running the real target model)

#10199's stated target is gpt-oss-120b — a reasoning model that spends completion tokens on hidden reasoning before the visible answer. Running it live against the standard harnesses surfaced a real scorer defect:

MMLU (max_tokens=256) and GSM8K (max_tokens=384) truncated the model mid-reasoning, so the visible answer was empty (MMLU) or lacked the #### <int> line (GSM8K). Both cases were scored as wrong, silently depressing a real score.
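To illustrate why a truncated GSM8K completion is scored as wrong, here is a minimal sketch of an answer extractor keyed on the `#### <int>` line. The regex and the function name `extract_gsm8k_answer` are assumptions for illustration; the actual parsing in gsm8k.py is not shown in this PR.

```python
import re
from typing import Optional

# Hypothetical pattern: the harness's real regex is not shown in this PR.
ANSWER_LINE = re.compile(r"####\s*(-?\d[\d,]*)")


def extract_gsm8k_answer(visible_text: str) -> Optional[int]:
    """Return the integer after the last '####' marker, or None if absent."""
    matches = ANSWER_LINE.findall(visible_text)
    if not matches:
        # No final-answer line: a scorer can only count this as wrong.
        return None
    return int(matches[-1].replace(",", ""))


complete = "She sells 16 - 3 - 4 = 9 eggs at $2 each.\n#### 18"
truncated = "She sells 16 - 3 - 4 = 9 eggs at"  # cut off by the token budget

print(extract_gsm8k_answer(complete))   # 18
print(extract_gsm8k_answer(truncated))  # None
```

A completion that runs out of budget before the final line yields `None` regardless of whether the reasoning so far was correct, which is the failure mode described above.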

Same items, live on Cerebras (api.cerebras.ai/v1):

| Benchmark | Before | After | Cause of delta |
|---|---|---|---|
| MMLU (25, abstract_algebra) | 0.48, 11/25 empty visible | 0.92, 0 empty | `max_tokens` 256→2048 |
| GSM8K (25) | 0.72 | 1.00 | `max_tokens` 384→2048 |
| HumanEval (15) | 1.00 (already ok at 2048) | 1.00 | unaffected |

Confirmed by inspecting the failures (all "predicted": "<empty>", "empty_visible_output": true) and re-running the identical set with a larger budget.
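As a sanity check on the MMLU numbers, the arithmetic below recovers item counts from the reported scores. It assumes accuracy is simply correct/25 and that every empty output was scored as wrong (as described above); the variable names are illustrative only.

```python
N = 25  # items per benchmark run, from the table

# Before the fix: 0.48 accuracy with 11 empty visible outputs.
correct_before = round(0.48 * N)              # 12
wrong_before = N - correct_before             # 13
empty_before = 11
wrong_nonempty_before = wrong_before - empty_before  # 2

# After the fix: 0.92 accuracy with 0 empty outputs.
correct_after = round(0.92 * N)               # 23
wrong_after = N - correct_after               # 2

# Accuracy restricted to the items that produced a visible answer.
answered_before = N - empty_before            # 14
answered_accuracy_before = correct_before / answered_before

print(correct_before, wrong_before, wrong_nonempty_before)
print(correct_after, wrong_after)
print(round(answered_accuracy_before, 3))     # 0.857
```

The 2 non-empty wrong answers before the fix match the 2 wrong answers after it, which is consistent with (though not proof of) truncation accounting for the entire gap.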

Fix

- mmlu.py: DEFAULT_MAX_TOKENS 256 → 2048; emit a loud truncation warning and expose empty_output_rate in raw_json so a partially empty run is never mistaken for a genuinely low score. A single MCQ letter costs ~1 token, so non-reasoning models stop early and are unaffected; only the headroom for reasoning changes.
- gsm8k.py: runner and CLI --max-tokens default 384 → 2048.
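A minimal sketch of the truncation warning and `empty_output_rate` idea follows. The record fields `predicted` and `empty_visible_output` and the `<empty>` sentinel are taken from the failure records quoted in this PR; the function name `summarize_empty_outputs`, the warning text, and the return shape are assumptions, not the actual mmlu.py code.

```python
import warnings
from typing import Dict, List

EMPTY_SENTINEL = "<empty>"  # the "predicted" value quoted in the PR


def summarize_empty_outputs(records: List[Dict], max_tokens: int) -> Dict:
    """Count empty visible outputs and warn loudly if any are present."""
    total = len(records)
    empty = sum(
        1
        for r in records
        if r.get("empty_visible_output") or r.get("predicted") == EMPTY_SENTINEL
    )
    rate = empty / total if total else 0.0
    if empty:
        # Hypothetical message; the real warning text is not shown in the PR.
        warnings.warn(
            f"{empty}/{total} outputs were empty at max_tokens={max_tokens}; "
            "the score likely reflects truncation, not model quality.",
            RuntimeWarning,
            stacklevel=2,
        )
    return {"empty_outputs": empty, "empty_output_rate": rate}


records = [
    {"predicted": "B", "empty_visible_output": False},
    {"predicted": "<empty>", "empty_visible_output": True},
    {"predicted": "D", "empty_visible_output": False},
    {"predicted": "<empty>", "empty_visible_output": True},
]
print(summarize_empty_outputs(records, max_tokens=256))
```

Reporting the rate alongside the score lets a reader of the result JSON distinguish a run degraded by truncation from a genuinely weak one without rereading the logs.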

Verification

$ python -m pytest benchmarks/standard/tests/test_mmlu.py benchmarks/standard/tests/test_gsm8k.py
26 passed   # incl. a new partial-empty test asserting the truncation warning + rate

# live, NEW defaults (no flags), real gpt-oss-120b:
MMLU  25 → 0.92, empty_outputs 0     GSM8K 25 → 1.00     HumanEval 15 → 1.00
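The "no flags" run above depends on the new CLI default. As a rough sketch of how such a default behaves, assuming an argparse-based CLI (the parser layout and help text here are hypothetical; only the `--max-tokens` flag name and the 384 → 2048 values come from this PR):

```python
import argparse

DEFAULT_MAX_TOKENS = 2048  # new default from this PR; GSM8K previously used 384


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical runner CLI exposing only the token budget."""
    parser = argparse.ArgumentParser(description="Hypothetical GSM8K runner CLI")
    parser.add_argument(
        "--max-tokens",
        type=int,
        default=DEFAULT_MAX_TOKENS,
        help="Completion budget, including any hidden reasoning tokens.",
    )
    return parser


# No flags: the reasoning-safe default applies.
print(build_parser().parse_args([]).max_tokens)                         # 2048
# An explicit flag still overrides it, e.g. to reproduce the old behaviour.
print(build_parser().parse_args(["--max-tokens", "384"]).max_tokens)    # 384
```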

Reviewed scorecard: .github/issue-evidence/10199-standard-benchmarks-reasoning-fix/SCORECARD.md. Raw result JSON/trajectories are generated + not committed (per the benchmarks convention + the verify-artifacts guard from #10664).

Scope: fixes the harness truncation for reasoning models + delivers the graded gpt-oss-120b standard-subset rerun. The full 43-benchmark registry rerun + HITL multi-Codex harness remain the larger #10199 scope.

🤖 Generated with Claude Code

Validating the standard harnesses against the stated target model gpt-oss-120b (a reasoning model) surfaced a real scorer defect: MMLU (max_tokens=256) and GSM8K (max_tokens=384) truncated the model mid-reasoning, so the visible answer was empty / missing the '#### <int>' line and scored as WRONG, silently depressing the score.

Same items, live on Cerebras:
MMLU 25 abstract_algebra: 0.48 (11/25 empty) -> 0.92 (0 empty)
GSM8K 25: 0.72 -> 1.00
HumanEval (default 2048) was already fine (1.00).

- mmlu.py: DEFAULT_MAX_TOKENS 256 -> 2048; emit a loud truncation warning + expose empty_output_rate in raw_json so a partially-empty run is never mistaken for a real low score. A single MCQ letter costs ~1 token, so non-reasoning models stop early and are unaffected; only reasoning headroom changes.
- gsm8k.py: runner + CLI --max-tokens default 384 -> 2048.
- test_mmlu.py: assert the reasoning-safe default + a new partial-empty test proving the truncation warning + empty_output_rate fire (26 standard tests green).

Evidence: .github/issue-evidence/10199-standard-benchmarks-reasoning-fix/SCORECARD.md (reviewed scorecard; raw result JSON is generated + uncommitted per the benchmarks convention).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

lalalune merged commit 45ca374 into develop Jul 1, 2026
10 of 35 checks passed

lalalune deleted the fix/10199-mmlu-reasoning-model-truncation branch July 1, 2026 05:07

lalalune mentioned this pull request Jul 1, 2026

benchmarks: full gpt-oss-120b rerun + HITL multi-Codex harness for Hermes/OpenClaw/elizaOS/Smithers #10199

Open

NubsCarson mentioned this pull request Jul 1, 2026

📋 Open-issues kanban — agent coordination board (all open work, verified 2026-06-30) #10561

Open

15 tasks
