[BugFix] Fix gsm8k postprocess by Hibbert133 · Pull Request #2426 · open-compass/opencompass

Hibbert133 · 2026-03-30T09:34:50Z

Motivation

This PR fixes an answer post-processing issue in the GSM8K evaluation of OpenCompass.

During evaluation, we found that many predictions scored as incorrect were actually correct answers, misjudged because of a formatting issue in the post-processing step.

Among approximately 130 incorrect cases, around 50 share the same pattern: the model outputs the correct numeric answer, but the evaluation pipeline fails to extract it.

This is not an instruction-following problem, but a post-processing bug.

In GSM8K-style prompts, monetary values frequently appear with commas as thousands separators, for example:

"origin_prompt": "Question: Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?\nLet's think step by step\nAnswer:",
"prediction": "Josh buys the house for $80,000 and spends $50,000 on repairs, so his total investment is:\n\n$80,000 (purchase) + $50,000 (repairs) = $130,000 total invested.\n\nThe repairs increased the value of the house by 150%, meaning the house’s value increased by:\n\n150% of $80,000 = 1.5 × $80,000 = $120,000 increase in value.\n\nSo the new value of the house is:\n\n$80,000 (original value) + $120,000 (increase) = $200,000.\n\nJosh’s profit is the final value minus his total investment:\n\n$200,000 (final value) - $130,000 (total investment) = $70,000 profit.\n\nThe answer is $70,000.",
"gold": "The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000\nHe increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000\nSo the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000\nSo he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n#### 70000"

Printed result for this case:

"pred": ["000"],
"answer": ["70000"],
"correct": [false]

Modification

This PR updates the GSM8K post-processing logic in: opencompass/opencompass/datasets/gsm8k.py

Specifically, the gsm8k_postprocess function is modified to correctly handle numbers formatted with commas (e.g., 70,000).

The updated logic normalizes such formats by removing commas before performing numeric extraction, ensuring that the correct value is parsed.
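
A minimal sketch of the failure and of the normalization described above, assuming the extractor collects number-like tokens with a regular expression and keeps the last one. The names `last_number` and `last_number_normalized` and both patterns are illustrative assumptions, not the code in gsm8k.py. The sketch strips only commas that sit between two digits, a slightly narrower rule than removing every comma.

```python
import re

# Assumed extraction pattern: decimals first, then plain integers.
NUMBER_PATTERN = re.compile(r'-?\d+\.\d+|-?\d+')


def last_number(text: str) -> str:
    """Return the last number-like token in text, or 'NULL' if none is found."""
    numbers = NUMBER_PATTERN.findall(text)
    return numbers[-1] if numbers else 'NULL'


def last_number_normalized(text: str) -> str:
    """Strip commas that sit between digits, then extract the last number."""
    # "70,000" -> "70000"; commas followed by a space or a letter are kept.
    # Caveat: an unspaced list such as "1,2" would also be joined into "12".
    return last_number(re.sub(r'(?<=\d),(?=\d)', '', text))


tail = 'The answer is $70,000.'
print(last_number(tail))             # prints "000": the comma splits the number
print(last_number_normalized(tail))  # prints "70000"
```

Restricting the substitution to digit-comma-digit leaves ordinary prose commas untouched, so sentence punctuation cannot affect which token ends up last.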

Result

After applying this fix, GSM8K evaluation accuracy improves from 90.52 to 95.00.

A significant portion of previously incorrect cases are now correctly evaluated.
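
To illustrate how the example from the Motivation section is scored once commas are handled, the sketch below compares the extracted prediction with the reference value that follows the `####` marker in the gold string. The function names are hypothetical, and the plain string comparison is an assumption about the evaluator rather than a description of it.

```python
import re


def extract_gold(gold: str) -> str:
    """Take the text after the final '####' marker and drop any commas."""
    return gold.split('####')[-1].strip().replace(',', '')


def extract_pred(prediction: str) -> str:
    """Remove commas between digits, then return the last number found."""
    cleaned = re.sub(r'(?<=\d),(?=\d)', '', prediction)
    numbers = re.findall(r'-?\d+\.\d+|-?\d+', cleaned)
    return numbers[-1] if numbers else 'NULL'


# Shortened tails of the gold and prediction strings from the example.
gold = 'So he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n#### 70000'
prediction = ('$200,000 (final value) - $130,000 (total investment) = $70,000 profit.'
              '\n\nThe answer is $70,000.')

print(extract_gold(gold))                              # prints "70000"
print(extract_pred(prediction) == extract_gold(gold))  # prints "True"
```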

Related Issue

Fixes #2343

Checklist

Before PR

Pre-commit or other linting tools are used to fix potential lint issues.
Bug fixes are fully covered by unit tests, including the case that triggers this issue.
The modification is covered by complete unit tests to ensure correctness.
Documentation has been updated accordingly (e.g., docstrings).

After PR

If the modification has potential influence on downstream or related projects, those projects should also be tested.
CLA has been signed and all committers have signed the CLA for this PR.

[bugfix] Fix gsm8k postprocess

73e1054

mm-assistant Bot assigned bittersweet1999 Mar 30, 2026

Hibbert133 changed the title ~~[bugfix] Fix gsm8k postprocess~~ [BugFix] Fix gsm8k postprocess Apr 1, 2026

Hibbert133 wants to merge 1 commit into open-compass:main from Hibbert133:fix_gsm8k_postprocess
