Motivation
This PR fixes an answer post-processing issue in the GSM8K evaluation of OpenCompass.
During evaluation, we found that a large portion of predictions marked incorrect were actually correct answers, misjudged because of a formatting issue in the post-processing step.
Of approximately 130 cases marked incorrect, around 50 share the same pattern: the model outputs the correct numeric answer, but the evaluation pipeline fails to extract it.
This is not an instruction-following problem, but a post-processing bug.
In GSM8K-style prompts, monetary values frequently appear with comma formatting such as:
"origin_prompt": "Question: Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?\nLet's think step by step\nAnswer:",
"prediction": "Josh buys the house for $80,000 and spends $50,000 on repairs, so his total investment is:\n\n$80,000 (purchase) + $50,000 (repairs) = $130,000 total invested.\n\nThe repairs increased the value of the house by 150%, meaning the house’s value increased by:\n\n150% of $80,000 = 1.5 × $80,000 = $120,000 increase in value.\n\nSo the new value of the house is:\n\n$80,000 (original value) + $120,000 (increase) = $200,000.\n\nJosh’s profit is the final value minus his total investment:\n\n$200,000 (final value) - $130,000 (total investment) = $70,000 profit.\n\nThe answer is $70,000.",
"gold": "The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000\nHe increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000\nSo the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000\nSo he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n#### 70000"
},
Evaluation output for this sample (only the digits after the comma are extracted as the prediction):
"pred": ["000"],
"answer": ["70000"],
"correct": [false]
Modification
This PR updates the GSM8K post-processing logic in `opencompass/opencompass/datasets/gsm8k.py`.
Specifically, the `gsm8k_postprocess` function is modified to correctly handle numbers formatted with commas (e.g., `70,000`). The updated logic normalizes such formats by removing commas before performing numeric extraction, ensuring that the correct value is parsed.
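The effect of normalizing commas before extraction can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: the helper name `extract_last_number`, the `strip_commas` flag, and both regular expressions are assumptions chosen for illustration. It assumes the extractor returns the last number it finds in the prediction, and it removes only commas that look like thousands separators, which is a slightly more conservative variant than deleting every comma.

```python
import re

# Assumed pattern, for illustration only: a decimal or an integer, optionally negative.
NUMBER_PATTERN = re.compile(r'-?\d+\.\d+|-?\d+')

# A comma preceded by a digit and followed by exactly three digits,
# i.e. a thousands separator such as the one in "70,000".
THOUSANDS_COMMA = re.compile(r'(?<=\d),(?=\d{3}(?!\d))')


def extract_last_number(text: str, strip_commas: bool) -> str:
    """Hypothetical helper: return the last number found in text, or 'NULL'."""
    if strip_commas:
        # Normalize "70,000" to "70000" before searching for numbers.
        text = THOUSANDS_COMMA.sub('', text)
    numbers = NUMBER_PATTERN.findall(text)
    return numbers[-1] if numbers else 'NULL'


prediction_tail = 'The answer is $70,000.'
print(extract_last_number(prediction_tail, strip_commas=False))  # 000
print(extract_last_number(prediction_tail, strip_commas=True))   # 70000
```

Without normalization the comma splits `70,000` into two matches, `70` and `000`, and the last one is returned, which reproduces the `"pred": ["000"]` output shown in the Motivation section. With normalization the full value `70000` is extracted and matches the gold answer.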
Result
After applying this fix, GSM8K evaluation accuracy improves from 90.52 to 95.00.
Many cases that were previously misjudged as incorrect are now scored correctly.
Related Issue
Fixes #2343