
Conversation

W3en2g (Contributor) commented on Apr 1, 2025

Description

Add new benchmark OlympiadBench

OlympiadBench was published in February 2024. Similar to MathVision, its authors evaluated only a few closed-source models.

Their code does not include prompts or hyperparameter settings for existing open-source models.

For the prompt, we use the Qwen-VL prompt from the OlympiadBench GitHub repository.

Current test results are as follows:
For Qwen/Qwen2-VL-72B-Instruct, Qwen's officially reported score is 11.2.
Using greedy decoding, I obtained a score of 0.11104347826086956 (about 11.1%), which is close to the official number.

The scores for QVQ are still being tested; as a reasoning model, QVQ requires very long inference times. Additionally, among open-source models I could only find a reported score for Qwen2-VL-72B. Therefore, having tested Qwen2-VL-72B and obtained a score close to Qwen's official report, I believe the current results are acceptable.

Furthermore, OlympiadBench contains two types of problems: open-ended questions with ground-truth answers, and theorem-proof problems. Although the dataset provides theorem-proof problems, the authors note in their paper that these cannot be evaluated automatically, so only the open-ended questions have been integrated for now. A sketch of this filtering is shown below.
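
For reference, a minimal sketch of the open-ended filtering described above; the field names ("question_type", "final_answer") are assumptions about the dataset schema, not necessarily the exact keys used in this PR:

def keep_open_ended(records):
    # Keep only open-ended questions that come with a ground-truth answer;
    # theorem-proof problems are dropped because they cannot be auto-scored.
    # "question_type" and "final_answer" are assumed field names.
    return [
        r for r in records
        if r.get("question_type") == "Open-ended" and r.get("final_answer")
    ]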

Motivation and Context

Why is this change required? What problem does it solve?
If it fixes an open issue, please link to the issue here.
You can use the syntax close #1314520 if this solves the issue #15213

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

data_files: null
<<: *task_defaults
answer_extractor:
- name: match_multi-choice_and_open-ended

The name should match the function.
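
For illustration, a minimal sketch of why the mismatch matters, assuming a hypothetical registry that looks answer extractors up by function name (the actual registration mechanism and function name in this repository may differ):

ANSWER_EXTRACTORS = {}

def register_extractor(fn):
    # Hypothetical registry: the YAML "name" field is used as the lookup key,
    # so it must match the registered function's name exactly.
    ANSWER_EXTRACTORS[fn.__name__] = fn
    return fn

@register_extractor
def match_multi_choice_and_open_ended(prediction):
    # Placeholder logic; the real extractor parses multi-choice letters
    # or open-ended answers from the model output.
    return prediction.strip()

# A config value like "match_multi-choice_and_open-ended" would fail this lookup
# unless it matches the registered function name character for character.
extractor = ANSWER_EXTRACTORS["match_multi_choice_and_open_ended"]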

