feat: new benchmark OlympiadBench #16

W3en2g · 2025-04-01T12:48:27Z

Description

OlympiadBench was published in February 2024. As with Mathvision, its authors only tested a few closed-source models.

Their code does not include prompts or hyperparameter settings for existing open-source models.

For prompts, we use the Qwen-VL prompt from the OlympiadBench GitHub repository.

Current test results are as follows:
For Qwen/Qwen2-VL-72B-Instruct, Qwen's official score is 11.2.
Using greedy decoding, I obtained an accuracy of 0.11104347826086956 (about 11.1%).
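
To compare the two numbers on the same scale, the raw accuracy can be converted to a percentage. This is only a minimal sketch of that arithmetic; the variable names are illustrative and not part of this PR.

```python
# Values quoted in this description.
official_score_pct = 11.2                # Qwen's reported score, in percent
measured_accuracy = 0.11104347826086956  # fraction of correct answers from this run

# Convert the fraction to a percentage and compute the gap in percentage points.
measured_pct = measured_accuracy * 100
gap = official_score_pct - measured_pct

print(f"measured: {measured_pct:.2f}%, gap to official: {gap:.2f} points")
```

The gap is roughly a tenth of a percentage point.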

The scores for qvq are still being tested; as a reasoning model, qvq requires very long inference time. Among open-source models, I could only find a published score for Qwen2VL-72B. Since my Qwen2VL-72B result is close to Qwen's official report, I believe the current results are acceptable.

Furthermore, OlympiadBench has two types of problems: open-ended questions with ground-truth answers, and theorem-proof problems. Although the dataset includes theorem-proof problems, the authors state in their paper that these cannot be evaluated automatically, so currently only the open-ended questions have been integrated.
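
The restriction to open-ended questions can be pictured as a simple filter over subset names. The sketch below assumes, purely for illustration, that open-ended subsets carry an `OE_` prefix and theorem-proof subsets a `TP_` prefix; the helper `select_open_ended` and the example names are hypothetical and are not taken from this PR.

```python
from typing import Iterable, List

# Assumed naming convention (hypothetical): "OE_" marks open-ended subsets,
# "TP_" marks theorem-proof subsets.
OPEN_ENDED_PREFIX = "OE_"


def select_open_ended(subset_names: Iterable[str]) -> List[str]:
    """Keep only the subsets whose name marks them as open-ended."""
    return [name for name in subset_names if name.startswith(OPEN_ENDED_PREFIX)]


# Illustrative subset names only.
subsets = ["OE_MM_maths_en_COMP", "TP_MM_maths_en_COMP", "OE_TO_physics_en_COMP"]
print(select_open_ended(subsets))
```

Theorem-proof subsets are skipped rather than scored, since no automatic checker is available for them.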

Motivation and Context

Why is this change required? What problem does it solve?
If it fixes an open issue, please link to the issue here.
You can use the syntax close #1314520 if this solves the issue #15213

I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds core functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (update in the documentation)

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

I have read the CONTRIBUTION guide. (required)
My change requires a change to the documentation.
I have updated the tests accordingly. (required for a bug fix or a new feature)
I have updated the documentation accordingly.

eval_anything/dataloader/format_mm_dataset.py

eval_anything/evaluate_tools/t2t_tools.py

Lavezlyn · 2025-04-04T02:48:35Z

eval_anything/benchmarks/text_image_to_text/olympiadbench/configs.yaml

+    data_files: null
+    <<: *task_defaults
+answer_extractor:
+  - name: match_multi-choice_and_open-ended


The name should match the function.
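
One way to see why the name matters: if extractors are looked up by the string in the config, a name that does not match a registered function fails at load time. The following is a sketch under that assumption only; the registry `EXTRACTORS`, the `register` decorator, `get_extractor`, and the function `match_open_ended` are all hypothetical and do not reflect the actual eval_anything code.

```python
from typing import Callable, Dict

# Hypothetical registry mapping config names to extractor functions.
EXTRACTORS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Register an extractor function under the given config name."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        EXTRACTORS[name] = func
        return func
    return decorator


@register("match_open_ended")  # hypothetical name, identical to the function name
def match_open_ended(response: str) -> str:
    """Return the text after the last 'answer:' marker, or the stripped response."""
    marker = "answer:"
    idx = response.lower().rfind(marker)
    if idx == -1:
        return response.strip()
    return response[idx + len(marker):].strip()


def get_extractor(name: str) -> Callable[[str], str]:
    """Look up an extractor by its config name, failing loudly on a mismatch."""
    try:
        return EXTRACTORS[name]
    except KeyError:
        raise KeyError(
            f"No answer extractor registered as {name!r}; "
            f"available: {sorted(EXTRACTORS)}"
        ) from None


print(get_extractor("match_open_ended")("Answer: 42"))
```

With this kind of lookup, a config entry such as `match_multi-choice_and_open-ended` would raise a `KeyError` unless a function is registered under exactly that string.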

W3en2g and others added 2 commits April 1, 2025 20:29

feat: new benchmark OlympiadBench

0fb9768

Merge branch 'main' into OlympiadBench

dd42a0a

Kass123777 reviewed Apr 1, 2025


eval_anything/dataloader/format_mm_dataset.py (outdated)

Kass123777 reviewed Apr 1, 2025


eval_anything/evaluate_tools/t2t_tools.py (outdated)

Lavezlyn reviewed Apr 4, 2025


fix: refine OlympiadBench evaluate settings

0da2af5
