Questions about evaluation results #46

Thanks for your work!

I'm curious about the evaluation results you reported for **Qwen-2.5-Math-7B-Instruct** and **hkust-nlp/Qwen-2.5-Math-7B-SimpleRL-Zero**:

<img width="992" alt="Image" src="https://github.com/user-attachments/assets/c0299514-102a-4bd4-9ab3-4c534c944bfe" />

I used your script to evaluate both the **Qwen-2.5-Math-7B-Instruct** and **hkust-nlp/Qwen-2.5-Math-7B-SimpleRL-Zero** models and got the following results:
<img width="1237" alt="Image" src="https://github.com/user-attachments/assets/367159e2-58ab-494d-b4e3-3a7fb89ed31d" />

The gaps between your reported numbers and mine are large enough to diminish the improvement of **Qwen2.5-7B-SimpleRL-Zero** when it is compared to **Qwen-2.5-Math-7B-Instruct**.
