Thanks for your work!
I'm curious about the evaluation results you reported for Qwen-2.5-Math-7B-Instruct and hkust-nlp/Qwen-2.5-Math-7B-SimpleRL-Zero:

I used your script to evaluate both Qwen-2.5-Math-7B-Instruct and hkust-nlp/Qwen-2.5-Math-7B-SimpleRL-Zero and got the following results:
The gaps are large enough that they would diminish the reported improvements of Qwen2.5-7B-SimpleRL-Zero when compared with Qwen-2.5-Math-7B-Instruct.