Question about evaluating AIME24 Accuracy #698

@ZJUCQR

Hi,

I would like to confirm whether the following command is the correct way to evaluate accuracy on AIME24:

TASK=aime24
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR

To reproduce DeepSeek's reported AIME24 results, should this command be run 64 times?
Thank you for your help!
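
For concreteness, here is a minimal sketch of what "running it 64 times" could look like, assuming the goal is the mean accuracy over repeated runs. The helper names `run_repeated` and `mean_accuracy` and the `run_$i` output sub-directories are hypothetical, and only the flags from the command above are used. Repeated runs would only differ if generation is sampled rather than greedy, which the command above does not specify.

```shell
# Hypothetical helper: repeat the evaluation command from above N times,
# writing each run to its own sub-directory (run_1, run_2, ...).
# Assumes MODEL_ARGS, TASK and OUTPUT_DIR are already set in the environment.
run_repeated() {
    n_runs="$1"
    i=1
    while [ "$i" -le "$n_runs" ]; do
        lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
            --use-chat-template \
            --output-dir "$OUTPUT_DIR/run_$i"
        i=$((i + 1))
    done
}

# Hypothetical helper: average per-run accuracies passed as arguments.
# Prints the mean with four decimals; fails if no values are given.
mean_accuracy() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.4f\n", sum / NR }'
}

# Example with made-up accuracies from three runs (not real results):
mean_accuracy 0.70 0.80 0.75
```

With this sketch, `run_repeated 64` would produce 64 result directories, and the per-run accuracies read from them would be passed to `mean_accuracy` by hand.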
