Hi,
I would like to confirm whether the following command is the correct way to evaluate accuracy on AIME24:
TASK=aime24
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir "$OUTPUT_DIR"
To reproduce DeepSeek's reported AIME24 results, should this command be run 64 times?
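To make the question concrete: as I understand it, DeepSeek estimates pass@1 as the mean correctness over 64 sampled responses per problem, so I would average the per-run accuracies roughly as sketched below. The `mean_accuracy` helper and the four accuracy values are made up for illustration (they are not real results), and I am assuming each run actually samples a different response (non-zero temperature, different seed).

```shell
# Hypothetical helper: read one per-run accuracy per line on stdin
# and print the mean to four decimal places.
mean_accuracy() {
  awk '{ sum += $1; n += 1 } END { if (n > 0) printf "%.4f\n", sum / n }'
}

# Illustrative, made-up accuracies from 4 runs (a real estimate would use 64).
printf '%s\n' 0.7000 0.7333 0.6667 0.7000 | mean_accuracy
```

If the runs are not independent samples (for example, greedy decoding or a fixed seed), averaging them would presumably just repeat the same number, which is part of what I would like to confirm.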
Thank you for your help!