Question about evaluating AIME24 accuracy

Hi,

I would like to confirm whether the following command is the correct way to evaluate accuracy on AIME24:

```
TASK=aime24
lighteval vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```

To reproduce DeepSeek's reported AIME24 results, should this command be run 64 times?

Thank you for your help!