Are there any important considerations when evaluating the Llama-3-8B-Instruct model on GSM8K? My evaluation result is only 0.44.
I used the command:
python eval_math.py --model /xxx/pretrain/Meta-Llama-3-8B-Instruct --data_file /xxx/eval_math/GSM8K_test_data.jsonl --save_path /xxx/Meta-Llama-3-8B-Instruct.json --tensor_parallel_size 8 --seed 42
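One thing I am unsure about is whether eval_math.py applies the Llama-3 chat template before generation; instruct models usually score much lower when fed raw questions without it. Below is a minimal sketch of the prompt format I believe Llama-3-Instruct expects (`llama3_chat_prompt` is my own illustrative helper, not part of eval_math.py; in practice `tokenizer.apply_chat_template` would build this string):

```python
def llama3_chat_prompt(question: str) -> str:
    """Hand-build the Llama-3-Instruct chat format for a single user turn.

    Illustrative only: normally the tokenizer's apply_chat_template
    produces this, but writing it out makes the special tokens visible.
    """
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

# Example: wrap one GSM8K-style question before sending it to the model.
prompt = llama3_chat_prompt("Natalia sold 48 clips in April and half as many in May. How many clips did she sell in total?")
print(prompt)
```

If the eval script only concatenates a few-shot prefix with the question and skips these special tokens, that alone could explain a depressed score.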
