Hello,
I have noticed some variability in benchmark scores when running the same GRPO experiment multiple times. I understand that the GRPO code is not entirely reproducible, but have you also observed large run-to-run variation?
Also, have you tried any additional ways to make the runs more reproducible?
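For context, here is a minimal sketch of the kind of seeding I have in mind (the PyTorch/CUDA lines are assumptions about the trainer setup and are shown as comments, since they depend on the environment):

```python
import random

def set_seed(seed: int) -> None:
    """Seed the RNGs that typically drive run-to-run variation."""
    random.seed(seed)
    # In a PyTorch-based GRPO setup (assumption), one would also do:
    # import numpy as np; np.random.seed(seed)
    # import torch; torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    # torch.use_deterministic_algorithms(True)  # may slow training

set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
assert first == second  # identical draws after re-seeding
```

Even with all of this, I understand GPU nondeterminism can still cause some drift, which is why I am asking how large the variation typically is.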
Thank you!