Description
Thank you for the great work and thank you for the findings on tuning PPO!
I tried RLHF with a customized deterministic reward function. However, the training rewards fluctuate. May I ask for your suggestions on why this happens? I suspect it is because the actor does not explore enough. I am wondering whether we need to tune the generation kwargs when the actor generates its outputs, for example the temperature. A rough sketch of what I mean is below.
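
To illustrate, here is a minimal sketch (not any library's exact API) of how one might adjust the actor's sampling settings during rollout to encourage exploration. The model name, prompt, and the values in `generation_kwargs` are assumptions for illustration only.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical actor model for illustration; in practice this would be the
# policy being trained with PPO.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
actor = AutoModelForCausalLM.from_pretrained("gpt2")

# Sampling (rather than greedy decoding) with a higher temperature makes the
# actor's rollouts more diverse, which may help exploration but can also add
# variance to the reward signal.
generation_kwargs = {
    "do_sample": True,      # sample instead of greedy decoding
    "temperature": 1.0,     # >1.0 flattens the distribution, <1.0 sharpens it
    "top_k": 0,             # 0 disables top-k filtering
    "top_p": 1.0,           # nucleus-sampling threshold
    "max_new_tokens": 64,
}

# Hypothetical prompt standing in for a query from the RLHF dataset.
query = tokenizer("A prompt from the RLHF dataset:", return_tensors="pt")
response_ids = actor.generate(**query, **generation_kwargs)
print(tokenizer.decode(response_ids[0], skip_special_tokens=True))
```

Would tuning settings like these during the actor's generation step be the right lever here, or is the fluctuation more likely to come from somewhere else in the PPO setup?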
Thank you!