Description
Thank you for the great work and thank you for the findings on tuning PPO!
I tried RLHF with a customized deterministic reward function. However, the training rewards fluctuate. May I ask for your suggestions on why this happens? I suspect it is because the actor does not explore enough. I am wondering whether we need to tune the generation kwargs when the actor generates its outputs, for example the temperature. A rough sketch of what I mean is below.
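
To illustrate, here is a minimal sketch (not any library's exact API) of how one might adjust the actor's sampling settings during rollout to encourage exploration. The model name, prompt, and the values in `generation_kwargs` are assumptions for illustration only.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical actor model for illustration; in practice this would be the
# policy being trained with PPO.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
actor = AutoModelForCausalLM.from_pretrained("gpt2")

# Sampling (rather than greedy decoding) with a higher temperature makes the
# actor's rollouts more diverse, which may help exploration but can also add
# variance to the reward signal.
generation_kwargs = {
    "do_sample": True,      # sample instead of greedy decoding
    "temperature": 1.0,     # >1.0 flattens the distribution, <1.0 sharpens it
    "top_k": 0,             # 0 disables top-k filtering
    "top_p": 1.0,           # nucleus-sampling threshold
    "max_new_tokens": 64,
}

# Hypothetical prompt standing in for a query from the RLHF dataset.
query = tokenizer("A prompt from the RLHF dataset:", return_tensors="pt")
response_ids = actor.generate(**query, **generation_kwargs)
print(tokenizer.decode(response_ids[0], skip_special_tokens=True))
```

Would tuning settings like these during the actor's generation step be the right lever here, or is the fluctuation more likely to come from somewhere else in the PPO setup?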
Thank you!