Hyper-param tuning for PPO #532

Open
@luzai

Description

Thank you for the great work, and for the findings on tuning PPO!

I tried RLHF with a customized deterministic reward function. However, the training rewards are fluctuating. May I ask for your suggestions on why this happens? I suspect it is because the actor did not explore enough. I am wondering whether we need to tune the generation kwargs when the actor generates the output, for example by tuning the temperature.

Thank you!
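For context on why the temperature in the generation kwargs affects exploration: the actor samples tokens from a softmax over logits scaled by 1/temperature, so a higher temperature flattens the distribution and spreads probability mass over more tokens. A minimal pure-Python sketch (an illustration, not DeepSpeed Chat code):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale logits by 1/temperature before the softmax; higher
    # temperature flattens the distribution, encouraging exploration.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)  # low T: close to greedy
flat = softmax_with_temperature(logits, 2.0)   # high T: more exploration

# The probability mass on the top token drops as temperature rises.
print(max(sharp) > max(flat))  # True
```

With sampling enabled, raising the temperature (or using top-p sampling) makes the actor visit more of the response space, which can change both the variance and the trajectory of the PPO reward curve.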

Metadata

Labels

deespeed chat (DeepSpeed Chat), modeling (Related to modeling questions)
