Thanks for your fantastic work!
It seems that the Wandb link only provides training logs for SFT, not for PPO. Does the provided Wandb link show the PPO training process, or is the PPO run (after SFT) logged elsewhere?
Using the default parameters in train_ppo.sh and the same environment, it is difficult to reproduce the PPO results; training does not even converge.
Could you give me some advice?