wandb and difficult to reproduce #12

@lujiarui-iie

Thanks for your fantastic work!

It seems that the Wandb link only provides training information for SFT, not for PPO. Does the provided Wandb link show the PPO training process? Or does PPO training take place after SFT?

I am having difficulty reproducing the PPO results using the default parameters in train_ppo.sh and the same environment; training does not even converge.

Can you give me some advice?
