Thanks for your fantastic work!
It seems that the Wandb link only provides training logs for SFT, not for PPO. Does the provided Wandb link show the PPO training process, or is the PPO run (after SFT) logged elsewhere?
Using the default parameters in train_ppo.sh and the same environment, it is difficult to reproduce the PPO results; training does not even converge.
Could you give me some advice?