Open
Description
Many thanks for your great work! I am following your provided training code and scripts, trying to reproduce the actor-opt-1.3b-critic-opt-350m results. However, the reward curve always stays around ~-4, and the reward is already bad at the very beginning of training. I have run a small qualitative study on the experiences collected at step 0:
Is this behaviour expected? Is this issue related to the tokenizer, or should we mask out everything after the "<|endoftext|>" token in the model-generated results?
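To make the question concrete, this is roughly the kind of masking I have in mind — just a minimal sketch in PyTorch; the helper name and where it would be applied are my own assumptions, not something taken from the repo:

```python
import torch

def mask_after_eos(gen_ids: torch.Tensor,
                   attention_mask: torch.Tensor,
                   eos_token_id: int) -> torch.Tensor:
    """Zero out the attention mask for every position after the first
    <|endoftext|> in each generated sequence (hypothetical helper)."""
    masked = attention_mask.clone()
    for i in range(gen_ids.size(0)):
        eos_positions = (gen_ids[i] == eos_token_id).nonzero(as_tuple=True)[0]
        if eos_positions.numel() > 0:
            # keep the EOS token itself, drop everything generated after it
            masked[i, eos_positions[0] + 1:] = 0
    return masked
```

If something like this is needed before computing rewards/log-probs on the collected experiences, please let me know.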
I have compared the training logs and done a very simple qualitative check of the step 1 and step 2 training, and I did not find any obvious bugs.
Many thanks in advance!!