PPO training unable to reproduce the training log provided #474

Open
@REIGN12

Description

Many thanks for your great work! I am following the provided training code and scripts, trying to reproduce the actor-opt-1.3b-critic-opt-350m results. However, the reward curve always stays around ~-4, and the reward is already bad at the very beginning of training. I ran a quick qualitative study on the experiences collected at step 0:
[Screenshots: samples from the step-0 collected experiences]
Is this behaviour expected? Is this issue related to the tokenizer, or should we mask out everything after the "<|endoftext|>" token in the model-generated results?
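
For reference, this is the kind of masking I have in mind, as a minimal sketch in plain PyTorch (the function name is mine, and I am assuming `eos_token_id` is the OPT tokenizer's id for `<|endoftext|>`):

```python
import torch

def mask_after_first_eos(seq: torch.Tensor, eos_token_id: int) -> torch.Tensor:
    """Keep tokens up to and including the first <|endoftext|> in each row
    of `seq` (batch, seq_len); zero out everything generated after it."""
    is_eos = seq == eos_token_id
    # number of EOS tokens seen up to and including each position
    eos_seen = torch.cumsum(is_eos.int(), dim=1)
    # positions strictly before the first EOS, plus the first EOS itself
    keep = (eos_seen == 0) | (is_eos & (eos_seen == 1))
    return keep.long()
```

The resulting mask could then be multiplied into the action/loss mask so that tokens after `<|endoftext|>` contribute neither to the reward nor to the PPO loss.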
I have also compared the training logs and done a very simple qualitative check on the step1 and step2 training, and I do not see any obvious bugs.
Many thanks in advance!!

Labels

deespeed chat (DeepSpeed Chat) · modeling (Related to modeling questions.)
