Open
Description
Many thanks for your great work! I am following your provided training code and scripts, trying to reproduce the actor-opt-1.3b-critic-opt-350m results. However, the reward curve always stays around ~-4, and the reward is already bad at the very beginning of training. I have run a small qualitative study on the experiences collected at step 0:
Is this behaviour expected? Is this issue related to the tokenizer, or should we mask out everything after the "<|endoftext|>" token in the model-generated results?
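To make the question concrete, this is roughly the kind of masking I have in mind — just a minimal sketch in PyTorch; the helper name and where it would be applied are my own assumptions, not something taken from the repo:

```python
import torch

def mask_after_eos(gen_ids: torch.Tensor,
                   attention_mask: torch.Tensor,
                   eos_token_id: int) -> torch.Tensor:
    """Zero out the attention mask for every position after the first
    <|endoftext|> in each generated sequence (hypothetical helper)."""
    masked = attention_mask.clone()
    for i in range(gen_ids.size(0)):
        eos_positions = (gen_ids[i] == eos_token_id).nonzero(as_tuple=True)[0]
        if eos_positions.numel() > 0:
            # keep the EOS token itself, drop everything generated after it
            masked[i, eos_positions[0] + 1:] = 0
    return masked
```

If something like this is needed before computing rewards/log-probs on the collected experiences, please let me know.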
I have compared the training logs and done a very simple qualitative check of the step 1 and step 2 training, and I did not find any obvious bugs.
Many thanks in advance!!