Hi,
Usually, when serving the model, we do not accumulate thinking; that is, the thinking content from turns 1 to t - 1 is removed when generating the response for turn t. In verl, however, the final training sample contains everything (the thinking content and responses for every turn). This leads to a train-serve mismatch.
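For concreteness, here is a minimal sketch of the serve-time behavior I mean, assuming thinking is delimited by `<think>...</think>` tags in assistant messages (the tag format and the `strip_thinking` helper are my own illustration, not verl code):

```python
import re

def strip_thinking(messages):
    """Drop <think>...</think> blocks from every assistant turn in the
    history before generating the next response (turn t)."""
    cleaned = []
    for m in messages:
        if m["role"] == "assistant":
            content = re.sub(r"<think>.*?</think>\s*", "", m["content"], flags=re.DOTALL)
            cleaned.append({**m, "content": content})
        else:
            cleaned.append(m)
    return cleaned
```

At serve time the model only ever conditions on the stripped history, while verl's training sample keeps the raw one.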
I looked around in verl and did not find an existing solution to this. An ideal solution would be to "explode" each sample into T sub-samples, where T is the number of assistant turns in the sample. Within each sub-sample, the thinking content of all previous assistant turns is stripped.
The reward and advantage calculation should still be sample-based, with the resulting values propagated to the exploded sub-samples.
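To make the proposal concrete, here is a rough sketch of the explode step, assuming `<think>...</think>` tags and a single sample-level advantage value (function names and the message dict format are hypothetical, not existing verl APIs):

```python
import re

def explode_sample(messages, advantage):
    """Split one multi-turn sample into one sub-sample per assistant turn.
    In sub-sample t, thinking is stripped from assistant turns before t,
    and the sample-level advantage is copied onto every sub-sample."""
    def strip(content):
        return re.sub(r"<think>.*?</think>\s*", "", content, flags=re.DOTALL)

    assistant_idxs = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    sub_samples = []
    for t in assistant_idxs:
        context = [
            {**m, "content": strip(m["content"])}
            if m["role"] == "assistant" and i < t else m
            for i, m in enumerate(messages[: t + 1])
        ]
        # Presumably only the final (un-stripped) assistant turn would be
        # loss-masked; the sample-level advantage is broadcast as-is.
        sub_samples.append({"messages": context, "advantage": advantage})
    return sub_samples
```

This keeps the reward/advantage computation per-sample while the policy gradient is taken on contexts that match what the model actually saw at serve time.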
Can anyone confirm that this is indeed a problem in verl, and help address it?
Thanks in advance