Hi,
Usually, when serving the model, we do not accumulate thinking; that is, the thinking content from turns 1 to t - 1 is removed when generating the response for turn t. In verl, however, the final training sample contains everything (the thinking content and responses for every turn). This leads to a train-serve mismatch.
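For concreteness, here is a minimal sketch of the serve-time behavior I mean, assuming thinking is delimited by `<think>...</think>` tags in assistant messages (the tag format and the `strip_thinking` helper are my own illustration, not verl code):

```python
import re

def strip_thinking(messages):
    """Drop <think>...</think> blocks from every assistant turn in the
    history before generating the next response (turn t)."""
    cleaned = []
    for m in messages:
        if m["role"] == "assistant":
            content = re.sub(r"<think>.*?</think>\s*", "", m["content"], flags=re.DOTALL)
            cleaned.append({**m, "content": content})
        else:
            cleaned.append(m)
    return cleaned
```

At serve time the model only ever conditions on the stripped history, while verl's training sample keeps the raw one.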
I looked around in verl and did not find an existing solution to this. An ideal solution would be to "explode" each sample into T sub-samples, where T is the number of assistant turns in the sample. Within each sub-sample, the thinking content of all previous assistant turns is stripped.
The reward and advantage calculation should still be sample-based, with the resulting values propagated to the exploded sub-samples.
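To make the proposal concrete, here is a rough sketch of the explode step, assuming `<think>...</think>` tags and a single sample-level advantage value (function names and the message dict format are hypothetical, not existing verl APIs):

```python
import re

def explode_sample(messages, advantage):
    """Split one multi-turn sample into one sub-sample per assistant turn.
    In sub-sample t, thinking is stripped from assistant turns before t,
    and the sample-level advantage is copied onto every sub-sample."""
    def strip(content):
        return re.sub(r"<think>.*?</think>\s*", "", content, flags=re.DOTALL)

    assistant_idxs = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    sub_samples = []
    for t in assistant_idxs:
        context = [
            {**m, "content": strip(m["content"])}
            if m["role"] == "assistant" and i < t else m
            for i, m in enumerate(messages[: t + 1])
        ]
        # Presumably only the final (un-stripped) assistant turn would be
        # loss-masked; the sample-level advantage is broadcast as-is.
        sub_samples.append({"messages": context, "advantage": advantage})
    return sub_samples
```

This keeps the reward/advantage computation per-sample while the policy gradient is taken on contexts that match what the model actually saw at serve time.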
Can anyone confirm that this is indeed a problem in verl, and help address it?
Thanks in advance