
Train serve mismatch when thinking content is not accumulated in multi-turn training #5576

@sneng1997

Description


Hi,
Usually when serving the model, we do not accumulate thinking: the thinking content from turns 1 through t - 1 is removed when generating the response for turn t. In verl, however, the final training sample contains everything (the thinking content and responses for every turn). This creates a train-serve mismatch.

I looked around in verl and did not find an existing solution to this. An ideal solution would be to "explode" each sample into T sub-samples, where T is the number of assistant turns in the sample. Within each sub-sample, the thinking content of all previous turns is stripped.

The reward and advantage calculation should still be sample-based, and the resulting values should be propagated to the exploded sub-samples.
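To make the proposal concrete, here is a minimal sketch of the "explode" step. This is not verl code: the `<think>...</think>` tag format, the message-dict schema, and the function names are all assumptions for illustration. Sub-sample t keeps thinking only in its final assistant turn, mirroring what the model sees at serving time, and the sample-level reward is copied to every sub-sample.

```python
import re
from typing import Dict, List

# Assumed thinking delimiter; real chat templates may differ.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks from a message."""
    return THINK_RE.sub("", text)


def explode_sample(messages: List[Dict[str, str]], reward: float) -> List[Dict]:
    """Split one multi-turn sample into T sub-samples (T = assistant turns).

    In sub-sample t, thinking is kept only in the current (last) assistant
    turn; earlier assistant turns have it stripped, matching serving-time
    context. The sample-level reward is propagated to every sub-sample.
    """
    assistant_idx = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    sub_samples = []
    for t in assistant_idx:
        context = []
        for i, m in enumerate(messages[: t + 1]):
            if m["role"] == "assistant" and i < t:
                context.append({"role": m["role"], "content": strip_thinking(m["content"])})
            else:
                context.append(dict(m))
        sub_samples.append({"messages": context, "reward": reward})
    return sub_samples
```

During training, only the tokens of the final assistant turn in each sub-sample would receive a loss mask, so each thinking segment is trained on exactly once, under the same context it would see at serving time.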

Can anyone confirm that this is indeed a problem in verl and help with this?

Thanks in advance
