Description
When we use the BLOOM model to train the reward model, the loss may always be NaN. This happens because "end_ind" in the reward model is not computed correctly, so "divergence_ind" is always greater than "end_ind" and the corresponding "chosen_reward" and "rejected_reward" can never be obtained.
However, since BLOOM uses left padding, the end position of "chosen" or "rejected" is always the last index of the input ids. Therefore, we can simply set "end_ind = seq_len" in the forward function of the reward model, so that the end position is always obtained correctly.
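
For illustration, here is a minimal sketch of the relevant part of a reward model forward pass. The helper name "pairwise_reward_loss" and the exact tensor shapes are assumptions and not the upstream implementation, but it shows why "divergence_ind >= end_ind" produces a NaN loss and how setting "end_ind = seq_len" avoids it for left-padded models.

```python
import torch
import torch.nn.functional as F


def pairwise_reward_loss(chosen_ids, rejected_ids, chosen_rewards, rejected_rewards,
                         pad_token_id, left_padding=True):
    """Sketch of the pairwise loss step for one (chosen, rejected) pair.

    chosen_ids / rejected_ids: (seq_len,) token ids.
    chosen_rewards / rejected_rewards: (seq_len,) per-token reward values.
    Variable names mirror the issue ("divergence_ind", "end_ind"); the real
    reward model implementation may differ in details.
    """
    seq_len = chosen_ids.size(0)

    # divergence_ind: first position where chosen and rejected differ.
    diff = (chosen_ids != rejected_ids).nonzero()
    divergence_ind = diff[0].item() if len(diff) > 0 else seq_len - 1

    if left_padding:
        # With left padding (e.g. BLOOM) the sequence always ends at the last
        # index, so end_ind can simply be seq_len (the proposed fix).
        end_ind = seq_len
    else:
        # With right padding the sequence ends at the first pad token.
        pads = (chosen_ids == pad_token_id).nonzero()
        end_ind = pads[0].item() if len(pads) > 0 else seq_len

    # If divergence_ind >= end_ind the slices below are empty and the mean
    # is NaN, which is the failure mode described above.
    c_truncated = chosen_rewards[divergence_ind:end_ind]
    r_truncated = rejected_rewards[divergence_ind:end_ind]
    loss = -F.logsigmoid(c_truncated - r_truncated).mean()
    return loss
```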