Description
When we use the BLOOM model to train the reward model, the loss may always be NaN. This happens because "end_ind" in the reward model is not computed correctly, so "divergence_ind" is always greater than "end_ind" and the corresponding "chosen_reward" and "rejected_reward" can never be obtained.
However, since BLOOM uses left padding, the end position of "chosen" or "rejected" is always the last index of the input ids. Therefore, we can simply set "end_ind = seq_len" in the forward function of the reward model, so that the end position is always obtained correctly.
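
For illustration, here is a minimal sketch of the relevant part of a reward model forward pass. The helper name "pairwise_reward_loss" and the exact tensor shapes are assumptions and not the upstream implementation, but it shows why "divergence_ind >= end_ind" produces a NaN loss and how setting "end_ind = seq_len" avoids it for left-padded models.

```python
import torch
import torch.nn.functional as F


def pairwise_reward_loss(chosen_ids, rejected_ids, chosen_rewards, rejected_rewards,
                         pad_token_id, left_padding=True):
    """Sketch of the pairwise loss step for one (chosen, rejected) pair.

    chosen_ids / rejected_ids: (seq_len,) token ids.
    chosen_rewards / rejected_rewards: (seq_len,) per-token reward values.
    Variable names mirror the issue ("divergence_ind", "end_ind"); the real
    reward model implementation may differ in details.
    """
    seq_len = chosen_ids.size(0)

    # divergence_ind: first position where chosen and rejected differ.
    diff = (chosen_ids != rejected_ids).nonzero()
    divergence_ind = diff[0].item() if len(diff) > 0 else seq_len - 1

    if left_padding:
        # With left padding (e.g. BLOOM) the sequence always ends at the last
        # index, so end_ind can simply be seq_len (the proposed fix).
        end_ind = seq_len
    else:
        # With right padding the sequence ends at the first pad token.
        pads = (chosen_ids == pad_token_id).nonzero()
        end_ind = pads[0].item() if len(pads) > 0 else seq_len

    # If divergence_ind >= end_ind the slices below are empty and the mean
    # is NaN, which is the failure mode described above.
    c_truncated = chosen_rewards[divergence_ind:end_ind]
    r_truncated = rejected_rewards[divergence_ind:end_ind]
    loss = -F.logsigmoid(c_truncated - r_truncated).mean()
    return loss
```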