Open
Description
When training the PPO model, I enabled gradient_checkpointing_enable. If the ptx loss is computed, the actor runs forward twice. In your code these two losses are each backwarded separately, which causes no problem. However, if I sum the two losses and then call the engine's backward once, I get the error "gradient computed twice for this partition". If I don't enable gradient_checkpointing_enable, the error does not occur. It seems to appear only in DeepSpeed's ZeRO mode, and I don't know why.
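For illustration, here is a minimal sketch of the two backward patterns in plain PyTorch. The model, shapes, and loss names are made up; plain `loss.backward()` stands in for the DeepSpeed engine's backward, and the failing pattern is only shown as a comment because the error is specific to a ZeRO-partitioned engine:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
# toy stand-in for the actor model (shapes are arbitrary)
actor = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))
opt = torch.optim.SGD(actor.parameters(), lr=0.1)

def ckpt_forward(x):
    # gradient checkpointing: activations are recomputed during backward
    return checkpoint(actor, x, use_reentrant=False)

# two forward passes through the same actor, as in PPO + ptx training
actor_loss = ckpt_forward(torch.randn(2, 4)).mean()  # RL batch
ptx_loss = ckpt_forward(torch.randn(2, 4)).mean()    # pretraining batch

# pattern that works: one backward call per forward pass, so each
# checkpointed forward is paired with exactly one recomputation
actor_loss.backward()
ptx_loss.backward()
opt.step()

# failing pattern (with a DeepSpeed engine under ZeRO, per this issue):
#   engine.backward(actor_loss + ptx_loss)
#   -> "gradient computed twice for this partition"
```

With a single summed backward, one backward traversal reaches both checkpointed forwards, so each parameter partition receives a gradient from both recomputed subgraphs in the same pass, which is presumably what ZeRO's reduction bookkeeping rejects.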