Thank you for the great work!
The KL rewards seem to be recomputed on every call to train_rlhf():
```python
def train_rlhf(self, inputs):
    # train the rlhf mode here
    ### process the old outputs
    prompts = inputs['prompts']
    log_probs = inputs['logprobs']
    ref_log_probs = inputs['ref_logprobs']
    reward_score = inputs['rewards']
    values = inputs['value']
    attention_mask = inputs['attention_mask']
    seq = inputs['input_ids']
    start = prompts.size()[-1] - 1
    action_mask = attention_mask[:, 1:]
    old_values = values
    with torch.no_grad():
        old_rewards = self.compute_rewards(prompts, log_probs,
                                           ref_log_probs, reward_score,
                                           action_mask)
```
Both `log_probs` and `ref_log_probs` come from the experience buffer, which means `old_rewards` is always the same for a given episode? Did I make any mistake?
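
If I understand correctly, `compute_rewards` is essentially per-token KL-penalty shaping plus the clipped reward-model score added on the last generated token, roughly like the sketch below (a minimal sketch; the `kl_ctl` and `clip_reward_value` hyperparameters and default values are my assumption, not quoted from the repo):

```python
import torch

def compute_rewards(prompts, log_probs, ref_log_probs, reward_score,
                    action_mask, kl_ctl=0.1, clip_reward_value=5.0):
    # Per-token KL penalty between the actor and the frozen reference policy.
    rewards = -kl_ctl * (log_probs - ref_log_probs)

    start = prompts.shape[1] - 1                      # first generated-token position
    ends = start + action_mask[:, start:].sum(1) + 1  # end index of each response

    # Clip the scalar reward-model score and add it to the last generated token.
    reward_clip = torch.clamp(reward_score, -clip_reward_value, clip_reward_value)
    for j in range(log_probs.shape[0]):
        rewards[j, start:ends[j]][-1] += reward_clip[j]
    return rewards
```

If that is the case, every argument comes straight from the buffer, so the output should be identical each time `train_rlhf()` is called on the same episode.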