
Rewards in ppo seem to be recomputed many times #528

Open
@dwyzzy


Thank you for the great work!
The KL rewards seem to be recomputed every time train_rlhf() is called. [code]

    def train_rlhf(self, inputs):
        # train the rlhf mode here
        ### process the old outputs
        prompts = inputs['prompts']
        log_probs = inputs['logprobs']
        ref_log_probs = inputs['ref_logprobs']
        reward_score = inputs['rewards']
        values = inputs['value']
        attention_mask = inputs['attention_mask']
        seq = inputs['input_ids']

        start = prompts.size()[-1] - 1
        action_mask = attention_mask[:, 1:]

        old_values = values
        with torch.no_grad():
            old_rewards = self.compute_rewards(prompts, log_probs,
                                               ref_log_probs, reward_score,
                                               action_mask)

Both log_probs and ref_log_probs come from the experience buffer, which means old_rewards is always the same for a given episode?
Did I make a mistake somewhere?
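
For reference, the KL reward depends only on tensors that are already in the buffer (log_probs, ref_log_probs, reward_score, attention_mask), so its value is fixed for a given experience batch. Below is a minimal sketch of that computation and of caching it once per episode, assuming hypothetical KL_CTL / CLIP_REWARD_VALUE constants and a compute_kl_rewards helper (not the exact DeepSpeed Chat code):

    import torch

    # Hypothetical constants for illustration; in the trainer these would be
    # attributes such as a KL coefficient and a reward clip value.
    KL_CTL = 0.1             # KL penalty coefficient
    CLIP_REWARD_VALUE = 5.0  # bound for the scalar reward-model score

    def compute_kl_rewards(prompts, log_probs, ref_log_probs, reward_score, action_mask):
        """Per-token KL penalty, with the clipped reward-model score added at the
        last response token. All inputs come from the experience buffer, so the
        result is deterministic for a fixed batch."""
        rewards = -KL_CTL * (log_probs - ref_log_probs)
        start = prompts.size(-1) - 1
        # Number of response tokens per sample (assumes right padding).
        ends = start + action_mask[:, start:].sum(dim=1)
        reward_clip = torch.clamp(reward_score, -CLIP_REWARD_VALUE, CLIP_REWARD_VALUE)
        for j in range(log_probs.size(0)):
            rewards[j, ends[j] - 1] += reward_clip[j]
        return rewards

If something like this were computed once when the experience is generated and stored alongside the other buffered tensors (e.g. inputs['kl_rewards']), train_rlhf() could read it back instead of calling compute_rewards() again on every PPO epoch.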

Labels: deepspeed chat, modeling
