Thank you for the great work!
The KL rewards seem to be recomputed on every call to train_rlhf():
```python
def train_rlhf(self, inputs):
    # train the rlhf mode here
    ### process the old outputs
    prompts = inputs['prompts']
    log_probs = inputs['logprobs']
    ref_log_probs = inputs['ref_logprobs']
    reward_score = inputs['rewards']
    values = inputs['value']
    attention_mask = inputs['attention_mask']
    seq = inputs['input_ids']
    start = prompts.size()[-1] - 1
    action_mask = attention_mask[:, 1:]
    old_values = values
    with torch.no_grad():
        old_rewards = self.compute_rewards(prompts, log_probs,
                                           ref_log_probs, reward_score,
                                           action_mask)
```
Both `log_probs` and `ref_log_probs` come from the experience buffer, which means `old_rewards` is always the same for a given episode? Did I make any mistake?
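
If I understand correctly, `compute_rewards` is essentially per-token KL-penalty shaping plus the clipped reward-model score added on the last generated token, roughly like the sketch below (a minimal sketch; the `kl_ctl` and `clip_reward_value` hyperparameters and default values are my assumption, not quoted from the repo):

```python
import torch

def compute_rewards(prompts, log_probs, ref_log_probs, reward_score,
                    action_mask, kl_ctl=0.1, clip_reward_value=5.0):
    # Per-token KL penalty between the actor and the frozen reference policy.
    rewards = -kl_ctl * (log_probs - ref_log_probs)

    start = prompts.shape[1] - 1                      # first generated-token position
    ends = start + action_mask[:, start:].sum(1) + 1  # end index of each response

    # Clip the scalar reward-model score and add it to the last generated token.
    reward_clip = torch.clamp(reward_score, -clip_reward_value, clip_reward_value)
    for j in range(log_probs.shape[0]):
        rewards[j, start:ends[j]][-1] += reward_clip[j]
    return rewards
```

If that is the case, every argument comes straight from the buffer, so the output should be identical each time `train_rlhf()` is called on the same episode.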