Description
Hi,
I don't know if this question has already been asked and answered, but I'm not sure how the scores that the make_step_rewards() function computes from the logits and the token mask are supposed to be used when scoring the output of a Llama-3.1-8B-Instruct model with ReasonFlux-7B-PRM.
Following your code snippet for using ReasonFlux-7B-PRM, this is what I did:
messages = [
    {"role": "user", "content": question},
    # the Llama model's steps joined with the <extra_0> step separator
    {"role": "assistant", "content": "<extra_0>".join(ans_list) + "<extra_0>"},
]
# build the PRM input string without appending a generation prompt
conversation_str = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False
)
input_ids = tokenizer.encode(
    conversation_str, return_tensors="pt"
).to(self.model.device)
Here ans_list is the list of answers from the Llama model. The shape of input_ids is [1, K], where K is the number of token IDs, and the encoding of "<extra_0>", i.e., step_sep_id, is 151651. Passing this [1, K] tensor to the PRM produces a [1, K, 152064] logits tensor. Passing that tensor together with the token_masks tensor to make_step_rewards() returns a massive list with 380160 elements, all with very small probability values; the min and max values in this list are ~4.31e-13 and ~0.01.
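For completeness, this is roughly how I build the token mask and call make_step_rewards(); the exact variable names differ slightly in my code, and the mask construction is my own based on the snippet in the model card, so treat it as a sketch rather than the literal code:

import torch

# step separator id; tokenizer.encode("<extra_0>") gives [151651] for me
step_sep_id = tokenizer.encode("<extra_0>")[0]
# boolean mask, True at each <extra_0> position in the [1, K] input
token_masks = (input_ids == step_sep_id)
with torch.no_grad():
    # logits come back as [1, K, 152064] in my run
    logits = self.model(input_ids=input_ids)[0]
step_rewards = make_step_rewards(logits, token_masks)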
Given this, how is the list of rewards returned by make_step_rewards() expected to be used to score the Llama model's answers? Thanks.