
Using ReasonFlux-7B-PRM to score a model #20

@ska278

Hi,

I'm not sure whether this has already been asked and answered, but I don't understand how the scores that make_step_rewards() computes from the logits and the token mask are supposed to be used when scoring the output of a Llama-3.1-8B-Instruct model with ReasonFlux-7B-PRM.

Following your code snippet for using ReasonFlux-7B-PRM, this is what I did:
messages = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": "<extra_0>".join(ans_list) + "<extra_0>"},
]
conversation_str = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False
)
input_ids = tokenizer.encode(
    conversation_str, return_tensors="pt"
).to(self.model.device)

Here ans_list is the list of answers from the Llama model. I see that the shape of input_ids is [1, K], where K is the number of token IDs, and that the encoding of "<extra_0>", i.e. step_sep_id, is 151651. Passing this [1, K] tensor to the PRM produces a [1, K, 152064] logits tensor. Passing that tensor and the token_masks tensor to make_step_rewards() returns a massive list of 380160 elements with very small probability values; the min and max values in the list are ~4.31e-13 and ~0.01.
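Concretely, the scoring step looks roughly like this (a minimal sketch of what I described above; I'm assuming the <extra_0> lookup, the token_masks construction, and the make_step_rewards() helper from your usage example):

step_sep_id = tokenizer.encode("<extra_0>")[0]          # 151651 in my case
token_masks = (input_ids == step_sep_id)                # True at each step-separator position
outputs = self.model(input_ids=input_ids)
logits = outputs[0]                                     # comes out as [1, K, 152064] here
step_rewards = make_step_rewards(logits, token_masks)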

Given this, how is the list of rewards returned by make_step_rewards() expected to be used to score the answers from the Llama model? Thanks.
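To clarify what I mean by "scoring": I expected a short list with one probability per <extra_0> separator, which I could then reduce to a single score per answer, along these lines (a purely hypothetical aggregation, just to illustrate the intent):

# step_rewards[0] would hold one probability per step separator; averaging them
# (or taking the min or product) would then give one score for the whole answer.
answer_score = sum(step_rewards[0]) / len(step_rewards[0])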
