Description
Hi,
I don't know if this question has already been asked and answered, but I'm not sure how the scores that the make_step_rewards() function computes from the logits and the token mask are supposed to be used when scoring the output of a Llama-3.1-8B-Instruct model with ReasonFlux-7B-PRM.
Following your code snippet for using ReasonFlux-7B-PRM, this is what I did:
messages = [
    {"role": "user", "content": question},
    # the Llama model's steps joined with the <extra_0> step separator
    {"role": "assistant", "content": "<extra_0>".join(ans_list) + "<extra_0>"},
]
# build the PRM input string without appending a generation prompt
conversation_str = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False
)
input_ids = tokenizer.encode(
    conversation_str, return_tensors="pt"
).to(self.model.device)
Here ans_list is the list of answers from the Llama model. The shape of input_ids is [1, K], where K is the number of token IDs, and the encoding of "<extra_0>", i.e., step_sep_id, is 151651. Passing this [1, K] tensor to the PRM produces a [1, K, 152064] logits tensor. Passing that tensor together with the token_masks tensor to make_step_rewards() returns a massive list with 380160 elements, all with very small probability values; the min and max values in this list are ~4.31e-13 and ~0.01.
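For completeness, this is roughly how I build the token mask and call make_step_rewards(); the exact variable names differ slightly in my code, and the mask construction is my own based on the snippet in the model card, so treat it as a sketch rather than the literal code:

import torch

# step separator id; tokenizer.encode("<extra_0>") gives [151651] for me
step_sep_id = tokenizer.encode("<extra_0>")[0]
# boolean mask, True at each <extra_0> position in the [1, K] input
token_masks = (input_ids == step_sep_id)
with torch.no_grad():
    # logits come back as [1, K, 152064] in my run
    logits = self.model(input_ids=input_ids)[0]
step_rewards = make_step_rewards(logits, token_masks)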
Given this, how is the list of rewards returned by make_step_rewards() expected to be used to score the Llama model's answers? Thanks.