
Issue running single judgement with references #20

@sersoage

Hi guys, first of all thank you for the great paper. I am trying the single-answer scenario, where I have a question, a model-generated answer, and a reference answer. Looking at the code, I am using gen_model_judgement_single.py.
The first thing I did was to generate the answer dataset in the desired format:

```python
{
    "question_id": i,
    "question_body": question["question"],
    "decoding_method": "top_p_sampling",  # placeholder value
    "model": "alpaca-native",             # placeholder value
    "text": answer,
    "scores": {"logprobs": -7.0179795026779175},  # placeholder
}
```
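For context, here is a minimal sketch of how I build and write these entries; the `questions`/`answers` inputs and the output file name are stand-ins for my real data:

```python
import json

# Stand-in inputs; in my real script these come from my own dataset.
questions = [{"question": "What is the capital of France?"}]
answers = ["Paris is the capital of France."]

# Illustrative file name; one JSON object per line (JSONL).
with open("combined_questions_answers.jsonl", "w") as f:
    for i, (question, answer) in enumerate(zip(questions, answers)):
        entry = {
            "question_id": i,
            "question_body": question["question"],
            "decoding_method": "top_p_sampling",  # placeholder value
            "model": "alpaca-native",             # placeholder value
            "text": answer,
            "scores": {"logprobs": -7.0179795026779175},  # placeholder
        }
        f.write(json.dumps(entry) + "\n")
```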
I also generated the reference answer dataset like this:

```python
combined_entry = {
    "question_id": i,
    "question_body": question["question"],
    "decoding_method": "top_p_sampling",  # placeholder value
    "model": "alpaca-native",             # placeholder value
    "reference": {
        "text": answer  # you can update this with the correct reference text
    },
    "scores": {
        "logprobs": -7.0179795026779175  # placeholder
    },
}
```
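And the matching sketch for the reference file (again, the names and values are illustrative; the real loop runs over my whole dataset):

```python
import json

# Stand-in single pair; the real loop runs over the whole dataset.
i = 0
question = {"question": "What is 2 + 2?"}
answer = "2 + 2 equals 4."

combined_entry = {
    "question_id": i,
    "question_body": question["question"],
    "decoding_method": "top_p_sampling",  # placeholder value
    "model": "alpaca-native",             # placeholder value
    "reference": {"text": answer},        # reference text nested under "reference"
    "scores": {"logprobs": -7.0179795026779175},  # placeholder
}

with open("combined_questions_answers_ref.jsonl", "w") as f:
    f.write(json.dumps(combined_entry) + "\n")
```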
Then, as stated in the repo, I ran judgelm_preprocess.py, which generated a JSONL file with the following format:

```json
{"question_id": 0, "score": [{"logprobs": -7.0179795026779175}, {"logprobs": -7.0179795026779175}], "question_body": "question", "answer1_body": "generated answer", "answer2_body": "reference answer", "answer1_model_id": "alpaca-native", "answer2_model_id": "alpaca-native", "answer1_metadata": {"decoding_method": "top_p_sampling"}, "answer2_metadata": {"decoding_method": "top_p_sampling"}}
```
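As a sanity check, I parse one merged line back and confirm the layout (field names and placeholder values taken from the sample above):

```python
import json

# One merged record in the shape produced above (values are placeholders).
merged_line = json.dumps({
    "question_id": 0,
    "score": [{"logprobs": -7.0179795026779175}, {"logprobs": -7.0179795026779175}],
    "question_body": "question",
    "answer1_body": "generated answer",
    "answer2_body": "reference answer",
    "answer1_model_id": "alpaca-native",
    "answer2_model_id": "alpaca-native",
    "answer1_metadata": {"decoding_method": "top_p_sampling"},
    "answer2_metadata": {"decoding_method": "top_p_sampling"},
})

record = json.loads(merged_line)
# In this layout the reference answer ends up in slot 2.
assert record["answer1_body"] == "generated answer"
assert record["answer2_body"] == "reference answer"
assert len(record["score"]) == 2
```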
My first question: is it OK for `answer2_body` to be the reference answer?

Then, having this dataset, I ran:

```shell
python ./judgelm/llm_judge/gen_model_judgement_single.py \
    --model-path "BAAI/JudgeLM-7B-v1.0" \
    --model-id 7b-full-model \
    --question-file /root/JudgeLM/judgelm/data/judgelm-val-5k-judge-samples.jsonl \
    --answer-file /root/JudgeLM/judgelm/data/JudgeLM/output \
    --num-gpus-per-model 1 \
    --num-gpus-total 1 \
    --temperature 0 \
    --reference-file /root/JudgeLM/judgelm/data/JudgeLM/combined_questions_answers_ref.jsonl \
    --if-fast-eval 1
```
The first issue I ran into: since I was using references, the copy function of the conversation template expects an answer_num argument, but this is the single-answer case, so I had to change this line:

```python
conv = conv_judge_single.copy() if references is None else conv_judge_single_w_reference.copy()
```

to this:

```python
conv = conv_judge_single.copy() if references is None else conv_judge_single_w_reference.copy(answer_num=answer_num_value)
```

passing 1 as answer_num_value. So I do not know whether this is a bug, or whether my change is OK.
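To show the pattern I mean, here is a minimal hypothetical stand-in for the conversation template (not the actual JudgeLM class): `copy()` optionally takes `answer_num` and falls back to the template's default.

```python
from dataclasses import dataclass, replace

@dataclass
class Conversation:
    # Hypothetical minimal stand-in for the JudgeLM conversation template.
    system: str = "You are a helpful and precise judge."
    answer_num: int = 2  # pairwise judging by default

    def copy(self, answer_num=None):
        # The change I applied: let callers override answer_num when copying.
        new_num = self.answer_num if answer_num is None else answer_num
        return replace(self, answer_num=new_num)

conv_judge_single_w_reference = Conversation()
conv = conv_judge_single_w_reference.copy(answer_num=1)  # single-answer case
```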
After changing this, I got the code to run; however, I do not see any judgment in the output. Here is a sample output:
```json
{"question_id": 0, "score": [{"logprobs": -7.0179795026779175}, {"logprobs": -7.0179795026779175}], "question_body": "question", "answer1_body": "generated_answer", "answer2_body": "reference_answer", "answer1_model_id": "alpaca-native", "answer2_model_id": "alpaca-native", "answer1_metadata": {"decoding_method": "top_p_sampling"}, "answer2_metadata": {"decoding_method": "top_p_sampling"}, "pred_id": "ie5CkG9JTxcCYmAwt3pwrj", "pred_text": "10", "pred_model_id": "7b-full-model", "tstamp": 1703790064.0357897, "reference": "reference_answer"}
```
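For completeness, this is how I read the `pred_text` field back from the output (layout assumed from the sample above); it only ever contains a bare score like "10", with no judgment text:

```python
import json

# One output record in the (assumed) shape shown above, trimmed to the
# judgment-related fields.
output_line = json.dumps({
    "question_id": 0,
    "pred_text": "10",
    "pred_model_id": "7b-full-model",
})

record = json.loads(output_line)
score = float(record["pred_text"])  # here pred_text holds a single bare score
```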
I was wondering if you could help me run this code properly and point out anything I am doing wrong.

Best,
Sergio
