Hi, I've read your paper carefully, but I'm still unsure how to evaluate your benchmark on other models. Could you share a concrete, filled-in example of "Listing 1: Tutor Evaluation Prompt"? Specifically: how is `full_conversation` formatted, and what exactly do `ai_response` and `student_response` refer to? Since a conversation contains many student responses, do you evaluate only the last one?
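To make the question concrete, here is a minimal Python sketch of how I currently *guess* the prompt is assembled. Everything in it (the `TUTOR_EVAL_PROMPT` template text, the role labels, and which turns map to which placeholder) is my own assumption, not taken from your paper, so please correct whatever is wrong:

```python
# Hypothetical template loosely mirroring "Listing 1: Tutor Evaluation Prompt".
# The wording and placeholder layout here are my guesses.
TUTOR_EVAL_PROMPT = """You are evaluating a tutor's response.

Conversation so far:
{full_conversation}

Student's last message:
{student_response}

Tutor's response to evaluate:
{ai_response}

Rate the tutor's response..."""


def build_prompt(turns: list[dict[str, str]]) -> str:
    """Fill the template from a list of {"role", "content"} turns.

    My guess: `full_conversation` is every earlier turn rendered as
    "Role: text", `student_response` is the LAST student turn only, and
    `ai_response` is the tutor turn that follows it (the one being judged).
    """
    full_conversation = "\n".join(
        f"{t['role'].capitalize()}: {t['content']}" for t in turns[:-2]
    )
    student_response = turns[-2]["content"]  # last student turn?
    ai_response = turns[-1]["content"]       # tutor turn under evaluation?
    return TUTOR_EVAL_PROMPT.format(
        full_conversation=full_conversation,
        student_response=student_response,
        ai_response=ai_response,
    )
```

Is this roughly the intended pipeline, or does the evaluation instead score every tutor turn in the conversation separately?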