Confusion on how to evaluate models using this benchmark. #2

Open
@shiwk20

Description

Hi, I have read your paper carefully, but I'm still confused about how to evaluate other models on your benchmark. Can you share a real example of 'Listing 1: Tutor Evaluation Prompt'? For example, how is full_conversation formatted, and what do airesponse and student_response refer to? A conversation contains many student responses; do you only evaluate the last one?
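To make my question concrete, here is my current guess at how the placeholders might be filled. The placeholder names come from Listing 1, but the transcript formatting, the template string, and the "use the last tutor/student turn" interpretation are all my assumptions:

```python
def format_conversation(turns):
    """Render a list of (role, text) turns as a plain-text transcript.

    Assumption: full_conversation is just 'Role: text' lines, one per turn.
    """
    return "\n".join(f"{role}: {text}" for role, text in turns)


def build_eval_prompt(template, turns):
    """Fill the evaluation prompt template from a conversation.

    Assumption: airesponse is the LAST tutor turn and student_response is
    the LAST student turn (the one the tutor is replying to). Is that the
    intended interpretation, or is every tutor turn evaluated separately?
    """
    full_conversation = format_conversation(turns)
    airesponse = next(t for role, t in reversed(turns) if role == "Tutor")
    student_response = next(t for role, t in reversed(turns) if role == "Student")
    return (
        template
        .replace("{full_conversation}", full_conversation)
        .replace("{airesponse}", airesponse)
        .replace("{student_response}", student_response)
    )


# Hypothetical template, NOT the one from the paper:
template = (
    "Conversation so far:\n{full_conversation}\n\n"
    "Student said: {student_response}\n"
    "Tutor replied: {airesponse}\n"
    "Rate the tutor reply."
)

turns = [
    ("Student", "I don't get step 2."),
    ("Tutor", "Which part is unclear?"),
    ("Student", "The division."),
    ("Tutor", "Remember to invert the fraction before multiplying."),
]

print(build_eval_prompt(template, turns))
```

If you could confirm or correct this sketch (especially whether only the final exchange is judged), that would answer most of my question.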
