Possibility to benchmark with MedCalc-Bench Verified instead of v1.0?

Hi, 

I was wondering whether future iterations could be benchmarked with MedCalc-Bench Verified instead of v1.0. We've corrected almost 1/3 of the labels, which had either an incorrect computation or an incorrect extraction of the relevant entities, both of which affect the ground truth. We also made other improvements, such as finding notes that may be better applications for certain calculators and using o4-mini to rewrite notes so that they read more like the writing style of the notes from PMC. The dataset is available here: https://huggingface.co/datasets/nsk7153/MedCalc-Bench-Verified.

I know I also posted this in the HELM GitHub issues, but I want everyone to use the most accurate version to date, since the changes can affect the ranking of stronger models.

Thanks for taking the time to read my request! 

Possibility to benchmark with MedCalc-Bench Verified instead of v1.0? #18