Hi,
I was wondering if future iterations could be benchmarked with MedCalc-Bench Verified instead of v1.0. We've corrected almost 1/3 of the labels, which had either an incorrect computation or an incorrect extraction of the relevant entities; both kinds of error affect the ground truth. We also made other improvements, such as finding notes that may be better suited to certain calculators and using o4-mini to rewrite notes so that they read more like the notes from PMC. The dataset is available here: https://huggingface.co/datasets/nsk7153/MedCalc-Bench-Verified
I know I also posted this in the HELM GitHub issues, but I just want everyone to use the most accurate version to date, since the changes can affect the ranking of stronger models.
Thanks for taking the time to read my request!