🚀 Feature Request
Provide a metric that uses Math-Verify to parse and compare mathematical expressions with more flexibility than `InContextLearningGenerationExactMatchAccuracy`.
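For illustration, here is the kind of comparison Math-Verify enables via its `parse` and `verify` functions. The expected result reflects my understanding of the library's default behavior (numeric/symbolic equivalence checking) and is not verified output:

```python
from math_verify import parse, verify

# Exact string match would score this pair as wrong; Math-Verify
# compares the parsed mathematical values instead.
gold = parse("$\\frac{1}{2}$")
answer = parse("0.5")
print(verify(gold, answer))  # expected: True (the gold answer comes first)
```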
Motivation
The Hugging Face blog post https://huggingface.co/blog/math_verify_leaderboard reports that overly simple methods for evaluating LLM math performance can give very misleading results, a problem that Math-Verify addresses.
[Optional] Implementation
Create a `MathVerifyAccuracy` class that inherits from `InContextLearningMetric`, in `llmfoundry/eval/metrics/nlp.py` or perhaps a new `llmfoundry/eval/metrics/math.py`. The implementation is relatively straightforward (see the sketch below), and I would be happy to carry it out if desired.
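As a rough illustration, here is a minimal sketch of what such a class might look like. It assumes that `InContextLearningMetric` follows the standard torchmetrics `add_state`/`update`/`compute` pattern and that `update` receives decoded output strings plus a list of acceptable reference answers per sample, mirroring `InContextLearningGenerationExactMatchAccuracy`; the exact signature would need to be matched against the current llm-foundry code.

```python
import torch
from math_verify import parse, verify

from llmfoundry.eval.metrics.nlp import InContextLearningMetric


class MathVerifyAccuracy(InContextLearningMetric):
    """Counts a generation as correct if Math-Verify judges it
    mathematically equivalent to any of the reference answers.

    Sketch only: the update() signature is assumed to mirror
    InContextLearningGenerationExactMatchAccuracy and may need adjusting.
    """

    def __init__(self, dist_sync_on_step: bool = False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        self.add_state('correct', default=torch.tensor(0.0), dist_reduce_fx='sum')
        self.add_state('total', default=torch.tensor(0.0), dist_reduce_fx='sum')

    def update(self, batch: dict, outputs: list[str], labels: list[list[str]]):
        for output, references in zip(outputs, labels):
            # parse() extracts and normalizes the final answer from raw text.
            answer = parse(output)
            # verify() takes the gold answer first; accept a match against
            # any of the acceptable references.
            if any(verify(parse(ref), answer) for ref in references):
                self.correct += 1
            self.total += 1

    def compute(self):
        return self.correct / self.total
```

Whether Math-Verify's extraction should be configurable (e.g., LaTeX answers vs. plain expressions) is a design question that could be settled during review.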