Closed
Describe the bug
The coqa benchmark failed silently (score of 0) because it was configured with Metrics.exact_match, which is too strict for conversational QA. Additionally, the prompt function needed logic to handle one-to-many document generation (one story -> multiple Q&A pairs), which the core logic did not support efficiently.
To Reproduce
```python
from lighteval.tasks.tasks.coqa import coqa_first_question

# Check metric config
print(coqa_first_question.metrics)  # Was [Metrics.exact_match]
```

Expected behavior
- Metrics should be Metrics.f1_score.
- The core logic (lighteval_task.py) should support formatters returning multiple Doc objects per item.
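A sketch of the one-to-many support being requested. The Doc fields, the prompt-function signature, and the build_docs helper below are hypothetical stand-ins for lighteval's internals, shown only to illustrate the flattening logic:

```python
from dataclasses import dataclass


@dataclass
class Doc:
    # Hypothetical minimal stand-in for lighteval's Doc.
    query: str
    gold: str


def coqa_prompt_fn(item: dict) -> list[Doc]:
    # One story yields one Doc per Q&A turn (one-to-many).
    story = item["story"]
    return [
        Doc(query=f"{story}\n\nQ: {q}\nA:", gold=a)
        for q, a in zip(item["questions"], item["answers"])
    ]


def build_docs(dataset, prompt_fn):
    # Core-logic change: accept either a single Doc or a list
    # of Docs per dataset item, and flatten into one stream.
    docs = []
    for item in dataset:
        out = prompt_fn(item)
        docs.extend(out if isinstance(out, list) else [out])
    return docs


sample = {
    "story": "Anna lives in Paris.",
    "questions": ["Where does Anna live?", "Who lives there?"],
    "answers": ["Paris", "Anna"],
}
docs = build_docs([sample], coqa_prompt_fn)
print(len(docs))  # 2 -- one story expanded into two Docs
```

The key point is the extend-vs-append branch: formatters that return a single Doc keep working, while CoQA-style formatters can expand one record into many evaluation documents.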
Version info
- OS: mac
- Lighteval version: main (local development)