Fix CoQA metric and multi-doc loading #1156

@pjavanrood

Description

Describe the bug

The coqa benchmark failed silently (score of 0) because it was configured with Metrics.exact_match, which is too strict for conversational QA, where answers are free-form and rarely match the reference verbatim. Additionally, the prompt function needed logic to handle one-to-many document generation (one story -> multiple Q&A pairs), which the core task-loading logic did not support.
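A minimal sketch of the one-to-many shape described above: one CoQA story expands into one document per Q&A pair. The `Doc` dataclass and `coqa_prompt_fn` here are simplified stand-ins, not lighteval's actual API; field names and the raw item layout are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    # Simplified stand-in for lighteval's Doc object (assumed fields).
    query: str
    choices: list = field(default_factory=list)
    gold_index: int = 0


def coqa_prompt_fn(item: dict) -> list[Doc]:
    """Hypothetical one-to-many formatter: one story yields one Doc per Q&A pair."""
    docs = []
    for question, answer in zip(item["questions"], item["answers"]):
        docs.append(
            Doc(
                query=f"{item['story']}\n\nQ: {question}\nA:",
                choices=[answer],
                gold_index=0,
            )
        )
    return docs


item = {
    "story": "Ann has a cat named Tom.",
    "questions": ["Who has a cat?", "What is its name?"],
    "answers": ["Ann", "Tom"],
}
print(len(coqa_prompt_fn(item)))  # one Doc per Q&A pair -> 2
```

The core loader then has to flatten these per-item lists into a single document stream, rather than assuming one formatter call returns exactly one Doc.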

To Reproduce

from lighteval.tasks.tasks.coqa import coqa_first_question
# Check metric config
print(coqa_first_question.metrics) # Was [Metrics.exact_match]

Expected behavior

  • The task's metric should be Metrics.f1_score, which gives partial credit for token overlap.
  • The core logic (lighteval_task.py) should support formatters returning multiple Doc objects per item.

Version info

  • OS: mac
  • Lighteval version: main (local development)
