Fix key mismatch and context access in PubMedQA #1142

@pjavanrood

Description

Describe the bug

The pubmed_qa_prompt function fails with a KeyError: 'QUESTION' and an IndexError. The function expects uppercase keys (QUESTION, CONTEXTS) and a flat context structure, whereas the actual dataset uses lowercase keys (question, context) and a nested context['contexts'] structure.
Additionally, the gold answer in the dataset is lowercase, but the LLM sometimes generates the answer in uppercase, which leads to false-negative judgements under a case-sensitive exact match.
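A minimal sketch of the fixed prompt function, assuming a dataset row shaped as described above (the field names question, context["contexts"], and final_decision come from this report; the exact prompt wording here is illustrative, not lighteval's actual template):

```python
def pubmed_qa_prompt(line: dict) -> dict:
    # Dataset uses lowercase keys and a nested context structure,
    # so access line["question"] (not "QUESTION") and
    # line["context"]["contexts"] (not a flat "CONTEXTS").
    question = line["question"]
    contexts = line["context"]["contexts"]
    query = "\n".join(contexts) + f"\nQuestion: {question}\nAnswer:"
    return {
        "query": query,
        "choices": [line["final_decision"]],
        "gold_index": 0,
    }
```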

To Reproduce

task = "pubmedqa|0"

pipeline = Pipeline(
    tasks=task,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config,
)

pipeline.evaluate()
pipeline.save_and_push_results()
pipeline.show_results()
in Pipeline.__init__(self, tasks, pipeline_parameters, evaluation_tracker, model_config, model, metric_options)
    140 # We init tasks first to fail fast if one is badly defined
    141 self._init_random_seeds()
--> 142 self._init_tasks_and_requests(tasks=tasks)
    144 self.model_config = model_config
    145 self.accelerator, self.parallel_context = self._init_parallelism_manager()
...
     30         choices=[line["final_decision"]],
     31         gold_index=0,
     32     )

KeyError: 'QUESTION'

Expected behavior

  • The function should correctly access the lowercase keys and the nested path to contexts.
  • The metric should be a case-insensitive exact match.
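The second point amounts to normalizing case before comparing prediction and gold. A sketch of such a comparison (the function name is illustrative, not lighteval's metric API):

```python
def case_insensitive_exact_match(prediction: str, gold: str) -> bool:
    # Normalize case and surrounding whitespace before comparing,
    # so a generated "YES" matches the lowercase gold answer "yes".
    return prediction.strip().lower() == gold.strip().lower()
```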

Version info

  • OS: mac
  • Lighteval version: main (local development)
