Describe the bug
The `pubmed_qa_prompt` function fails with a `KeyError: 'QUESTION'` and an `IndexError`. The function expects uppercase keys (`QUESTION`, `CONTEXTS`) and a flat context structure, whereas the actual dataset uses lowercase keys (`question`, `context`) and a nested `context["contexts"]` structure.
Additionally, the gold answer in the dataset is lowercase, but the LLM sometimes generates the answer in uppercase, which leads to false-negative judgments.
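To illustrate the second problem, here is a minimal sketch of a case-insensitive exact-match comparison (`exact_match_case_insensitive` is a hypothetical helper name, not an existing lighteval function) that would avoid the false negatives:

```python
def exact_match_case_insensitive(prediction: str, gold: str) -> bool:
    # Normalize both sides before comparing so "YES" matches the
    # lowercase gold answer "yes" stored in the PubMedQA dataset.
    return prediction.strip().lower() == gold.strip().lower()
```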
To Reproduce

```python
task = "pubmedqa|0"
pipeline = Pipeline(
    tasks=task,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config,
)
pipeline.evaluate()
pipeline.save_and_push_results()
pipeline.show_results()
```

Traceback (excerpt):

```
in Pipeline.__init__(self, tasks, pipeline_parameters, evaluation_tracker, model_config, model, metric_options)
    140 # We init tasks first to fail fast if one is badly defined
    141 self._init_random_seeds()
--> 142 self._init_tasks_and_requests(tasks=tasks)
    144 self.model_config = model_config
    145 self.accelerator, self.parallel_context = self._init_parallelism_manager()
...
    30     choices=[line["final_decision"]],
    31     gold_index=0,
    32 )
KeyError: 'QUESTION'
```

Expected behavior
- The function should access the lowercase keys and the nested `context["contexts"]` path correctly.
- The metric should be a case-insensitive exact match.
Version info
- OS: macOS
- Lighteval version: main (local development)