Description
Problem
The current squad_v2 implementation filters out all unanswerable questions via hf_filter:
hf_filter=lambda line: any(ans for ans in line["answers"]["text"] if len(ans) > 0),
This removes all 5945 unanswerable questions from the 11873 validation questions (50.1%).
The filter is probably there because the original SQuAD 2.0 evaluation relies on extractive span selection with a confidence-based "no answer" threshold, which does not translate directly to generative LLM evaluation. However, the ability to detect unanswerable questions is the core feature that distinguishes SQuAD 2.0 from SQuAD 1.1. As described in the original paper:
“To generate train, development, and test splits, we used the same partition of articles as SQuAD 1.1, and combined the existing data with our new data for each split. For the SQuAD 2.0 development and test sets, we removed articles for which we did not collect unanswerable questions.”
By filtering them out, the current implementation effectively evaluates SQuAD 1.1 minus the questions from articles that were not used to generate unanswerable questions, rather than SQuAD 2.0.
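To illustrate what the filter does, here is a minimal sketch that applies the same predicate to two hand-written rows. The rows only mimic the squad_v2 schema (an empty `answers["text"]` list for unanswerable questions); they are illustrative and not taken from the dataset.

```python
# Same predicate as the current hf_filter.
keep = lambda line: any(ans for ans in line["answers"]["text"] if len(ans) > 0)

# Illustrative rows that mimic the squad_v2 schema (not real dataset rows).
answerable = {
    "question": "In what country is Normandy located?",
    "answers": {"text": ["France"], "answer_start": [159]},
}
unanswerable = {
    "question": "Who gave their name to Normandy in the 1000's?",
    "answers": {"text": [], "answer_start": []},
}

rows = [answerable, unanswerable]
kept = [row for row in rows if keep(row)]

# Only the answerable row survives; the unanswerable one is never evaluated.
print(len(kept), kept[0]["answers"]["text"])
```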
Proposal
Adapt the task for generative evaluation by instructing the model to output "unanswerable" when the question cannot be answered from the context, and evaluate with EM + F1 over both answerable and unanswerable questions.
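A rough sketch of how this could be scored, assuming the model is told to answer with the literal string "unanswerable". The names `NO_ANSWER`, `build_prompt`, `exact_match`, and `f1` are hypothetical and not part of the existing task code, and the prompt wording is only an example. Normalization follows the usual SQuAD convention (lowercase, strip punctuation and articles), and a "no answer" prediction only scores when the gold answer list is empty.

```python
import re
import string
from collections import Counter

NO_ANSWER = "unanswerable"  # assumed sentinel the model is instructed to output


def build_prompt(context: str, question: str) -> str:
    # Hypothetical prompt wording; the exact instruction is an open design choice.
    return (
        f"Context: {context}\nQuestion: {question}\n"
        f'If the question cannot be answered from the context, reply "{NO_ANSWER}".\n'
        "Answer:"
    )


def normalize(text: str) -> str:
    # SQuAD-style normalization: lowercase, drop punctuation, drop articles, fix spaces.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def to_tokens(text: str) -> list[str]:
    # The sentinel is mapped to "no tokens", like an empty gold answer.
    norm = normalize(text)
    return [] if norm == NO_ANSWER else norm.split()


def exact_match(pred: str, golds: list[str]) -> float:
    golds = golds or [""]  # unanswerable questions have no gold answers
    return float(any(to_tokens(pred) == to_tokens(g) for g in golds))


def f1(pred: str, golds: list[str]) -> float:
    golds = golds or [""]
    pred_toks = to_tokens(pred)
    best = 0.0
    for gold in golds:
        gold_toks = to_tokens(gold)
        if not pred_toks or not gold_toks:
            # If either side is "no answer", score 1 only when both are.
            score = float(pred_toks == gold_toks)
        else:
            overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
            if overlap == 0:
                score = 0.0
            else:
                precision = overlap / len(pred_toks)
                recall = overlap / len(gold_toks)
                score = 2 * precision * recall / (precision + recall)
        best = max(best, score)
    return best
```

With these helpers, a prediction of "Unanswerable." on an unanswerable question gets EM = F1 = 1, while the same prediction on an answerable question gets 0, so both halves of the validation set contribute to the aggregate scores.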