squad_v2.py : unanswerable questions are filtered out, making the task equivalent to SQuAD 1.1 #1184

Problem

The current squad_v2 implementation filters out all unanswerable questions via hf_filter:

hf_filter=lambda line: any(ans for ans in line["answers"]["text"] if len(ans) > 0),

This removes 5945 of the 11873 validation questions (50.1%), namely all of the unanswerable ones.
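To make the effect concrete, here is a small sketch that applies the same predicate to two rows. The rows are made up for illustration; they only assume the Hugging Face squad_v2 layout, where an unanswerable question has an empty `answers["text"]` list.

```python
# Same predicate as the hf_filter above, bound to a name for testing.
keep = lambda line: any(ans for ans in line["answers"]["text"] if len(ans) > 0)

# Illustrative rows (not taken from the dataset).
answerable = {
    "question": "In what country is Normandy located?",
    "answers": {"text": ["France", "France"], "answer_start": [159, 159]},
}
unanswerable = {
    "question": "Who gave their name to Normandy in the 1000s?",
    "answers": {"text": [], "answer_start": []},
}

print(keep(answerable))    # True: the row survives the filter
print(keep(unanswerable))  # False: the row is dropped

# Share of the validation set that is dropped, from the counts above.
print(round(100 * 5945 / 11873, 1))  # 50.1
```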

The filter is probably there because the original SQuAD 2.0 evaluation relies on extractive span selection with a confidence-based "no answer" threshold, which does not translate directly to generative LLM evaluation. However, the ability to detect unanswerable questions is the core feature that distinguishes SQuAD 2.0 from SQuAD 1.1. The original paper describes how the splits were built:

“To generate train, development, and test splits, we used the same partition of articles as SQuAD 1.1, and combined the existing data with our new data for each split. For the SQuAD 2.0 development and test sets, we removed articles for which we did not collect unanswerable questions.”

By filtering them out, the current implementation effectively evaluates SQuAD 1.1 minus the questions from articles that were not used to generate unanswerable questions, rather than SQuAD 2.0.

Proposal

Adapt the task for generative evaluation by instructing the model to output "unanswerable" when the question cannot be answered from the context, and evaluate with EM + F1 over both answerable and unanswerable questions.
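A minimal sketch of what this could look like, not a drop-in patch. The names `NO_ANSWER`, `build_prompt`, and `score` are hypothetical and do not come from the existing task code, and the prompt wording is only a suggestion. The normalization follows the usual SQuAD recipe (lowercase, strip punctuation and articles). For unanswerable questions, EM and F1 are both 1 only if the model outputs the sentinel. As far as I understand, this mirrors how the official SQuAD 2.0 script scores no-answer cases.

```python
import re
import string
from collections import Counter

NO_ANSWER = "unanswerable"  # hypothetical sentinel the model is asked to emit


def build_prompt(context: str, question: str) -> str:
    """Hypothetical prompt; the exact wording is an assumption."""
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"Answer with a span from the context, or '{NO_ANSWER}' "
        "if the context does not contain the answer.\n"
        "Answer:"
    )


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a prediction and one gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def score(pred: str, gold_texts: list[str]) -> tuple[float, float]:
    """Return (EM, F1); an empty gold list means the question is unanswerable."""
    if not gold_texts:
        hit = float(normalize(pred) == NO_ANSWER)
        return hit, hit
    em = max(float(normalize(pred) == normalize(g)) for g in gold_texts)
    f1 = max(token_f1(pred, g) for g in gold_texts)
    return em, f1


print(score("Unanswerable.", []))   # (1.0, 1.0)
print(score("Normandy", []))        # (0.0, 0.0)
print(score("in France", ["France", "in France"]))  # (1.0, 1.0)
```

With this scoring, the hf_filter could be dropped entirely and all 11873 validation questions would contribute to the reported EM and F1.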


