
squad_v2.py: unanswerable questions are filtered out, making the task equivalent to SQuAD 1.1 #1184

@Matteovanypersele


Problem

The current squad_v2 implementation filters out all unanswerable questions via hf_filter:

hf_filter=lambda line: any(ans for ans in line["answers"]["text"] if len(ans) > 0),

This removes all 5945 unanswerable questions, i.e. 50.1% of the 11873 validation questions.
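A minimal sketch of what the predicate does, run on hypothetical toy rows that mimic the answers field of the Hugging Face squad_v2 dataset, where an unanswerable question carries an empty answers["text"] list. The row ids and answer strings are made up for illustration, and other fields (context, question, answer_start) are omitted:

```python
# Hypothetical toy rows: only the answers["text"] field matters to the filter.
# An empty list marks an unanswerable SQuAD 2.0 question.
rows = [
    {"id": "q1", "answers": {"text": ["Normandy"]}},
    {"id": "q2", "answers": {"text": []}},
    {"id": "q3", "answers": {"text": ["10th century", "the 10th century"]}},
    {"id": "q4", "answers": {"text": []}},
]

# The predicate from squad_v2.py, copied verbatim: keep a row only if it has
# at least one non-empty gold answer string.
hf_filter = lambda line: any(ans for ans in line["answers"]["text"] if len(ans) > 0)

kept = [row["id"] for row in rows if hf_filter(row)]
dropped = [row["id"] for row in rows if not hf_filter(row)]

print(kept)     # ['q1', 'q3']
print(dropped)  # ['q2', 'q4']  (every unanswerable row is discarded)
```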

This was probably done because the original SQuAD 2.0 evaluation relies on extractive span selection with a confidence-based "no answer" threshold, which does not translate directly to generative LLM evaluation. However, the ability to detect unanswerable questions is the core feature that distinguishes SQuAD 2.0 from SQuAD 1.1. As described in the original paper:

“To generate train, development, and test splits, we used the same partition of articles as SQuAD 1.1, and combined the existing data with our new data for each split. For the SQuAD 2.0 development and test sets, we removed articles for which we did not collect unanswerable questions.”

By filtering them out, the current implementation effectively evaluates SQuAD 1.1 restricted to the articles for which unanswerable questions were collected, rather than SQuAD 2.0.

Proposal

Adapt the task for generative evaluation: instruct the model to output "unanswerable" when the question cannot be answered from the context, and compute exact match (EM) and F1 over both answerable and unanswerable questions.
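A minimal sketch of the proposal, assuming the model is asked to reply with the literal string "unanswerable" and that scoring follows the normalization, exact-match and token-F1 rules of the official SQuAD 2.0 evaluation script, in which an unanswerable gold answer is treated as the empty string. The prompt wording and the names build_prompt, canonical and score are hypothetical and not part of the existing task:

```python
import collections
import re
import string

NO_ANSWER = "unanswerable"  # assumed target string for unanswerable questions
_PUNCT = set(string.punctuation)


def build_prompt(context: str, question: str) -> str:
    """Hypothetical prompt asking for a short span or the no-answer token."""
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        f'Answer with a short span from the context, or "{NO_ANSWER}" '
        "if the context does not contain the answer.\n"
        "Answer:"
    )


def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles, collapse whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in _PUNCT)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def canonical(s: str) -> str:
    """Map the no-answer token to "", the official script's encoding of 'no answer'."""
    norm = normalize_answer(s)
    return "" if norm == NO_ANSWER else norm


def exact_match(pred: str, gold: str) -> float:
    return float(canonical(pred) == canonical(gold))


def f1(pred: str, gold: str) -> float:
    pred_toks, gold_toks = canonical(pred).split(), canonical(gold).split()
    if not pred_toks or not gold_toks:
        # No-answer case: full credit only if both sides say "no answer".
        return float(pred_toks == gold_toks)
    common = collections.Counter(pred_toks) & collections.Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision, recall = num_same / len(pred_toks), num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def score(pred: str, gold_answers: list[str]) -> tuple[float, float]:
    """Best EM and F1 over the gold answers; unanswerable questions use the no-answer target."""
    golds = [g for g in gold_answers if normalize_answer(g)] or [NO_ANSWER]
    return max(exact_match(pred, g) for g in golds), max(f1(pred, g) for g in golds)


print(score("Unanswerable.", []))        # (1.0, 1.0)
print(score("Normans", []))              # (0.0, 0.0)
print(score("the Normans", ["Normans"]))  # (1.0, 1.0)
```

Averaging these per-question scores over all 11873 validation questions gives a single SQuAD 2.0 figure. Since a model that always answers "unanswerable" would already score 50.1% EM, it may also be worth reporting the answerable and unanswerable subsets separately, as the official script does with its HasAns and NoAns breakdown.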
