I was quite surprised to see that the probing.py evaluation script relies on the find_isolated_capital_b function to assess model performance. This effectively means that a model that always outputs “B” could achieve 100% accuracy, regardless of whether it actually understands privacy norms. Since the order of the A and B answer choices is not randomized, it is perhaps unsurprising that models appear to perform so well on this benchmark.
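For illustration, here is a minimal sketch of the failure mode. It assumes the checker simply tests for a standalone capital “B” in the model’s response; the function body, the always-B baseline, and the sample questions are my own placeholders rather than code taken from probing.py:

```python
import re

def find_isolated_capital_b(text: str) -> bool:
    # Assumed reconstruction of the checker: treat any standalone
    # capital "B" token in the response as a "B" answer.
    return re.search(r"\bB\b", text) is not None

def always_b_model(question: str) -> str:
    # Degenerate baseline that ignores the question entirely.
    return "B"

# Placeholder items, not the benchmark's actual questions.
questions = [
    "Is it acceptable for the device to share this data? (A) Yes (B) No",
    "Should this information flow be allowed? (A) Yes (B) No",
]

# If an isolated "B" is always scored as correct and the option order is
# fixed, the constant-B baseline reaches 100% without reading a question.
accuracy = sum(
    find_isolated_capital_b(always_b_model(q)) for q in questions
) / len(questions)
print(f"Always-B baseline accuracy: {accuracy:.0%}")  # -> 100%
```

Randomizing the option order, or scoring against the full selected answer rather than a single letter, would rule out this trivial baseline.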
I understand that the probing questions are adapted from Shvartzshnaider et al. (Learning Privacy Expectations by Crowdsourcing Contextual Informational Norms), but their Mechanical Turk results showed clear variation in human judgments: participants did not consistently respond “No” across the board, so it is not obvious that a single fixed answer should be treated as ground truth for every question.
Could you clarify whether the benchmark was constructed with these nuances in mind? I’m concerned about how these design choices might affect the scientific validity of the reported findings.