How Is It Scientifically Valid for Ground Truth Answers to Be "B" ONLY? #3

@Sasha-Cui

Description

I was quite surprised to see that the probing.py evaluation script relies on the find_isolated_capital_b function to assess model performance. This effectively means that a model which always outputs "B" could achieve 100% accuracy, regardless of whether it actually understands privacy norms. Since the order of choices A and B is not randomized, it is perhaps unsurprising that models appear to perform so well on this benchmark.
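To make the concern concrete, here is a minimal sketch of the failure mode, assuming the grader extracts the predicted choice with a regex-style check for an isolated capital "B". The find_isolated_capital_b body and the probing items below are my paraphrase for illustration, not the repository's actual code:

```python
import re

def find_isolated_capital_b(text: str) -> bool:
    """Hypothetical stand-in for the grader: True if the output contains a standalone capital 'B'."""
    return re.search(r"\bB\b", text) is not None

def constant_b_model(question: str) -> str:
    """Degenerate baseline that ignores the question entirely."""
    return "B"

# Toy probing items (invented here for illustration). If the ground-truth
# answer is "B" for every item and choice positions are fixed, the degenerate
# baseline is marked correct on all of them.
questions = [
    f"Probing question {i}: is this information flow acceptable? (A) Yes (B) No"
    for i in range(5)
]
correct = sum(find_isolated_capital_b(constant_b_model(q)) for q in questions)
print(f"Constant-'B' baseline accuracy: {correct / len(questions):.0%}")  # 100%
```

Under such a scoring rule, the baseline's perfect score says nothing about understanding privacy norms; randomizing the A/B positions (or scoring against a per-item answer key) would remove this shortcut.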

I understand that the probing questions are adapted from Shvartzshnaider et al. (Learning Privacy Expectations by Crowdsourcing Contextual Informational Norms), but their Mechanical Turk results clearly showed variation in human judgments — participants did not consistently respond “No” across the board.

Could you clarify whether the benchmark was constructed with these nuances in mind? I’m concerned about how this design choice might affect the scientific validity of the reported findings.
