I was quite surprised to see that the probing.py evaluation script relies on the find_isolated_capital_b function to assess model performance. This effectively means that a model that always outputs “B” could achieve 100% accuracy, regardless of whether it actually understands privacy norms. Since the order of the A and B answer choices is not randomized, it is perhaps unsurprising that models appear to perform so well on this benchmark.
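For illustration, here is a minimal sketch of the failure mode. It assumes the checker simply tests for a standalone capital “B” in the model’s response; the function body, the always-B baseline, and the sample questions are my own placeholders rather than code taken from probing.py:

```python
import re

def find_isolated_capital_b(text: str) -> bool:
    # Assumed reconstruction of the checker: treat any standalone
    # capital "B" token in the response as a "B" answer.
    return re.search(r"\bB\b", text) is not None

def always_b_model(question: str) -> str:
    # Degenerate baseline that ignores the question entirely.
    return "B"

# Placeholder items, not the benchmark's actual questions.
questions = [
    "Is it acceptable for the device to share this data? (A) Yes (B) No",
    "Should this information flow be allowed? (A) Yes (B) No",
]

# If an isolated "B" is always scored as correct and the option order is
# fixed, the constant-B baseline reaches 100% without reading a question.
accuracy = sum(
    find_isolated_capital_b(always_b_model(q)) for q in questions
) / len(questions)
print(f"Always-B baseline accuracy: {accuracy:.0%}")  # -> 100%
```

Randomizing the option order, or scoring against the full selected answer rather than a single letter, would rule out this trivial baseline.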
I understand that the probing questions are adapted from Shvartzshnaider et al. (Learning Privacy Expectations by Crowdsourcing Contextual Informational Norms), but their Mechanical Turk results showed clear variation in human judgments: participants did not consistently respond “No” across the board, so it is not obvious that a single fixed answer should be treated as ground truth for every question.
Could you clarify whether the benchmark was constructed with these nuances in mind? I’m concerned about how these design choices might affect the scientific validity of the reported findings.