Stage 3 — Eval holdout (3a) + hand-labeling tool (3b)

Prompts:

Scripts:

data/03a_holdout_eval.py — stratified holdout sampler.
eval/label_tool.py — interactive CLI for labeling 300 pairs (verdict A/B/T, confidence 1-5, optional notes; supports --slice, --review, --random-order flags; resumable).
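
The flag handling and resumability described for eval/label_tool.py could be sketched roughly as below. This is not the actual tool. The record keys "slice" and "verdict", the helper names build_parser and select_queue, and the reading of --review as "revisit pairs that already have a verdict" are all assumptions made for illustration.

```python
import argparse
import random


def build_parser():
    """Parser exposing the three flags named above (semantics are assumed)."""
    parser = argparse.ArgumentParser(description="Sketch of a pair-labeling CLI")
    parser.add_argument("--slice", choices=["in_dist", "ood_religion"], default=None,
                        help="only show pairs from this slice")
    parser.add_argument("--review", action="store_true",
                        help="revisit already-labeled pairs (assumed meaning)")
    parser.add_argument("--random-order", action="store_true",
                        help="shuffle the queue before labeling")
    return parser


def select_queue(pairs, args, seed=0):
    """Return the pairs to show in this session.

    Resumability falls out of the filter: without --review, any pair
    that already carries a verdict is skipped.
    """
    queue = [p for p in pairs
             if args.slice is None or p.get("slice") == args.slice]
    if args.review:
        queue = [p for p in queue if p.get("verdict") is not None]
    else:
        queue = [p for p in queue if p.get("verdict") is None]
    if args.random_order:
        random.Random(seed).shuffle(queue)
    return queue


# Hypothetical session: label only the OOD slice, in shuffled order.
demo_args = build_parser().parse_args(["--slice", "ood_religion", "--random-order"])
demo_pairs = [
    {"id": 1, "slice": "in_dist"},
    {"id": 2, "slice": "ood_religion", "verdict": "A"},
    {"id": 3, "slice": "ood_religion"},
]
print([p["id"] for p in select_queue(demo_pairs, demo_args)])
```

Filtering on the stored verdict, rather than tracking a separate cursor, means an interrupted session can be restarted with the same flags and picks up where it left off.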

Inputs:

Outputs:

data/pairs/eval_set_unlabeled.jsonl — 300 holdout pairs (240 in-dist + 60 OOD religion). Stage 3b writes the human labels into this same file in place.
data/pairs/pairs_to_label.jsonl — 1,938 pairs destined for Claude labeling in Stage 4. Religion pairs not selected for OOD eval are excluded (preserves the holdout).
data/pairs/pairs_unused_religion.jsonl — 132 religion pairs not selected for OOD eval. Saved for transparency; not used in v1.
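
The three-way split behind these files can be sketched as follows. This is a minimal illustration, not the code in data/03a_holdout_eval.py: the record keys "category" and "stratum", the function name split_holdout, and the quota dictionaries are hypothetical, and the real stratification variable is not stated here.

```python
import random
from collections import defaultdict


def split_holdout(pairs, in_dist_quota, ood_quota, ood_category="religion", seed=0):
    """Split pairs into (eval_set, pairs_to_label, unused_ood).

    Each pair is assumed to be a dict with hypothetical keys "category"
    and "stratum"; each quota maps stratum -> number of pairs to sample.
    """
    rng = random.Random(seed)
    buckets = {"in": defaultdict(list), "ood": defaultdict(list)}
    for pair in pairs:
        side = "ood" if pair["category"] == ood_category else "in"
        buckets[side][pair["stratum"]].append(pair)

    eval_set, to_label, unused = [], [], []
    plan = (
        ("in", in_dist_quota, to_label, "in_dist"),
        ("ood", ood_quota, unused, f"ood_{ood_category}"),
    )
    for side, quota, remainder, tag in plan:
        for stratum, bucket in buckets[side].items():
            rng.shuffle(bucket)
            k = quota.get(stratum, 0)
            if k > len(bucket):
                raise ValueError(f"not enough pairs in {side}/{stratum}")
            # Sampled pairs go to the eval set, tagged with their slice.
            eval_set.extend(dict(p, slice=tag) for p in bucket[:k])
            # Everything else stays out of the eval set: in-dist leftovers
            # go on to labeling, OOD leftovers are set aside entirely.
            remainder.extend(bucket[k:])
    return eval_set, to_label, unused
```

Routing the unsampled religion pairs to their own list, rather than back into the labeling pool, is what keeps the held-out category out of training data.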

Eval set stratification (300 pairs):

OOD bucket counts (28/12/9/6/5, total 60) mirror the in-dist proportions (110/50/35/25/20, total 240): each in-dist count is scaled by 60/240, then rounded with the largest-remainder method.
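
The rounding step can be reproduced with a short largest-remainder apportionment. This is a sketch under one assumption: ties between equal fractional parts go to the earlier bucket, which is the rule that reproduces the counts above. The function name is hypothetical.

```python
from fractions import Fraction


def largest_remainder(weights, total):
    """Apportion `total` items across buckets in proportion to `weights`."""
    weight_sum = sum(weights)
    # Exact fractions keep ties from being decided by float rounding noise.
    quotas = [Fraction(w * total, weight_sum) for w in weights]
    counts = [int(q) for q in quotas]  # floor, since quotas are non-negative
    leftover = total - sum(counts)
    # Give the remaining items to the largest fractional parts;
    # ties go to the earlier bucket (assumed tie-break rule).
    order = sorted(range(len(weights)),
                   key=lambda i: (-(quotas[i] - counts[i]), i))
    for i in order[:leftover]:
        counts[i] += 1
    return counts


in_dist = [110, 50, 35, 25, 20]
print(largest_remainder(in_dist, 60))  # [28, 12, 9, 6, 5]
```

The exact quotas are 27.5, 12.5, 8.75, 6.25 and 5, which floor to 58 in total. The two leftover slots go to the 8.75 bucket and then to the first of the two tied .5 buckets.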

Decisions made:

#10 — Replaced CrowS-Pairs with a held-out BBQ category for v1 OOD. CrowS-Pairs tests "which sentence reflects a stereotype", which is a different task, and some of its "biased" sentences are factually correct.
#11 — Holdout = religion only (single category). A two-axis holdout (religion + disability_status) was considered, but it would have removed 28% of the SFT pool. Religion-only removes 19%, keeps DPO closer to primer targets, and still gives a defensible "judge never trained on religion bias" story.

Key outputs:

The label tool's --slice flag is what made hand-labeling 300 pairs across multiple sessions tractable: batching by in_dist or ood_religion avoids context-switching the rubric mid-session.
The 6-10 hours of hand-labeling time is non-negotiable: this is the foundation of every reported metric, including the eval κ that decides whether the trained judge actually got better.
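
For reference, an agreement score between the human labels and a judge's verdicts can be computed as below. The text above only says "κ"; treating it as Cohen's kappa over the A/B/T verdicts is an assumption, and the labels in the example are toy values, not results from this project.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, so report full agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy example only: six hand labels against six judge verdicts.
human = ["A", "A", "B", "B", "T", "A"]
judge = ["A", "B", "B", "B", "T", "A"]
print(round(cohens_kappa(human, judge), 3))
```

Because kappa discounts chance agreement using the label marginals, it depends directly on the hand labels being right, which is why the labeling time cannot be cut.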
