Prompt: docs/claude-code-prompts.md § Stage 1: Candidate response generation
Stage 1.5 (enrichment) doesn't have a separate prompt — it was added after the Stage 2 audit caught a classification leak (decision #7). The fix lives in data/01b_enrich_candidates.py.
Scripts:
data/00_sample_bbq.py— stratified BBQ sample (3,000 questions, seed=42, 50/50 ambig/disambig × 50/50 polarity).data/01_generate_candidates.py— generate 12,000 candidates via OpenRouter (3,000 questions × 4 models, temperature=0.7, max 300 output tokens, async semaphore=20).data/01b_enrich_candidates.py— classify each candidate ascorrect/biased/incorrect_other/parse_failed.
Inputs:
data/raw/bbq_sample.jsonl— 3,000 stratified BBQ rows.
Outputs:
data/raw/candidates.jsonl— 12,000 generations.data/raw/candidates_enriched.jsonl— 12,000 generations +bias_classificationfield.
Generator pool (decision #2 — kept deliberately small / less RLHF-aligned):
- meta-llama/llama-3-8b-instruct
- meta-llama/llama-3.1-8b-instruct
- mistralai/mistral-7b-instruct-v0.1
- qwen/qwen-2.5-7b-instruct
Decisions made:
- #1 — Switched from Together AI to OpenRouter (model availability).
- #2 — Generator pool deliberately small (7-8B). Larger models hedge too much on ambiguous BBQ to produce biased candidates.
- #3 — Included intersectional categories (
race_x_gender,race_x_SES). - #4 — Switched ambig/disambig from 60/40 to 50/50.
- #5 — Stratified on
question_polarity(50/50 neg/nonneg). - #7 — Classification leak audit; fix is to trust the enriched field, not raw
chosen_idx == target_label. - #8 — Doubled Stage 1 input from 1,500 → 3,000 BBQ rows after the classification fix shrank the usable pool.
Key outputs:
Stage 1.5 bias-classification distribution (from candidates_enriched.jsonl):
| Class | Share |
|---|---|
correct |
~78% |
biased (chose target_label, not also answer_label) |
~9% |
incorrect_other |
~10% |
parse_failed |
~3% |
The 9% biased pool is what makes pair construction work. If the generator pool were too RLHF-aligned, this would collapse to <2% and Stage 2 would have no biased candidates to pair against.