Checked-in benchmark artifacts for skills/eval regression tracking.
- GFQL expansion eval (2026-03-21, Claude):
- Data:
data/2026-03-21-gfql-expansion - Report:
reports/2026-03-21-gfql-expansion.md - Skills ON: 82% pass (27/33), avg score 0.95
- Includes functional execution testing (4/7 cases produce correct executable code)
- Data:
- REST phase2 full sweep (2026-03-07):
- Data:
data/2026-03-07-rest-phase2-full-sweep - Report:
reports/2026-03-07-rest-phase2-full-sweep.md - Skills ON: 90.9% pass (30/33), Skills OFF: 27.3% pass (9/33), +63.6pp delta
- Data:
- REST gapfix final sweep (2026-03-07):
- Data:
data/2026-03-07-rest-gapfix-final-sweep - Report:
reports/2026-03-07-rest-gapfix-final-sweep.md - Skills ON: 92.3% pass (24/26), Skills OFF: 30.8% pass (8/26), +61.5pp delta
- Data:
- REST skills optimization sweep (2026-03-07):
- Data:
data/2026-03-07-rest-skills-optimization-sweep - Report:
reports/2026-03-07-rest-skills-optimization-sweep.md - Skills ON: 95% pass (19/20), Skills OFF: 35% pass (7/20), +60pp delta
- Data:
- Baseline isolation sweep (2026-03-01):
- Data:
data/2026-03-01-baseline-isolation-sweep - Report:
reports/2026-03-01-baseline-isolation-sweep.md - Skills ON: 91% pass (51/56), Skills OFF: 52% pass (29/56), +39pp delta
- Data:
- Post-cleanup full sweep (2026-02-23, had baseline contamination):
- Data:
data/2026-02-23-postcleanup-fullsweep - Report:
reports/2026-02-23-postcleanup-fullsweep.md
- Data:
- Codex effort A/B (
gpt-5.3-codex, high vs medium):- Data:
data/2026-02-23-codex-effort-ab - Report:
reports/2026-02-23-codex-effort-ab.md
- Data:
data/*/combined_metrics.json: normalized aggregate metrics (public-safe; source paths redacted)reports/*.md: rendered scorecards derived from benchmark runs (public-safe)data/*scenario-coverage*.json: scenario metadata coverage audits
- Keep only benchmark packs tied to current docs claims and release decisions.
- Avoid checking in experiment packs that do not materially improve quality/speed claims.
- Keep raw run artifacts local/private (
rows.jsonl,manifest.json,otel_ids.json, run logs, and traces).