paiml · noahgift · May 22, 2026 · May 22, 2026
diff --git a/.gitignore b/.gitignore
@@ -31,3 +31,6 @@ mutants.out.old/
 
 # mdBook output (book/src is committed; book/book is generated)
 /book/book/
+
+# Claude Code runtime scheduling artifacts (not project content)
+/.claude/
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -255,7 +255,7 @@ is the project blueprint. Major behavioral changes update both:
 1. The contract's `status_history` (factual, machine-readable record).
 2. The spec markdown (narrative + milestone roll-up).
 
-Status as of v1.32.0 (2026-05-19): M0–M280 all SHIPPED; corpus complete; CCPA work SUSPENDED at M280 pending aprender#1789
+Status as of v1.32.0 (2026-05-22): M0–M296 all SHIPPED; corpus complete; V1_004 still open; project in clean three-month-break handoff state at M296 (see evidence/phase-6/m296-three-month-break-closeout-2026-05-22.md)
 (30/30 API + 4 OS); 18/18 gates registered (16 ACTIVE_RUNTIME-track + 2 PROPOSED at v1.29.0 — CCPA-017 project-scale + CCPA-018 arena recovery-rate, both await operator dispatch); companion ↔ aprender round-trip
 mechanically guarded. **M32d numerical-parity FUNCTIONALLY DISCHARGED**
 2026-05-02 (aprender PR #1228 squash 5235aaeb9): output transition

diff --git a/README.md b/README.md
@@ -40,7 +40,7 @@ mdbook build book/ && open book/book/index.html
 | Canonical corpus aggregate | **1.0000** | 30 | M150 — `fixtures/canonical/measured-parity.json` |
 | Contract version | v1.32.0 | — | aprender-side authoritative |
 | Falsification gates | 20/20 registered | — | 16 ACTIVE_RUNTIME + 4 PROPOSED |
-| Sub-milestones shipped | M0–M280 all SHIPPED | — | continuous since 2026-04-26 |
+| Sub-milestones shipped | M0–M296 all SHIPPED | — | continuous since 2026-04-26 |
 
 **Honest framing**: at function-scale, the two systems are functionally interchangeable. At project-scale, the static-fixture approach is **Popperian-falsified** as a project-scale predictor. Contract at v1.32.0 (aprender-side authoritative). The full 3-axis breakdown is in [`docs/specifications/completeness-assessment.md`](docs/specifications/completeness-assessment.md).
 

diff --git a/docs/specifications/claude-code-parity-apr-poc.md b/docs/specifications/claude-code-parity-apr-poc.md
diff --git a/docs/specifications/completeness-assessment.md b/docs/specifications/completeness-assessment.md
@@ -1,10 +1,10 @@
-# Completeness assessment (2026-05-18, post-M266)
+# Completeness assessment (2026-05-22, post-M296)
 
 [Top spec: claude-code-parity-apr-poc.md](claude-code-parity-apr-poc.md) | [Risks](risks.md) | [Status snapshots](status-snapshots.md) | [Axis-2 closure plan](axis-2-closure-plan.md) | [Design audit](design-audit.md)
 
 Authored at M111 in response to operator's "what percentage complete is it" question. Foregrounds the M2.3 OOS-rescope's unaddressed differential-test gap that the headline "30/30 fixtures aggregate=1.0000" claim obscured. **M113 follow-up**: see [axis-2-closure-plan.md](axis-2-closure-plan.md) for the 5-idea brainstorm and recommended (2)→(3) sequence to close Axis 2 from ~30% to ~70% (CLI subprocess instrumentation + SWE-bench differential evaluation). **M118 follow-up**: [deepclaude](https://github.com/aattaran/deepclaude) prior art positively discharges idea (1)'s technical-feasibility doubt; cost re-estimated from ~1-2 weeks to ~3-7 days. **M136-M140 progress**: axis-2-closure-plan idea (2) MACHINERY is now SHIPPED end-to-end (M136 capture binary + M137 differ + M139 corpus + gate + M140 contract bump in flight). **M150-M154 progress** *(M155 update)*: Phase 3 outcome-parity path SHIPPED end-to-end on a public benchmark (MultiPL-E-Rust) — real claude + real apr code (Qwen2.5-Coder-1.5B) bilateral bench produced **outcome parity = 1.0000 (5/5)**, **structural similarity = 0.5201 line-set Jaccard**, and **test-survival = 1.0000 (10/10 cross-swaps)** — proving the two systems are functionally interchangeable on this POC corpus. **Axis 2 score now moves from ~50% → ~70%**: meter machinery for OS-level differential testing is built AND real-binary execution evidence is shipped. See § "Are we at parity with Claude Code?" below for the M155 honest current state.
 
-## Completeness assessment (2026-05-18, post-M266)
+## Completeness assessment (2026-05-22, post-M296)
 
 The headline numbers (M0–M244 SHIPPED, 19/19 gates registered (16 ACTIVE_RUNTIME-track + 3 PROPOSED at companion v1.31.0: CCPA-017 project-scale, CCPA-018 arena recovery-rate, CCPA-019 calibration-required-before-verdict), 30/30 API-level fixtures aggregate=1.0000, 21-fixture MultiPL-E-Rust outcome-parity corpus + 2 validation layers (M172 structural + M174 deep) + pre-commit hook integration (M176), 5-fixture project-scale corpus at `fixtures/project-scale/` (M182 + M188 P4.x), 15-fixture calibration-and-scale corpus at `fixtures/calibration-and-scale/` (M242), Phase 5 Arena harness end-to-end SHIPPED (M196-M210), Branch B harness rework SHIPPED (M234 + M236-M244), companion contract v1.31.0 — aprender v1.30.0 awaiting v1.31.0 catch-up via aprender#1778) are technically true but require the Branch B caveat below. Honest 3-axis breakdown:
 

diff --git a/docs/specifications/milestones-m101-m111.md b/docs/specifications/milestones-m101-m111.md
@@ -118,6 +118,8 @@ criteria authored at M101:
 
 | ID | Deliverable | Squash | PR |
 |----|-------------|--------|-----|
+| **M296** | **Three-month operator-directed break closeout** — V1_004 chain wraps with V1_004 still open but empirically narrowed. Session shipped 12 PRs (7 CCPA + 5 aprender) spanning M286-M295 + this M296 closeout. **The story**: M280 SUSPENSION un-blocked at M286 when aprender#1832 shipped M32d MoE KV cache (19× speedup). M287 greedy baseline confirmed Qwen3-Coder-30B-A3B `driver_error` pattern. M288-M290 5 aprender PRs (sampling + EOS + clean_chat_output + few-shot in CODE_SYSTEM_PROMPT) fixed three infrastructure gaps. M291 sub-bench B pattern shifted from `driver_error` to `oracle_failed_after_max_turns` with `tool_use_count: 0` — revealing the agent-quality bottleneck. M292 shipped `ArenaOutcome::AgentTextLoop` detector + 7 tests as Gap 3 closure. M293 wired `PHASE6_MAX_CONSECUTIVE_TEXT_TURNS` env-var. M294 scoped + dispatched the non-Coder Qwen3-30B-A3B-Instruct-2507 A/B; smoke confirmed clean tool_call JSON emission in 20 tokens. M295 shipped professional README + 28-chapter mdBook + GitHub Pages auto-deploy (now live at https://paiml.github.io/claude-code-parity-apr/). **Bench-level partial refutation**: F1 of the non-Coder Instruct bench produced `driver_error` at turn 8, tool_use_count=0, 8 Markdown turns — same pattern as Coder family. The smoke-vs-bench divergence surfaces a second-order constraint: apr code's multi-turn prompt context (rendered history with previous turn's Markdown + "### Continue:" suffix) self-recursively reinforces the Markdown distribution even on a finetune that emits tool_call JSON in 1-shot smoke. **Three resumption paths scoped** in `evidence/phase-6/m296-three-month-break-closeout-2026-05-22.md`: (a) investigate `render_history` + per-turn prompt construction, (b) post-decode Markdown→tool_call parser in apr code (unlocks Qwen-Coder family for V1_004 as written), (c) V1_005 against different model class on Lambda Labs A100/H100 (Llama-3.3-70B, DeepSeek-V3, Qwen3-32B-Instruct dense). **Project handoff state**: no in-flight benches, no orphan processes, 5 partial evidence archives captured (`evidence/under-contract-*partial-*`), book deployed, M-counter bumped 5 surfaces (README, CONTRIBUTING, top spec, status-snapshots, milestones). No new code in crates/, no schema bump, no contract YAML bump at M296. M-counter M280 → M296 (15 substantive M-rows across V1_004 chain + book + closeout). | `(this PR)` | this PR |
+| **M286-M295** | **The V1_004 chain (12-PR session, 2026-05-20 through 2026-05-22)** — full empirical isolation of the Qwen-Coder finetune-distribution variable. Per-PR narrative captured in `evidence/phase-6/m296-three-month-break-closeout-2026-05-22.md`. Cross-references: `evidence/phase-6/m32d-shipped-2026-05-20.md` (M286), `m32d-bench-pattern-2026-05-20.md` (M287), `v1004-3knob-dispatch-recipe-2026-05-20.md` (M288), `v1004-3knob-plumbing-shipped-2026-05-20.md` (M289), `v1004-followup-snapshot-2026-05-20.md` (M290), `v1004-sub-bench-b-pattern-shift-2026-05-21.md` (M291), `v1004-agent-text-loop-detector-2026-05-21.md` (M292), CCPA#259/260/261/262/263 + aprender#1832/1837/1842/1844/1846/1849/1852/1853 (M286-M295 PR trail). | (rolled-up) | (multi-PR) |
 | **M280** | **Phase 6 closeout — 1.5B zero-baseline harness validation + CCPA project SUSPENSION declaration pending aprender#1789** — operator-directed closure (verbatim directive recorded inline in evidence writeup) after the M280 control-mode dispatch (PHASE6_COMPLIANCE_ENFORCED=0, fixture 1-2 confirmed in flight) replicated the M270 treatment pattern: student 0/N OraclePassed regardless of compliance regime. **The compliance_cost_ratio is mathematically `0/0 = undefined`** and **semantically means "contract compliance costs nothing if the model already can't write code"** — a successful test of the Phase 6 machinery, not a failure. The 1.5B Qwen2.5-Coder is below the floor of testability for under-contract dispatch; both treatment + control regimes produce 0% student pass rate. **Three deliverables**: (1) **`evidence/phase-6/1.5b-calibration-run.md`** (~110 LOC) at operator-specified path — official "Harness Validation / 1.5B Zero Baseline" writeup. Sections: headline conclusion (harness works; ratio undefined; below testability floor); the two dispatches (M270 treatment + M280 control, with M270 numbers final + M280 in-flight at ship time); **six "what the harness correctly handled" observations** (M266 schema drift, M268 oracle preflight, M262 pre-warm, M264 dual-path BENCH_BIN, M276 compliance toggle, 20-turn exhaustion + driver_error handling all clean); **three "what we cannot learn from 1.5B"** (agent-quality differential = 0; no recovery exercise on student; CCPA-020 vacuously satisfied); **teacher-side stochasticity caveat** (claude 182→216 turns on F1, 36→154 turns on F2 between treatment + control — `PHASE6_COMPLIANCE_ENFORCED` does NOT affect teacher dispatch so this is pure inference-time noise, not signal); **verbatim operator interpretation** quoted; **CCPA project status post-M280: OFFICIALLY SUSPENDED** pending aprender#1789. (2) **Suspension markers on 4 visible surfaces**: top spec § Status (added the SUSPENDED clause + cross-ref to evidence writeup); README.md At-a-glance table (new row: "CCPA work status: SUSPENDED at M280"); CONTRIBUTING.md status-line (suffix "; CCPA work SUSPENDED at M280 pending aprender#1789"); phase-6-results-and-next-steps.md (M278) — status header flipped to "OFFICIALLY SUSPENDED" + operator-dispatchable section opened with post-M280 status note (Step 1 done, Steps 2-3 deferred). (3) This milestones-m101-m111.md M280 row + status-snapshots.md M280 entry + Run 1 history extension to M280. **What this M-row does NOT do**: does NOT abandon the project (the meter is mechanically complete + publication-ready per [phase-6-results-and-next-steps.md](phase-6-results-and-next-steps.md)); does NOT preclude un-suspension after aprender#1789 ships; does NOT stop the in-flight M280 control bench (operator directive: "Let the control bench finish" — the M280 writeup will get an addendum with final control numbers once it lands). **Why suspend now**: per operator, "You have extracted every drop of useful signal the 1.5B model can give you. The harness works. The baseline is zero. The only way to measure a meaningful compliance_cost_ratio (where the control > 0 and the treatment < control) is to use a model capable of actually solving the problems." Further substantive Phase 6 work is unblock-able only by aprender#1789 (deep Qwen3-MoE F32 routing fix). **The session ship summary**: M0-M280 SHIPPED on companion; 20/20 contract gates registered at v1.32.0; 25 spec files; 30+ fixtures across 4 corpora (canonical / regression / calibration-and-scale / under-contract); 5 aprender PRs (4 merged, #1789 OPEN deep architectural). No new code in crates/, no schema bump, no contract YAML bump at M280. M-counter bumped M278 → M280 (M279 was M278-row mechanical refresh via `f643183`). **Spec file count unchanged at 25**; **evidence file count +1** (`evidence/phase-6/1.5b-calibration-run.md`). | `f69fe23` | #248 |
 | **M278** | **Phase 6 results-and-next-steps synthesizing doc** — new spec file `docs/specifications/phase-6-results-and-next-steps.md` (124 lines, well within ≤500 cap) authored as the canonical publication-ready synthesis of the Phase 6 arc + the honest follow-up agenda. **Sections**: (1) **Executive summary** — one paragraph capturing the operator-directive M250 framing through the M276 control-mode mechanism. (2) **What was measured cleanly** — three substantive findings: turn-cost ratio (~13-15×), recovery rate (35%, mechanism falsifier NOT triggered), bench machinery soundness (P6.1-P6.6 ran end-to-end against new model + new corpus + post-M266 fixed schema without harness bugs). (3) **What was NOT measured cleanly** — three honest gaps: apples-to-apples cost ratio (cross-corpus, not same-corpus; M276 control-mode mechanism ready), non-vacuous CCPA-020 evidence (teacher one-shot bypass + student-side 0/20 → invariant vacuously satisfied), student-side under-contract data (1.5B Qwen too unstable). (4) **Operator-dispatchable next steps in priority order**: Step 1 cheap-now (`PHASE6_COMPLIANCE_ENFORCED=0 bash scripts/phase-6-bench.sh` produces clean falsifier evidence in ~7hr); Step 2 model-acquisition (download Qwen2.5-Coder-7B for non-vacuous CCPA-020 evidence); Step 3 await-aprender#1789 (Qwen3-Coder-30B-MoE under-contract = full Axis 2/3 closure). (5) **Cross-references** — every relevant spec file + evidence file + aprender PR. (6) **Publication readiness** — explicit statement that the honest-disclosure form is canonical for publication. **Why M278 is substantive (not mechanical)**: the synthesis didn't exist anywhere before; it pulls together M250-M276 across plan / design-audit / evidence / scripts / contract YAML into a single publication-ready entry point. Operator + future maintainers + a publication audience can read ONE doc to understand the Phase 6 arc + its honest limits + the path forward. **No new code in crates/**, no schema bump, no contract YAML bump. M-counter bumped M276 → M278 (M277 was M276-row mechanical refresh via `00c9c67`). **Spec file count bumped 24 → 25** — new phase-6-results-and-next-steps.md (124 lines). | `f351d8e` | #247 |
 | **M276** | **Phase 6 bench control mode + analyzer apples-to-apples ratio** — `PHASE6_COMPLIANCE_ENFORCED` env-var toggle on `scripts/phase-6-bench.sh` lets the operator dispatch the SAME corpus + model + budgets WITHOUT `--compliance-enforced` for the clean falsifier control baseline (per [phase-6-design-audit.md § 4](phase-6-design-audit.md)). **Two-file change**: (1) **`scripts/phase-6-bench.sh`** — new `PHASE6_COMPLIANCE_ENFORCED="${PHASE6_COMPLIANCE_ENFORCED:-1}"` env var: `=1` (default) writes treatment evidence to `evidence/under-contract/` + passes `--compliance-enforced --max-consecutive-compliance-failures=N` to `ccpa-arena-bench`; `=0` writes control evidence to `evidence/under-contract-control/` + DOES NOT pass the compliance flags (apr code runs raw against the same fixtures). Header echo prints the active mode label (`under-contract (treatment, --compliance-enforced active)` vs `control baseline (--compliance-enforced DISABLED for apples-to-apples)`). New `bench_mode` + `compliance_enforced` fields written into `scores.json` so each evidence set is self-describing. Also fixed the M274-noted `scores.json::corpus` field copy-paste bug: `fixtures/calibration-and-scale/` → `fixtures/under-contract/`. (2) **`scripts/analyze-under-contract-scores.sh`** — analyzer's `compliance_cost_ratio` section now prefers apples-to-apples (treatment / control) when `evidence/under-contract-control/scores.json` exists, computes both teacher AND student ratios, references the design-audit § 4 falsifier; falls back to cross-corpus M260 comparison with an explicit "NOT apples-to-apples" warning + a one-line dispatch hint pointing at `PHASE6_COMPLIANCE_ENFORCED=0`. **Empirically verified**: `bash -n scripts/phase-6-bench.sh` clean; control-mode dry-run shows header label correct; analyzer in cross-corpus mode prints the warning + dispatch hint (as expected, since no control evidence exists yet). **Path forward to clean falsifier**: operator dispatches `PHASE6_COMPLIANCE_ENFORCED=0 bash scripts/phase-6-bench.sh` (~7hr wall same as treatment) → produces `evidence/under-contract-control/scores.json` → analyzer's next run reads BOTH and prints the clean ratio. **No new code in crates/**, no schema bump on `ccpa_trace`, no contract YAML bump at M276. M-counter bumped M274 → M276 (M275 was M274-row mechanical refresh via `7e27726`). **Spec file count unchanged**: 24. **Why M276 is substantive (not mechanical)**: introduces a new operational mode (control vs treatment) that's the canonical falsifier methodology per design-audit § 4; bench-script API surface grows by 1 env var + 2 new scores.json fields. | `b406a38` | #246 |

diff --git a/docs/specifications/status-snapshots.md b/docs/specifications/status-snapshots.md