Merged
3 changes: 3 additions & 0 deletions .gitignore
@@ -31,3 +31,6 @@ mutants.out.old/

# mdBook output (book/src is committed; book/book is generated)
/book/book/

+# Claude Code runtime scheduling artifacts (not project content)
+/.claude/
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -255,7 +255,7 @@ is the project blueprint. Major behavioral changes update both:
1. The contract's `status_history` (factual, machine-readable record).
2. The spec markdown (narrative + milestone roll-up).

-Status as of v1.32.0 (2026-05-19): M0–M280 all SHIPPED; corpus complete; CCPA work SUSPENDED at M280 pending aprender#1789
+Status as of v1.32.0 (2026-05-22): M0–M296 all SHIPPED; corpus complete; V1_004 still open; project in clean three-month-break handoff state at M296 (see evidence/phase-6/m296-three-month-break-closeout-2026-05-22.md)
(30/30 API + 4 OS); 18/18 gates registered (16 ACTIVE_RUNTIME-track + 2 PROPOSED at v1.29.0 — CCPA-017 project-scale + CCPA-018 arena recovery-rate, both await operator dispatch); companion ↔ aprender round-trip
mechanically guarded. **M32d numerical-parity FUNCTIONALLY DISCHARGED**
2026-05-02 (aprender PR #1228 squash 5235aaeb9): output transition
2 changes: 1 addition & 1 deletion README.md
@@ -40,7 +40,7 @@ mdbook build book/ && open book/book/index.html
| Canonical corpus aggregate | **1.0000** | 30 | M150 — `fixtures/canonical/measured-parity.json` |
| Contract version | v1.32.0 | — | aprender-side authoritative |
| Falsification gates | 20/20 registered | — | 16 ACTIVE_RUNTIME + 4 PROPOSED |
-| Sub-milestones shipped | M0–M280 all SHIPPED | — | continuous since 2026-04-26 |
+| Sub-milestones shipped | M0–M296 all SHIPPED | — | continuous since 2026-04-26 |

**Honest framing**: at function-scale, the two systems are functionally interchangeable. At project-scale, the static-fixture approach is **Popperian-falsified** as a project-scale predictor. Contract at v1.32.0 (aprender-side authoritative). The full 3-axis breakdown is in [`docs/specifications/completeness-assessment.md`](docs/specifications/completeness-assessment.md).

4 changes: 2 additions & 2 deletions docs/specifications/claude-code-parity-apr-poc.md

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/specifications/completeness-assessment.md
@@ -1,10 +1,10 @@
-# Completeness assessment (2026-05-18, post-M266)
+# Completeness assessment (2026-05-22, post-M296)

[Top spec: claude-code-parity-apr-poc.md](claude-code-parity-apr-poc.md) | [Risks](risks.md) | [Status snapshots](status-snapshots.md) | [Axis-2 closure plan](axis-2-closure-plan.md) | [Design audit](design-audit.md)

Authored at M111 in response to operator's "what percentage complete is it" question. Foregrounds the M2.3 OOS-rescope's unaddressed differential-test gap that the headline "30/30 fixtures aggregate=1.0000" claim obscured. **M113 follow-up**: see [axis-2-closure-plan.md](axis-2-closure-plan.md) for the 5-idea brainstorm and recommended (2)→(3) sequence to close Axis 2 from ~30% to ~70% (CLI subprocess instrumentation + SWE-bench differential evaluation). **M118 follow-up**: [deepclaude](https://github.com/aattaran/deepclaude) prior art positively discharges idea (1)'s technical-feasibility doubt; cost re-estimated from ~1-2 weeks to ~3-7 days. **M136-M140 progress**: axis-2-closure-plan idea (2) MACHINERY is now SHIPPED end-to-end (M136 capture binary + M137 differ + M139 corpus + gate + M140 contract bump in flight). **M150-M154 progress** *(M155 update)*: Phase 3 outcome-parity path SHIPPED end-to-end on a public benchmark (MultiPL-E-Rust) — real claude + real apr code (Qwen2.5-Coder-1.5B) bilateral bench produced **outcome parity = 1.0000 (5/5)**, **structural similarity = 0.5201 line-set Jaccard**, and **test-survival = 1.0000 (10/10 cross-swaps)** — proving the two systems are functionally interchangeable on this POC corpus. **Axis 2 score now moves from ~50% → ~70%**: meter machinery for OS-level differential testing is built AND real-binary execution evidence is shipped. See § "Are we at parity with Claude Code?" below for the M155 honest current state.

-## Completeness assessment (2026-05-18, post-M266)
+## Completeness assessment (2026-05-22, post-M296)

The headline numbers (M0–M244 SHIPPED, 19/19 gates registered (16 ACTIVE_RUNTIME-track + 3 PROPOSED at companion v1.31.0: CCPA-017 project-scale, CCPA-018 arena recovery-rate, CCPA-019 calibration-required-before-verdict), 30/30 API-level fixtures aggregate=1.0000, 21-fixture MultiPL-E-Rust outcome-parity corpus + 2 validation layers (M172 structural + M174 deep) + pre-commit hook integration (M176), 5-fixture project-scale corpus at `fixtures/project-scale/` (M182 + M188 P4.x), 15-fixture calibration-and-scale corpus at `fixtures/calibration-and-scale/` (M242), Phase 5 Arena harness end-to-end SHIPPED (M196-M210), Branch B harness rework SHIPPED (M234 + M236-M244), companion contract v1.31.0 — aprender v1.30.0 awaiting v1.31.0 catch-up via aprender#1778) are technically true but require the Branch B caveat below. Honest 3-axis breakdown:
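Reviewer note: the "structural similarity = 0.5201 line-set Jaccard" figure quoted in this file may be easier to read with a small worked example. The sketch below is illustrative only and is not taken from the repository. The helper names `line_set` and `line_set_jaccard` are hypothetical, and the normalization (strip surrounding whitespace, ignore blank lines) is an assumption, since the diff does not say how the project normalizes lines before comparing them.

```python
def line_set(source: str) -> set[str]:
    """Return the distinct non-blank lines of `source`, stripped of surrounding whitespace.

    Assumption: this normalization is a guess for illustration, not the project's rule.
    """
    return {line.strip() for line in source.splitlines() if line.strip()}


def line_set_jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| computed over the two line sets."""
    set_a, set_b = line_set(a), line_set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty inputs are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Two functionally equivalent Rust snippets that differ in one line.
teacher = "fn add(a: i32, b: i32) -> i32 {\n    a + b\n}\n"
student = "fn add(a: i32, b: i32) -> i32 {\n    return a + b;\n}\n"

score = line_set_jaccard(teacher, student)
print(f"{score:.4f}")  # 2 shared lines out of 4 distinct lines
```

The example shows why a mid-range structural score can coexist with outcome parity of 1.0000: both snippets pass the same tests, yet only half of their distinct lines match.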

2 changes: 2 additions & 0 deletions docs/specifications/milestones-m101-m111.md
@@ -118,6 +118,8 @@ criteria authored at M101:

| ID | Deliverable | Squash | PR |
|----|-------------|--------|-----|
+| **M296** | **Three-month operator-directed break closeout** — V1_004 chain wraps with V1_004 still open but empirically narrowed. Session shipped 12 PRs (7 CCPA + 5 aprender) spanning M286-M295 + this M296 closeout. **The story**: M280 SUSPENSION un-blocked at M286 when aprender#1832 shipped M32d MoE KV cache (19× speedup). M287 greedy baseline confirmed Qwen3-Coder-30B-A3B `driver_error` pattern. M288-M290 5 aprender PRs (sampling + EOS + clean_chat_output + few-shot in CODE_SYSTEM_PROMPT) fixed three infrastructure gaps. M291 sub-bench B pattern shifted from `driver_error` to `oracle_failed_after_max_turns` with `tool_use_count: 0` — revealing the agent-quality bottleneck. M292 shipped `ArenaOutcome::AgentTextLoop` detector + 7 tests as Gap 3 closure. M293 wired `PHASE6_MAX_CONSECUTIVE_TEXT_TURNS` env-var. M294 scoped + dispatched the non-Coder Qwen3-30B-A3B-Instruct-2507 A/B; smoke confirmed clean tool_call JSON emission in 20 tokens. M295 shipped professional README + 28-chapter mdBook + GitHub Pages auto-deploy (now live at https://paiml.github.io/claude-code-parity-apr/). **Bench-level partial refutation**: F1 of the non-Coder Instruct bench produced `driver_error` at turn 8, tool_use_count=0, 8 Markdown turns — same pattern as Coder family. The smoke-vs-bench divergence surfaces a second-order constraint: apr code's multi-turn prompt context (rendered history with previous turn's Markdown + "### Continue:" suffix) self-recursively reinforces the Markdown distribution even on a finetune that emits tool_call JSON in 1-shot smoke. **Three resumption paths scoped** in `evidence/phase-6/m296-three-month-break-closeout-2026-05-22.md`: (a) investigate `render_history` + per-turn prompt construction, (b) post-decode Markdown→tool_call parser in apr code (unlocks Qwen-Coder family for V1_004 as written), (c) V1_005 against different model class on Lambda Labs A100/H100 (Llama-3.3-70B, DeepSeek-V3, Qwen3-32B-Instruct dense). **Project handoff state**: no in-flight benches, no orphan processes, 5 partial evidence archives captured (`evidence/under-contract-*partial-*`), book deployed, M-counter bumped 5 surfaces (README, CONTRIBUTING, top spec, status-snapshots, milestones). No new code in crates/, no schema bump, no contract YAML bump at M296. M-counter M280 → M296 (15 substantive M-rows across V1_004 chain + book + closeout). | `(this PR)` | this PR |
+| **M286-M295** | **The V1_004 chain (12-PR session, 2026-05-20 through 2026-05-22)** — full empirical isolation of the Qwen-Coder finetune-distribution variable. Per-PR narrative captured in `evidence/phase-6/m296-three-month-break-closeout-2026-05-22.md`. Cross-references: `evidence/phase-6/m32d-shipped-2026-05-20.md` (M286), `m32d-bench-pattern-2026-05-20.md` (M287), `v1004-3knob-dispatch-recipe-2026-05-20.md` (M288), `v1004-3knob-plumbing-shipped-2026-05-20.md` (M289), `v1004-followup-snapshot-2026-05-20.md` (M290), `v1004-sub-bench-b-pattern-shift-2026-05-21.md` (M291), `v1004-agent-text-loop-detector-2026-05-21.md` (M292), CCPA#259/260/261/262/263 + aprender#1832/1837/1842/1844/1846/1849/1852/1853 (M286-M295 PR trail). | (rolled-up) | (multi-PR) |
| **M280** | **Phase 6 closeout — 1.5B zero-baseline harness validation + CCPA project SUSPENSION declaration pending aprender#1789** — operator-directed closure (verbatim directive recorded inline in evidence writeup) after the M280 control-mode dispatch (PHASE6_COMPLIANCE_ENFORCED=0, fixture 1-2 confirmed in flight) replicated the M270 treatment pattern: student 0/N OraclePassed regardless of compliance regime. **The compliance_cost_ratio is mathematically `0/0 = undefined`** and **semantically means "contract compliance costs nothing if the model already can't write code"** — a successful test of the Phase 6 machinery, not a failure. The 1.5B Qwen2.5-Coder is below the floor of testability for under-contract dispatch; both treatment + control regimes produce 0% student pass rate. **Three deliverables**: (1) **`evidence/phase-6/1.5b-calibration-run.md`** (~110 LOC) at operator-specified path — official "Harness Validation / 1.5B Zero Baseline" writeup. Sections: headline conclusion (harness works; ratio undefined; below testability floor); the two dispatches (M270 treatment + M280 control, with M270 numbers final + M280 in-flight at ship time); **six "what the harness correctly handled" observations** (M266 schema drift, M268 oracle preflight, M262 pre-warm, M264 dual-path BENCH_BIN, M276 compliance toggle, 20-turn exhaustion + driver_error handling all clean); **three "what we cannot learn from 1.5B"** (agent-quality differential = 0; no recovery exercise on student; CCPA-020 vacuously satisfied); **teacher-side stochasticity caveat** (claude 182→216 turns on F1, 36→154 turns on F2 between treatment + control — `PHASE6_COMPLIANCE_ENFORCED` does NOT affect teacher dispatch so this is pure inference-time noise, not signal); **verbatim operator interpretation** quoted; **CCPA project status post-M280: OFFICIALLY SUSPENDED** pending aprender#1789. 
(2) **Suspension markers on 4 visible surfaces**: top spec § Status (added the SUSPENDED clause + cross-ref to evidence writeup); README.md At-a-glance table (new row: "CCPA work status: SUSPENDED at M280"); CONTRIBUTING.md status-line (suffix "; CCPA work SUSPENDED at M280 pending aprender#1789"); phase-6-results-and-next-steps.md (M278) — status header flipped to "OFFICIALLY SUSPENDED" + operator-dispatchable section opened with post-M280 status note (Step 1 done, Steps 2-3 deferred). (3) This milestones-m101-m111.md M280 row + status-snapshots.md M280 entry + Run 1 history extension to M280. **What this M-row does NOT do**: does NOT abandon the project (the meter is mechanically complete + publication-ready per [phase-6-results-and-next-steps.md](phase-6-results-and-next-steps.md)); does NOT preclude un-suspension after aprender#1789 ships; does NOT stop the in-flight M280 control bench (operator directive: "Let the control bench finish" — the M280 writeup will get an addendum with final control numbers once it lands). **Why suspend now**: per operator, "You have extracted every drop of useful signal the 1.5B model can give you. The harness works. The baseline is zero. The only way to measure a meaningful compliance_cost_ratio (where the control > 0 and the treatment < control) is to use a model capable of actually solving the problems." Further substantive Phase 6 work is unblock-able only by aprender#1789 (deep Qwen3-MoE F32 routing fix). **The session ship summary**: M0-M280 SHIPPED on companion; 20/20 contract gates registered at v1.32.0; 25 spec files; 30+ fixtures across 4 corpora (canonical / regression / calibration-and-scale / under-contract); 5 aprender PRs (4 merged, #1789 OPEN deep architectural). No new code in crates/, no schema bump, no contract YAML bump at M280. M-counter bumped M278 → M280 (M279 was M278-row mechanical refresh via `f643183`). 
**Spec file count unchanged at 25**; **evidence file count +1** (`evidence/phase-6/1.5b-calibration-run.md`). | `f69fe23` | #248 |
| **M278** | **Phase 6 results-and-next-steps synthesizing doc** — new spec file `docs/specifications/phase-6-results-and-next-steps.md` (124 lines, well within ≤500 cap) authored as the canonical publication-ready synthesis of the Phase 6 arc + the honest follow-up agenda. **Sections**: (1) **Executive summary** — one paragraph capturing the operator-directive M250 framing through the M276 control-mode mechanism. (2) **What was measured cleanly** — three substantive findings: turn-cost ratio (~13-15×), recovery rate (35%, mechanism falsifier NOT triggered), bench machinery soundness (P6.1-P6.6 ran end-to-end against new model + new corpus + post-M266 fixed schema without harness bugs). (3) **What was NOT measured cleanly** — three honest gaps: apples-to-apples cost ratio (cross-corpus, not same-corpus; M276 control-mode mechanism ready), non-vacuous CCPA-020 evidence (teacher one-shot bypass + student-side 0/20 → invariant vacuously satisfied), student-side under-contract data (1.5B Qwen too unstable). (4) **Operator-dispatchable next steps in priority order**: Step 1 cheap-now (`PHASE6_COMPLIANCE_ENFORCED=0 bash scripts/phase-6-bench.sh` produces clean falsifier evidence in ~7hr); Step 2 model-acquisition (download Qwen2.5-Coder-7B for non-vacuous CCPA-020 evidence); Step 3 await-aprender#1789 (Qwen3-Coder-30B-MoE under-contract = full Axis 2/3 closure). (5) **Cross-references** — every relevant spec file + evidence file + aprender PR. (6) **Publication readiness** — explicit statement that the honest-disclosure form is canonical for publication. **Why M278 is substantive (not mechanical)**: the synthesis didn't exist anywhere before; it pulls together M250-M276 across plan / design-audit / evidence / scripts / contract YAML into a single publication-ready entry point. Operator + future maintainers + a publication audience can read ONE doc to understand the Phase 6 arc + its honest limits + the path forward. 
**No new code in crates/**, no schema bump, no contract YAML bump. M-counter bumped M276 → M278 (M277 was M276-row mechanical refresh via `00c9c67`). **Spec file count bumped 24 → 25** — new phase-6-results-and-next-steps.md (124 lines). | `f351d8e` | #247 |
| **M276** | **Phase 6 bench control mode + analyzer apples-to-apples ratio** — `PHASE6_COMPLIANCE_ENFORCED` env-var toggle on `scripts/phase-6-bench.sh` lets the operator dispatch the SAME corpus + model + budgets WITHOUT `--compliance-enforced` for the clean falsifier control baseline (per [phase-6-design-audit.md § 4](phase-6-design-audit.md)). **Two-file change**: (1) **`scripts/phase-6-bench.sh`** — new `PHASE6_COMPLIANCE_ENFORCED="${PHASE6_COMPLIANCE_ENFORCED:-1}"` env var: `=1` (default) writes treatment evidence to `evidence/under-contract/` + passes `--compliance-enforced --max-consecutive-compliance-failures=N` to `ccpa-arena-bench`; `=0` writes control evidence to `evidence/under-contract-control/` + DOES NOT pass the compliance flags (apr code runs raw against the same fixtures). Header echo prints the active mode label (`under-contract (treatment, --compliance-enforced active)` vs `control baseline (--compliance-enforced DISABLED for apples-to-apples)`). New `bench_mode` + `compliance_enforced` fields written into `scores.json` so each evidence set is self-describing. Also fixed the M274-noted `scores.json::corpus` field copy-paste bug: `fixtures/calibration-and-scale/` → `fixtures/under-contract/`. (2) **`scripts/analyze-under-contract-scores.sh`** — analyzer's `compliance_cost_ratio` section now prefers apples-to-apples (treatment / control) when `evidence/under-contract-control/scores.json` exists, computes both teacher AND student ratios, references the design-audit § 4 falsifier; falls back to cross-corpus M260 comparison with an explicit "NOT apples-to-apples" warning + a one-line dispatch hint pointing at `PHASE6_COMPLIANCE_ENFORCED=0`. **Empirically verified**: `bash -n scripts/phase-6-bench.sh` clean; control-mode dry-run shows header label correct; analyzer in cross-corpus mode prints the warning + dispatch hint (as expected, since no control evidence exists yet). 
**Path forward to clean falsifier**: operator dispatches `PHASE6_COMPLIANCE_ENFORCED=0 bash scripts/phase-6-bench.sh` (~7hr wall same as treatment) → produces `evidence/under-contract-control/scores.json` → analyzer's next run reads BOTH and prints the clean ratio. **No new code in crates/**, no schema bump on `ccpa_trace`, no contract YAML bump at M276. M-counter bumped M274 → M276 (M275 was M274-row mechanical refresh via `7e27726`). **Spec file count unchanged**: 24. **Why M276 is substantive (not mechanical)**: introduces a new operational mode (control vs treatment) that's the canonical falsifier methodology per design-audit § 4; bench-script API surface grows by 1 env var + 2 new scores.json fields. | `b406a38` | #246 |
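Reviewer note: the M276 and M280 rows together describe a control/treatment toggle and a ratio that becomes undefined when the control baseline is zero. The sketch below restates that logic in Python for readers who want to see it in one place. It is not the code in `scripts/phase-6-bench.sh` or `scripts/analyze-under-contract-scores.sh`. The function names, the `"treatment"`/`"control"` label strings, the `max_failures` parameter (standing in for the `N` in the row), and the choice of pass rate as the ratio's input are all assumptions made for illustration. Only the env-var name, its default of `1`, the two evidence directories, and the two flag names come from the rows above.

```python
from typing import Optional


def bench_mode(env: dict, max_failures: int) -> dict:
    """Pick treatment or control from PHASE6_COMPLIANCE_ENFORCED.

    Unset or empty falls back to "1", mirroring the shell default ${VAR:-1}.
    Assumption: only the literal "0" selects control mode.
    """
    raw = env.get("PHASE6_COMPLIANCE_ENFORCED") or "1"
    if raw == "0":
        return {
            "bench_mode": "control",  # hypothetical label value
            "compliance_enforced": False,
            "evidence_dir": "evidence/under-contract-control/",
            "flags": [],  # control runs omit the compliance flags
        }
    return {
        "bench_mode": "treatment",  # hypothetical label value
        "compliance_enforced": True,
        "evidence_dir": "evidence/under-contract/",
        "flags": [
            "--compliance-enforced",
            f"--max-consecutive-compliance-failures={max_failures}",
        ],
    }


def compliance_cost_ratio(treatment: float, control: float) -> Optional[float]:
    """Return treatment / control, or None when the control baseline is zero.

    Assumption: the inputs are pass rates in [0, 1]; the diff does not name the metric.
    """
    if control == 0:
        return None  # includes the 0/0 case reported for the 1.5B model
    return treatment / control


default_run = bench_mode({}, max_failures=3)
control_run = bench_mode({"PHASE6_COMPLIANCE_ENFORCED": "0"}, max_failures=3)
print(default_run["evidence_dir"], control_run["evidence_dir"])
print(compliance_cost_ratio(0.0, 0.0))    # undefined baseline
print(compliance_cost_ratio(0.25, 0.5))   # a measurable ratio
```

Returning `None` instead of raising keeps the zero-baseline case explicit, which matches the M280 reading that a 0/0 result is a property of the model under test and not a harness fault.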
6 changes: 5 additions & 1 deletion docs/specifications/status-snapshots.md

Large diffs are not rendered by default.
