# ─────────────────────────────────────────────────────────────────────────────
# claude-code-parity-apr-v1
#
# SOURCE OF TRUTH for the runtime, behavioral parity harness between Claude
# Code (teacher) and `apr code` (student). Pins:
# - the `.ccpa-trace.jsonl` schema (recorded action stream)
# - replay determinism semantics
# - tool-call + file-mutation equivalence rules
# - sovereignty constraint (no api.anthropic.com on replay)
# - aggregate parity-score bound
#
# Sibling contracts (axes that are NOT this contract):
# - apr-code-parity-v1.yaml — STATIC feature matrix (does the symbol exist)
# - apr-claude-proxy-v1.yaml — HTTP/SSE Messages-API request/response shape
# - apr-mcp-server-v1.yaml — MCP JSON-RPC protocol surface
# This contract is the fourth leg: behavioral, runtime, fixture-driven.
#
# Direction: YAML IS AUTHORITATIVE. The trace serializer (ccpa-trace), the
# differ (ccpa-differ) and the replay driver (ccpa-replayer) all consume this
# file. Hand-editing schema or rule shape in Rust without first updating this
# YAML is rejected by the falsification suite once M6 lands.
#
# Spec: docs/specifications/claude-code-parity-apr-poc.md
# Repo (planned, downstream): https://github.com/paiml/claude-code-parity-apr
# ─────────────────────────────────────────────────────────────────────────────
metadata:
version: 1.2.0
created: '2026-04-26'
last_modified: '2026-04-27'
author: PAIML Engineering
# `pv validate` reads ContractKind from metadata.kind (see
# crates/aprender-contracts/src/schema/kind.rs::ContractKind). This is a
# cross-cutting parity/audit pattern, not a mathematical kernel — so the
# schema-dispatch kind is `pattern`. The domain kind is preserved under
# metadata.behavioral_parity_kind for documentation. Same approach as
# apr-code-parity-v1.yaml and apr-claude-proxy-v1.yaml.
kind: pattern
behavioral_parity_kind: ClaudeCodeBehavioralParityContract
description: >
Falsifiable runtime-parity harness between Claude Code (teacher) and
`apr code` (student). Captures Claude Code as a recorded action stream
via an HTTPS proxy at ANTHROPIC_BASE_URL, replays the same prompts to
`apr code` with mocked LLM responses (so orchestration is the only thing
under test), and gates the diff under eight falsification conditions
covering schema, determinism, mock completeness, tool-call equivalence,
file-mutation equivalence, sovereignty, corpus coverage and parity-score.
references:
- 'docs/specifications/claude-code-parity-apr-poc.md (this contract''s spec)'
- 'contracts/apr-code-parity-v1.yaml — sibling static feature matrix'
- 'contracts/apr-claude-proxy-v1.yaml — sibling Messages-API shape contract'
- 'crates/aprender-orchestrate/contracts/batuta/apr-code-v1.yaml — agent-loop semantics'
- 'CLAUDE.md § "Contract Validation: DOGFOOD pv, NEVER bash"'
- 'memory: feedback_monorepo_single_source_of_truth.md (downstream-consumer pattern)'
- 'memory: feedback_pv_not_bash_for_contracts.md (every gate flows through pv)'
- 'Anthropic Messages API — https://docs.anthropic.com/en/api/messages'
- 'Hinton et al. 2015 — Distilling the Knowledge in a Neural Network (action-stream variant)'
spec: docs/specifications/claude-code-parity-apr-poc.md
epic: PMAT-CCPA-PARITY-001 # to be opened on M0 merge
related_contracts:
- contracts/apr-code-parity-v1.yaml
- contracts/apr-claude-proxy-v1.yaml
- crates/aprender-orchestrate/contracts/batuta/apr-code-v1.yaml
name: claude-code-parity-apr
version: "1.29.0"
status: ACTIVE_RUNTIME
# Gate registry: 18/18 gates registered.
#   - 4 with status: ACTIVE_RUNTIME (CCPA-013/014/015/016, the
#     runtime-evidence + outcome-parity track)
#   - 2 with status: PROPOSED (CCPA-017 project-scale parity + CCPA-018 Arena
#     recovery-rate, both awaiting first operator-dispatched bench to flip
#     ACTIVE_RUNTIME at v1.30.0)
#   - rest at PLANNED_M*/IN_REVIEW/HARD_BLOCKING_M16 per their lifecycle phase
# No OPEN residue.
#
# v1.29.0 (companion-repo M194-M206 Phase 5 sequence, 2026-05-15): adds
# FALSIFY-CCPA-018 (arena_recovery_rate_bound) to the gate registry. Phase 5
# operationalizes design-audit.md (M192 operator-authored) R2 + R3
# recommendations: a live multi-turn execution harness (crates/ccpa-arena/)
# where the agent gets bash/test feedback per turn and must recover from
# failures.
#   - M196 P5.1 scaffolding shipped the ArenaSession + ArenaDriver +
#     OracleCmd + TurnRecord types.
#   - M200 P5.2 shipped the real multi-turn loop body
#     (crates/ccpa-arena/src/dispatch.rs with Bash/Read/Write/Edit dispatch
#     via std::process::Command + std::fs).
#   - M202 P5.3 shipped SubprocessDriver + bin/ccpa-arena-bench (clap CLI) +
#     scripts/phase-5-arena-bench.sh (operator-dispatch wrapper analogous to
#     phase-4-bench.sh).
#   - M204 P5.4 shipped the CCPA-018 gate test (asserts recovery_rate >= 0.5
#     AND oracle_passed_rate >= 0.3 with bidirectional sensitivity verified
#     via synthetic identity/regression/give-up-fast fixtures; the asymmetric
#     give-up-fast test is the canonical R3 distinguishing case: 100% pass
#     rate BUT zero recovery FAILS the gate).
#   - M206 P5.5 shipped the falsifier-of-falsifier comparator
#     (crates/ccpa-arena/src/falsifier.rs + evaluate_static_vs_arena()
#     returning FalsifierVerdict with StaticFalsified/StaticValidated/
#     Inconclusive outcomes per design-audit.md §5's Popperian test).
# CCPA-018 enters at status: PROPOSED because no operator-dispatched Arena
# bench has produced evidence/phase-5/arena-scores.json yet; the live-evidence
# test is #[ignore]'d until that exists. Threshold values (0.5 recovery /
# 0.3 oracle) are tentative POC-tier floors; they WILL be recalibrated after
# first operator dispatch. Phase 5 is distinct from Phase 4 (CCPA-017):
# CCPA-017 measures FUNCTIONAL OUTCOME (does the code work?); CCPA-018
# measures AGENT QUALITY (does the agent recover when bash fails?).
#
# v1.28.0 (companion-repo M180-M188 Phase 4 sequence, 2026-05-15): adds
# FALSIFY-CCPA-017 (project_scale_parity_bound) to the gate registry. Phase 4
# operationalizes the M159 ProgramBench prior-art (arXiv:2605.03546, 0%/200
# SOTA baseline) into companion-tier project-scale parity testing:
#   - the M182 corpus draws 5 fixtures from real open GitHub issues across
#     paiml/decy + paiml/bashrs + paiml/depyler with pinned pre-fix commit
#     SHAs;
#   - the M184 runner (scripts/phase-4-bench.sh, 288 lines bash) clones at the
#     pinned SHA, dispatches each system with timeout APR_TIMEOUT_S (default
#     900s), snapshots diff vs SHA, runs the per-fixture oracle_cmd;
#   - the M186 scorer (crates/ccpa-differ/src/project_scale_diff.rs, ~310
#     lines Rust) lifts the runner JSON into ProjectScaleParityReport with 5
#     derived metrics (per-fixture: approach_match + lines_edited_ratio;
#     corpus-level: partial_agreement + files_jaccard_corpus +
#     approach_match_rate);
#   - the M188 gate test
#     (crates/ccpa-differ/tests/falsify_ccpa_017_project_scale_parity.rs,
#     ~260 lines, 7 active + 1 #[ignore]'d) asserts partial_agreement >= 0.3
#     AND files_jaccard_corpus >= 0.3 with bidirectional sensitivity verified
#     on synthetic identity (passes) and synthetic regression (fails)
#     fixtures.
# CCPA-017 enters at status: PROPOSED because no operator-dispatched
# measurement has produced evidence/phase-4/project-scale-scores.json yet; the
# live-evidence test is #[ignore]'d until that exists. Threshold values
# (0.3/0.3) are tentative POC-tier floors; they WILL be recalibrated after
# first operator dispatch. Phase 4 is the SIGNAL regime, not the SATURATION
# regime: a CCPA-016-style "agreement = 1.0" result is implausible at
# project-scale per ProgramBench evidence; the goal is "do both systems make
# matching partial progress?" not "do both systems fully succeed?".
#
# v1.27.0 (companion-repo M167, 2026-05-14): flips FALSIFY-CCPA-013
# (first_recorded_parity_score) from `status: OPEN` → `status: ACTIVE_RUNTIME`.
# The gate's assertion has been satisfied since v1.1.0 (3 measured_parity
# blocks dating 2026-04-27 against `fixtures/canonical/` with
# aggregate_score = 1.0000), but the gate-level status field was never
# flipped; this revision corrects that stale prose. Also extends the
# assertion's `fixture_corpus_path` constraint to accept EITHER
# `fixtures/canonical/` (AUTHORED, since v1.2.0) OR
# `evidence/phase-3/captures/` (REAL-BINARY bilateral bench, companion-repo
# M150: claude 2.1.139 + apr 0.32.0 + Qwen2.5-Coder-1.5B-Instruct-Q4_K_M,
# agreement = 1.0000 on MultiPL-E-Rust HumanEval/0..4). Adds a 4th
# measured_parity block under CCPA-013 recording M150's real-binary evidence
# as the strongest empirical discharge anchor. **CCPA-013 was the last gate
# stuck at `status: OPEN`**; its flip closes the OPEN residue.
#
# v1.26.0 (companion-repo M147+M152+M162 Phase 3 sequence, 2026-05-13): adds
# FALSIFY-CCPA-015 (ccpa_trace_subproc_output_purity) AND FALSIFY-CCPA-016
# (outcome_parity_bound) to the gate registry. CCPA-015 was authored at M147
# via provable-contract design (falsifying test FIRST, fix via Stdio::null())
# for the ccpa-trace-subproc capture binary; PROPOSED in v1.25.0, promoted
# ACTIVE_RUNTIME here. CCPA-016 is the Phase 3 P3.4 outcome-parity gate
# authored at M152; it asserts aggregate agreement >= 0.5 on a
# MultiPL-E-Rust-class corpus with bidirectional sensitivity (synthetic
# regression fixture fails threshold; synthetic identity passes). CCPA-016 was
# empirically validated at M150 (real bilateral bench produced
# agreement = 1.0000 on 5/5 HumanEval/0..4 with real claude 2.1.139 + real
# apr code 0.32.0 via Qwen2.5-Coder-1.5B-Instruct-Q4_K_M). The companion-repo
# M162 row records that aprender#1638 MERGED upstream at squash b61b76b4
# (2026-05-13), un-gating apr code from `--features code` so
# `cargo install apr-cli` ships it by default; the Axis 3 LlmDriver-adapter
# discharge is FULLY confirmed.
#
# v1.25.0 (companion-repo M136-M140 axis-2-closure-plan sequence, 2026-05-11):
# adds FALSIFY-CCPA-014 (os_event_parity_bound) to the gate registry,
# completing the axis-2 closure-plan idea (2) CLI subprocess instrumentation
# track. New gate consumes ccpa_subproc::OsEvent records (M136) via
# ccpa_differ::os_event_parity (M137) and asserts canonical-corpus
# score >= 0.95 + bidirectional sensitivity on regression corpus (M139).
#
# v1.24.0 (companion-repo M128-M131 sequence, 2026-05-10): bumped from v1.23.0
# to integrate the M109 cosine-vs-HF-FP16 LIVE-DISCHARGE (cos_sim 0.995384 ≥
# 0.99 on lambda-vector RTX 4090, 2026-05-09; aprender PR #1597 squash
# 3fb04ef86 flipped `qwen3-moe-forward-v1` v1.4.0 ACTIVE_ALGORITHM_LEVEL →
# v1.5.0 ACTIVE_RUNTIME). Discharges the v1.23.0 status-prose claim "Cosine vs
# HF FP16 remains operator-confirm pending ~60 GB HF download": the FP16
# weights had been on lambda-vector at
# /mnt/nvme-raid0/models/Qwen3-Coder-30B-A3B-Instruct/ (57 GB / 16 safetensors
# shards) for ~7 days; the "60 GB download" blocker was stale by 62 days.
#
# v1.23.0 (M35 M32d discharge audit-trail bump): records the 4-bug stack
# landed on aprender main as commit 5235aaeb9 (#1228) plus diagnostic surface
# PRs #1222 (Step 2), #1226 (Step 2.5), #1401 (Step 2 JSON wire). M32d
# gibberish output ("%%%%%%%%") converted to coherent English answers across
# math/geography/translation/code domains. M34 FAST PATH 5-whys plan delivered
# at lucky-case bound (5 substantive PRs vs 4-6 estimated, ~6 hours wall vs
# 2-3 days). Component priors verified empirically: rank-3 Q/K RMSNorm (15%) +
# rank-4 rope_theta (10%) + chat template both correct. Cosine vs HF FP16
# formal flip **DISCHARGED 2026-05-09 at companion-repo M109**
# (apr_argmax = hf_argmax = 3555 " What"; 555ms apr-forward; HF FP16 fixture
# generated in 52s).
# ─────────────────────────────────────────────────────────────────────────────
# Top-level invariants: the 18 falsifiable gates this contract asserts.
# ─────────────────────────────────────────────────────────────────────────────
# Dual-encoded: structurally re-enumerated below under `falsification_conditions`
# with full assertion / harness / cross-check details. Listed here as a
# top-level array for (a) quick-scan summary and (b) pmat CB-1305 contract-
# surface classification (this contract classifies as `InvariantsOnly`).
invariants:
- { id: FALSIFY-CCPA-009, name: ci_main_branch_green, summary: 'companion repo branch protection requires ci/gate' }
- { id: FALSIFY-CCPA-010, name: pmat_comply_100pct, summary: 'pmat comply check reports is_compliant=true with 0 Fail-status checks' }
- { id: FALSIFY-CCPA-011, name: line_coverage_100pct, summary: 'cargo llvm-cov: 100% function coverage AND >=99% line coverage' }
- { id: FALSIFY-CCPA-012, name: pv_contract_gate_on_commit, summary: 'pre-commit hook + CI both run pv validate' }
- { id: FALSIFY-CCPA-001, name: trace_schema_roundtrip, summary: 'every fixture parses + re-serializes byte-identical' }
- { id: FALSIFY-CCPA-002, name: replay_determinism, summary: 'two replays of the same fixture produce identical student traces' }
- { id: FALSIFY-CCPA-003, name: mock_completeness, summary: 'RecordedDriver consumes every teacher turn exactly once' }
- { id: FALSIFY-CCPA-004, name: tool_call_equivalence, summary: 'student tool-calls equal teacher tool-calls under per-tool rules' }
- { id: FALSIFY-CCPA-005, name: file_mutation_equivalence, summary: 'CWD diff after replay matches CWD diff after teacher run' }
- { id: FALSIFY-CCPA-006, name: sovereignty_on_replay, summary: 'no outbound api.anthropic.com sockets during replay' }
- { id: FALSIFY-CCPA-007, name: corpus_coverage, summary: '>=1 fixture per non-MISSING row of apr-code-parity-v1.yaml' }
- { id: FALSIFY-CCPA-008, name: parity_score_bound, summary: 'aggregate parity_score >= 0.95, per-fixture >= 0.80' }
- { id: FALSIFY-CCPA-013, name: first_recorded_parity_score, summary: 'AT LEAST ONE real Claude Code ↔ apr code corpus run produced a measured parity_score recorded in status_history. Flips ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME.' }
- { id: FALSIFY-CCPA-014, name: os_event_parity_bound, summary: 'OS-level event parity (axis-2-closure-plan M115.4): macro-averaged Jaccard >= 0.95 per fixture in fixtures/os-canonical/; bidirectional-sensitivity gate on fixtures/os-regression/ (every fixture < 0.95 + non-empty drift records).' }
- { id: FALSIFY-CCPA-015, name: ccpa_trace_subproc_output_purity, summary: 'Every line emitted to stdout by ccpa-trace-subproc MUST decode as a ccpa_subproc::OsEvent JSON object. Subprocess stdout MUST NOT interleave with the capture stream (use Stdio::null() not Stdio::inherit()).' }
- { id: FALSIFY-CCPA-016, name: outcome_parity_bound, summary: 'Outcome parity (Phase 3 P3.4): aggregate agreement on a MultiPL-E-Rust-class corpus >= 0.5 (POC-tier); bidirectional-sensitivity via synthetic regression (< 0.5 → fail) + synthetic identity (1.0 → pass) fixtures.' }
- { id: FALSIFY-CCPA-017, name: project_scale_parity_bound, summary: 'Project-scale parity (Phase 4 P4.4): aggregate partial_agreement >= 0.3 AND files_jaccard_corpus >= 0.3 on a multi-file Cargo-workspace task corpus drawn from real GitHub issues (companion-repo M182). Bidirectional-sensitivity via synthetic identity (passes) + synthetic regression (fails) fixtures. PROPOSED at v1.28.0; ACTIVE_RUNTIME pending first operator-dispatched measurement.' }
- { id: FALSIFY-CCPA-018, name: arena_recovery_rate_bound, summary: 'Arena recovery-rate (Phase 5 P5.4): aggregate recovery_rate >= 0.5 AND oracle_passed_rate >= 0.3 on a multi-turn live Arena bench (companion-repo M196-M206). Measures AGENT QUALITY (does the agent recover from failed bash/test runs?) distinct from CCPA-016/017 functional-outcome metrics. Bidirectional-sensitivity via synthetic identity (passes) + regression (fails) + give-up-fast (asymmetric: 100% pass but zero recovery FAILS recovery floor — the canonical R3 distinguishing test). PROPOSED at v1.29.0; ACTIVE_RUNTIME pending first operator-dispatched Arena bench.' }
scope: >
every recorded fixture under <ccpa-repo>/fixtures/, every replay run the
ccpa-cli produces, AND every CI run on the companion repo. The four
source-of-truth invariants (FALSIFY-CCPA-009..012) gate every PR from M0
onward; the eight parity gates (001..008) come online M1..M6. Out of
scope: live api.anthropic.com calls during CI; teacher-side semantic
content of assistant messages.
# ─────────────────────────────────────────────────────────────────────────────
# Companion-repo source-of-truth policy
# ─────────────────────────────────────────────────────────────────────────────
# `claude-code-parity-apr` (the companion repo) is canonical for ENFORCEMENT —
# implementation, fixtures, CI, coverage, pmat-comply, and the running binary
# that consumes this contract. The contract TEXT (this YAML) lives in
# aprender/contracts/ per the monorepo single-source-of-truth policy
# (memory: feedback_monorepo_single_source_of_truth.md). The companion repo
# pins this contract by commit hash via contracts/pin.lock, and gates every
# PR against it via FALSIFY-CCPA-012.
#
# This split is intentional: aprender stays the canonical home for contract
# TEXT (the schema lives there, `pv` validates from there), while the
# companion repo is canonical for runtime ENFORCEMENT (its CI is the gate
# users see when they open a PR; its coverage is what users measure; its
# pmat-comply config is what users edit). M1 relocates this YAML into the
# companion repo as the canonical copy and replaces the aprender-side copy
# with a redirect note.
companion_repo:
url: 'https://github.com/paiml/claude-code-parity-apr'
branch: main
branch_protection:
required_status_checks:
- 'ci/gate'
enforce_admins: true
require_linear_history: true
allow_force_pushes: false
allow_deletions: false
required_invariants:
- FALSIFY-CCPA-009 # ci_main_branch_green
- FALSIFY-CCPA-010 # pmat_comply_100pct
- FALSIFY-CCPA-011 # line_coverage_100pct
- FALSIFY-CCPA-012 # pv_contract_gate_on_commit
contract_pin:
file: 'contracts/pin.lock'
schema: 'aprender_commit_hash + aprender_contract_path + sha256(yaml)'
refreshed_by: 'PR-time GitHub Action that fetches aprender@main, recomputes sha256, fails if drift'
forbidden_tools:
- cargo-tarpaulin # CLAUDE.md § "Prohibited Tools" — slow, unreliable
- 'bash gates re-implementing pv' # CLAUDE.md § "DOGFOOD pv, NEVER bash"
# ─────────────────────────────────────────────────────────────────────────────
# Harness policy (mirrors apr-code-parity-v1.yaml § harness_policy)
# ─────────────────────────────────────────────────────────────────────────────
# Every falsification gate below MUST be enforced via `pv validate` (binary
# from the in-tree `aprender-contracts-cli` crate). Bash/yq/python wrappers
# are explicitly rejected by CLAUDE.md § "Contract Validation: DOGFOOD pv,
# NEVER bash" and by memory feedback_pv_not_bash_for_contracts.md.
#
# If `pv validate` does not yet support a needed assertion shape (e.g. the
# parity-score reduction in FALSIFY-CCPA-008), the fix is to extend
# aprender-contracts/src/schema/ and aprender-contracts/src/eval/ — never to
# bypass with a shell script. Schema-extension ticket: PMAT-CONTRACTS-CCPA-001.
harness_policy:
dogfood_tool: aprender-contracts-cli
binary: pv
forbidden_alternatives: [bash, shell, yq-wrapper, python-script]
schema_extension_ticket: PMAT-CONTRACTS-CCPA-001
rationale_ref: 'CLAUDE.md § "Contract Validation: DOGFOOD pv, NEVER bash"'
# ─────────────────────────────────────────────────────────────────────────────
# Roles + binary surfaces
# ─────────────────────────────────────────────────────────────────────────────
roles:
teacher:
binary: claude
label: Anthropic Claude Code (closed-source)
instrumentation: 'recording HTTPS proxy at $ANTHROPIC_BASE_URL'
api_target: 'api.anthropic.com (live, recording phase only)'
student:
binary: apr
subcommand: 'apr code'
label: aprender code agent (open-source, this monorepo)
instrumentation: 'crates/aprender-orchestrate `LlmDriver` impl swapped to RecordedDriver'
model_default: 'unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF Q4_K_M (per apr-claude-proxy-v1)'
recorder:
binary: ccpa
subcommand: 'ccpa record'
label: 'mitm-style HTTPS proxy that mints fixtures'
replayer:
binary: ccpa
subcommand: 'ccpa replay'
differ:
binary: ccpa
subcommand: 'ccpa diff'
# ─────────────────────────────────────────────────────────────────────────────
# Trace schema (.ccpa-trace.jsonl) — JSONL, one record per line
# ─────────────────────────────────────────────────────────────────────────────
# Consumed by ccpa-trace (serde Rust) and ccpa-differ. Schema below is the
# source-of-truth; Rust types must be `serde(deny_unknown_fields)`. New record
# kinds bump trace_schema_version and require a contract revision.
trace_schema:
version: 2 # M15 — added HookEvent + SkillInvocation; per-record `v` field still 1 (record-level back-compat layer unused so far)
format: 'application/x-ndjson'
filename_pattern: 'fixtures/[0-9]{4}-[a-z0-9-]+.ccpa-trace.jsonl'
envelope:
type: object
required: [v, kind]
properties:
v:
type: integer
const: 1
description: 'per-record schema version; trace_schema.version is the file-level surface (now v2)'
kind:
type: string
enum: [session_start, user_prompt, assistant_turn, tool_result, session_end, hook_event, skill_invocation]
records:
session_start:
required: [v, kind, session_id, ts, actor, model, cwd_sha256]
properties:
session_id: { type: string, format: uuid, description: 'UUIDv7 — normalized to <SESSION> on diff' }
ts: { type: string, format: date-time, description: 'ISO-8601 UTC — normalized to <TS> on diff' }
actor: { type: string, enum: [claude-code, apr-code] }
model: { type: string, description: 'Model id; teacher and student differ here by design' }
cwd_sha256: { type: string, pattern: '^[0-9a-f]{64}$', description: 'git tree hash of CWD at session start; teacher==student is asserted' }
user_prompt:
required: [v, kind, turn, text]
properties:
turn: { type: integer, minimum: 0 }
text: { type: string }
assistant_turn:
required: [v, kind, turn, blocks, stop_reason]
properties:
turn: { type: integer, minimum: 1 }
blocks:
type: array
minItems: 1
items:
oneOf:
- { $ref: '#/definitions/block_text' }
- { $ref: '#/definitions/block_thinking' }
- { $ref: '#/definitions/block_tool_use' }
stop_reason: { type: string, enum: [end_turn, max_tokens, stop_sequence, tool_use] }
tool_result:
required: [v, kind, turn, tool_use_id, ok, content]
properties:
turn: { type: integer, minimum: 2 }
tool_use_id: { type: string, description: 'Normalized to <TOOL-N> on diff' }
ok: { type: boolean }
content: { type: string, description: 'Tool stdout/stderr concatenation OR structured JSON' }
side_effects:
type: object
additionalProperties: false
properties:
files_read: { type: array, items: { type: string } }
files_written: { type: array, items: { type: string } }
exit_code: { type: integer }
session_end:
required: [v, kind, turn, stop_reason]
properties:
turn: { type: integer, minimum: 1 }
stop_reason: { type: string, enum: [end_turn, max_tokens, stop_sequence, error] }
elapsed_ms: { type: integer, minimum: 0 }
tokens_in: { type: integer, minimum: 0 }
tokens_out: { type: integer, minimum: 0 }
normalization_rules:
- 'session_id → <SESSION> for diff purposes'
- 'ts → <TS> for diff purposes'
- 'tool_use.id and tool_result.tool_use_id → <TOOL-N> per turn, dense-numbered'
- 'absolute paths under CWD → ${CWD}-relative'
- 'all other fields are byte-exact under diff'
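# Illustrative only (not normative): a sketch of how the rules above might
# rewrite one tool_result record before diffing. The id and path are made-up
# values, CWD is assumed to be /home/dev/proj, and dense numbering is assumed
# to start at 1. Whether the differ keeps a ${CWD}/ prefix on relative paths
# is not specified above; a bare relative path is shown.
#   before: {"v":1,"kind":"tool_result","turn":2,"tool_use_id":"toolu_01AbC","ok":true,"content":"","side_effects":{"files_written":["/home/dev/proj/src/lib.rs"]}}
#   after:  {"v":1,"kind":"tool_result","turn":2,"tool_use_id":"<TOOL-1>","ok":true,"content":"","side_effects":{"files_written":["src/lib.rs"]}}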
definitions:
block_text:
type: object
required: [type, text]
additionalProperties: false
properties:
type: { const: text }
text: { type: string }
block_thinking:
type: object
required: [type, thinking]
additionalProperties: false
properties:
type: { const: thinking }
thinking: { type: string }
signature: { type: string }
block_tool_use:
type: object
required: [type, id, name, input]
additionalProperties: false
properties:
type: { const: tool_use }
id: { type: string }
name: { type: string }
input: { type: object }
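# Illustrative only (not normative): a minimal trace that appears to satisfy
# the required fields above, one JSON record per line. All values are
# placeholders: <model-id> stands for any model string and <64-hex> for a
# 64-character lowercase hex digest. The turn numbering (prompt 0, assistant 1,
# tool result 2, ...) is an assumption inferred from the `minimum` bounds.
#   {"v":1,"kind":"session_start","session_id":"018f3b2e-7c1a-7000-8000-000000000001","ts":"2026-04-27T00:00:00Z","actor":"apr-code","model":"<model-id>","cwd_sha256":"<64-hex>"}
#   {"v":1,"kind":"user_prompt","turn":0,"text":"run the tests"}
#   {"v":1,"kind":"assistant_turn","turn":1,"blocks":[{"type":"tool_use","id":"toolu_01","name":"Bash","input":{"command":"cargo test"}}],"stop_reason":"tool_use"}
#   {"v":1,"kind":"tool_result","turn":2,"tool_use_id":"toolu_01","ok":true,"content":"test result: ok","side_effects":{"exit_code":0}}
#   {"v":1,"kind":"assistant_turn","turn":3,"blocks":[{"type":"text","text":"All tests pass."}],"stop_reason":"end_turn"}
#   {"v":1,"kind":"session_end","turn":3,"stop_reason":"end_turn"}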
# ─────────────────────────────────────────────────────────────────────────────
# Tool-call equivalence rules (consumed by FALSIFY-CCPA-004)
# ─────────────────────────────────────────────────────────────────────────────
# Each tool the teacher emits MUST have an equivalence rule below.
# Adding a new tool to apr code's registry requires a new entry here.
tool_equivalence_rules:
- tool: Bash
semantic_input: 'normalize(command) — collapse runs of whitespace, drop trailing semicolons'
equality: 'string equality after normalize'
- tool: Read
semantic_input: '(path_normalized, offset?, limit?)'
equality: 'tuple equality; offset/limit absent ≡ 0/EOF'
- tool: Write
semantic_input: '(path_normalized, content_sha256)'
equality: 'tuple equality; content compared by sha256, not bytes-in-trace'
- tool: Edit
semantic_input: '(path_normalized, post_state_sha256)'
equality: 'apply edit to the pre-state file, sha256 the result; teacher and student must produce the same post_state_sha256'
rationale: 'Different patch texts that produce the same file are equivalent. Same-file-after is what users actually care about.'
- tool: Glob
semantic_input: 'pattern (verbatim)'
equality: 'string equality'
- tool: Grep
semantic_input: '(pattern, path_normalized, regex_or_literal)'
equality: 'tuple equality after pattern.trim()'
- tool: Agent
semantic_input: '(subagent_type, prompt_sha256)'
equality: 'tuple equality; subagent_type drawn from the closed Claude Code roster (general-purpose|explore|plan)'
note: 'agent_id not compared — it''s session-scoped'
- tool: '*' # default rule for any tool not enumerated above
semantic_input: 'JSON canonicalize(input) sha256'
equality: 'sha256 equality'
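# Illustrative only (not normative): how the rules above would classify a few
# hypothetical teacher/student pairs, under one reading of each rule.
#   Bash  "cargo   test;"  vs  "cargo test"
#         -> equivalent (whitespace runs collapsed, trailing semicolon dropped)
#   Bash  "cargo test"  vs  "cargo test --release"
#         -> not equivalent (drift category: mismatched_tool_input)
#   Read  (src/lib.rs, no offset, no limit)  vs  (src/lib.rs, offset 0, limit EOF)
#         -> equivalent (absent offset/limit ≡ 0/EOF)
#   Edit  different patch texts that yield the same post_state_sha256
#         -> equivalent
#   Glob  "src/**/*.rs"  vs  "src/**/*.rs " (trailing space)
#         -> not equivalent (pattern is compared verbatim, no trim)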
# ─────────────────────────────────────────────────────────────────────────────
# File-mutation equivalence (consumed by FALSIFY-CCPA-005)
# ─────────────────────────────────────────────────────────────────────────────
file_mutation_equivalence:
measured_at: 'session_end'
algorithm: |
1. snapshot CWD git tree hash at session_start (cwd_sha256)
2. snapshot CWD git tree hash at session_end (after_sha256)
3. compute diff = set of paths whose content differs between the teacher's and the student's session_end trees (after_teacher_sha256 vs after_student_sha256)
4. equivalent iff diff is empty OR every differing file passes per_file_rule below
per_file_rule:
- filetype: '*.rs'
rule: 'rustfmt --check on both produces same canonical form'
- filetype: '*.toml'
rule: 'taplo fmt --stdin produces same canonical form'
- filetype: '*.md'
rule: 'normalize trailing whitespace + final newline; compare'
- filetype: '*'
rule: 'byte-exact'
excluded_paths:
- 'target/**'
- '.git/**'
- '*.lock' # Cargo.lock churn from teacher tooling does not count
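# Illustrative only (not normative): one reading of the algorithm above on a
# hypothetical run where the teacher and student trees differ in two paths.
#   src/main.rs  differs only in formatting; both sides canonicalize to the
#                same form -> passes the '*.rs' rule
#   Cargo.lock   matches excluded_paths ('*.lock') -> ignored
# Every remaining differing file passes its per_file_rule, so the run counts
# as equivalent. Had notes.txt also differed by a single byte, the '*' rule
# (byte-exact) would fail it and the run would be reported as
# mismatched_file_state.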
# ─────────────────────────────────────────────────────────────────────────────
# Parity score reduction (consumed by FALSIFY-CCPA-008)
# ─────────────────────────────────────────────────────────────────────────────
parity_score:
per_fixture: |
score = (matched_actions / total_teacher_actions)
matched_actions = sum over teacher.assistant_turn[*].tool_use[*]:
1 if student emitted an equivalent (tool, semantic_input) at the same turn position
else 0
file_mutation_match adds 1 to numerator and denominator
thresholds:
aggregate_min: 0.95 # corpus mean; FALSIFY-CCPA-008 fails below this
individual_min: 0.80 # any single fixture below this also fails the gate
drift_categories:
- extraneous_llm_call # student called LLM at unexpected point
- missing_tool_call # teacher called Bash, student didn't
- mismatched_tool_input # tool name matched, semantic_input didn't
- mismatched_file_state # file_mutation_equivalence failed
- extra_tool_call # student called Bash, teacher didn't
- turn_order_skew # right calls, wrong order
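# Illustrative only (not normative): worked arithmetic for the reduction
# above, assuming the file-mutation check always adds 1 to the denominator and
# adds 1 to the numerator only when it passes.
#   fixture A: 4 teacher tool_use actions, 3 matched, file mutation matches
#              score = (3 + 1) / (4 + 1) = 0.80   (not below individual_min)
#   fixtures B and C: score = 1.00 each
#   corpus mean = (0.80 + 1.00 + 1.00) / 3 ≈ 0.933  <  aggregate_min 0.95
# This hypothetical corpus would fail FALSIFY-CCPA-008 on the aggregate bound
# even though every fixture clears the per-fixture floor.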
# ─────────────────────────────────────────────────────────────────────────────
# Sovereignty constraint (consumed by FALSIFY-CCPA-006)
# ─────────────────────────────────────────────────────────────────────────────
sovereignty:
forbidden_replay_egress:
- 'api.anthropic.com'
- '*.anthropic.com'
enforcement: 'CI test container drops all outbound except 127.0.0.1; network policy asserted by ccpa-replayer test harness, not by trust'
rationale: >
Sovereign-AI is aprender's core value prop. A replay run that silently
talks to api.anthropic.com (e.g. via a leaked teacher API key in env) is a
ship-blocker bug — same posture as apr-claude-proxy-v1 § FALSIFY-CLAUDE-PROXY-006.
# ─────────────────────────────────────────────────────────────────────────────
# Falsification conditions
# ─────────────────────────────────────────────────────────────────────────────
# Each gate maps 1:1 to a Rust test in the planned ccpa-* crates.
# Status legend:
# PLANNED_M{n} — defined here, lands when milestone M{n} ships
# ACTIVE — green in CI on the planned ccpa-cli repo
# Promotion gate: contract status DRAFT → ACTIVE only when all 8 are ACTIVE.
falsification_conditions:
- id: FALSIFY-CCPA-001
name: trace_schema_roundtrip
status: PLANNED_M1
assertion: >
Every committed fixture under fixtures/*.ccpa-trace.jsonl parses cleanly
against trace_schema, re-serializes byte-identical (modulo lexicographic
JSON key ordering), and round-trips through ccpa-trace::Trace::{from_jsonl,
to_jsonl} without loss.
test_harness: 'ccpa-trace/tests/falsify_ccpa_001_roundtrip.rs (planned)'
failing_examples:
- 'unknown record kind'
- 'session_start with non-uuid session_id'
- 'cwd_sha256 not 64-hex-chars'
- 'extra field on assistant_turn (deny_unknown_fields)'
- id: FALSIFY-CCPA-002
name: replay_determinism
status: PLANNED_M3
assertion: >
Replaying the same fixture twice with the same `apr code` revision and
empty $TMPDIR produces byte-identical student traces after applying
trace_schema.normalization_rules. Determinism is asserted across two
back-to-back runs in the same container.
test_harness: 'ccpa-replayer/tests/falsify_ccpa_002_determinism.rs (planned)'
failing_examples:
- 'wallclock leaking into a tool argument'
- 'HashMap iteration order in a tool input'
- id: FALSIFY-CCPA-003
name: mock_completeness
status: PLANNED_M3
assertion: >
RecordedDriver consumes exactly len(teacher.assistant_turns) responses;
no missing turn (panic), no extra turn (assertion). On a well-formed
fixture, ccpa replay exits 0 with stdout containing
"consumed all N teacher turns".
test_harness: 'ccpa-replayer/tests/falsify_ccpa_003_mock.rs (planned)'
- id: FALSIFY-CCPA-004
name: tool_call_equivalence
status: PLANNED_M4
assertion: >
Per turn, the multiset of (tool_name, semantic_input) pairs in the
student trace equals the multiset in the teacher trace under
tool_equivalence_rules. Inequivalence is reported with a typed
drift_category (one of parity_score.drift_categories).
test_harness: 'ccpa-differ/tests/falsify_ccpa_004_tool_equivalence.rs (planned)'
- id: FALSIFY-CCPA-005
name: file_mutation_equivalence
status: PLANNED_M4
assertion: >
At session_end, the teacher and student CWD trees for the same fixture,
both started from the same cwd_sha256, satisfy file_mutation_equivalence:
the git-tree hashes match, or every differing file passes per_file_rule.
test_harness: 'ccpa-differ/tests/falsify_ccpa_005_file_state.rs (planned)'
- id: FALSIFY-CCPA-006
name: sovereignty_on_replay
status: PLANNED_M5
assertion: >
During `ccpa replay`, zero outbound socket connections are opened to
hosts in sovereignty.forbidden_replay_egress. Asserted by running the
replay inside a network-namespaced container that drops all egress
except 127.0.0.1; any outbound connection attempt is logged and fails
the test.
test_harness: 'ccpa-replayer/tests/falsify_ccpa_006_sovereignty.rs (planned)'
parity_with: 'apr-claude-proxy-v1 § FALSIFY-CLAUDE-PROXY-006'
- id: FALSIFY-CCPA-007
name: corpus_coverage
status: HARD_BLOCKING_M16
assertion: >
For every row in apr-code-parity-v1.yaml § categories[*] whose status
is in {SHIPPED, PARTIAL} AND whose id is NOT in the contract-declared
out-of-scope list {keyboard-shortcuts, status-line}, at least one
fixture in fixtures/canonical/ exercises that capability (fixture's
meta.toml declares `covers = [<row.id>, ...]`).
MISSING rows are exempt; OOS rows are explicitly excluded.
Reachable = (SHIPPED ∪ PARTIAL) \ OOS.
As of M16, gate semantics are HARD-BLOCKING: a PR that drops any
reachable row from coverage fails CI.
oos_rows:
- id: keyboard-shortcuts
reason: |
REPL keystroke handling (Shift+Tab cycle, Ctrl+C, REPL Phase
1/2 keys). REPL events never cross the trace boundary. The
`!` shell-prefix and `@path` file-injection sub-features are
partially observable via UserPrompt.text but not faithfully
distinguishable from the literal-text alternative. OOS at
schema v2.
- id: status-line
reason: |
REPL render artifact (`StatusLine { model, mode, cost_usd,
branch, cwd_short }`). Pure UI; rendered each frame but
never crosses a trace boundary. Apr-code's `status_line.rs`
primitive is unit-tested in-repo; parity at the rendering
level is a separate concern from action-stream parity. OOS
at schema v2.
test_harness: 'crates/ccpa-differ/tests/falsify_ccpa_007_coverage.rs (15 tests inc. 4 OOS-handling tests added M16)'
cross_check_command: |
ccpa coverage \
--apr-code-parity-yaml ../aprender/contracts/apr-code-parity-v1.yaml \
--fixtures-dir fixtures/canonical/ \
--oos-rows keyboard-shortcuts,status-line
expected_exit: 0 # any non-zero exit FAILS the gate
ci_step: '.github/workflows/ci.yml § "corpus_coverage hard-blocking gate"'
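# Illustration of the Reachable set and a fixture's coverage declaration.
# Row ids `row-a`, `row-b`, `row-c` and all statuses shown are hypothetical;
# only keyboard-shortcuts and status-line are taken from this contract.
#
#   rows: row-a (SHIPPED), row-b (PARTIAL), row-c (MISSING),
#         keyboard-shortcuts (SHIPPED), status-line (SHIPPED)
#   Reachable = (SHIPPED ∪ PARTIAL) \ OOS = {row-a, row-b}
#
#   # a fixture's meta.toml under fixtures/canonical/
#   covers = ["row-a", "row-b"]
#
# With that one fixture every reachable row is covered, so the gate is
# expected to pass. Dropping "row-b" from `covers` leaves a reachable row
# uncovered, which is the failure case the hard-blocking gate describes.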
- id: FALSIFY-CCPA-013
name: first_recorded_parity_score
status: ACTIVE_RUNTIME
assertion: |
The contract's status field reaches ACTIVE_RUNTIME if and only if
at least one row of `status_history` contains a `measured_parity:`
block of shape:
{ date, fixture_corpus_path, fixture_count, aggregate_score: <float>,
per_fixture: [{id, score, drift_count}],
teacher_source, student_source }
where:
- aggregate_score >= parity_score.thresholds.aggregate_min (0.95)
- every per_fixture[*].score >= parity_score.thresholds.individual_min (0.80)
- fixture_corpus_path is EITHER `fixtures/canonical/` (AUTHORED
canonical-corpus runs, since v1.2.0) OR `evidence/phase-3/captures/`
(REAL-BINARY bilateral bench, since v1.27.0 / companion-repo M150)
containing >=5 paired teacher/student records
- teacher_source documents how teachers were authored or captured
(e.g. "curated canonical reference; assume Claude Code", or
"real claude binary 2.1.139 on operator's host")
- student_source documents how students were generated or captured
(e.g. "curated", or "ccpa replay vs apr code @<sha>", or
"real apr 0.32.0 + Qwen2.5-Coder-1.5B-Instruct-Q4_K_M")
test_harness: |
companion-repo: `ccpa corpus fixtures/canonical/ --json` produces
the aggregate_score + per_fixture array. The PR landing the
`measured_parity` entry into status_history is what flips the
gate from OPEN -> DISCHARGED.
rationale: |
Through M6 the contract reached ACTIVE because every gate had an
algorithm-level detector. That is necessary but NOT sufficient to
claim "apr code is identical to Claude Code".
v1.0.0 over-claimed by labeling 12 algorithm gates as ACTIVE
without runtime evidence.
v1.1.0 introduced FALSIFY-CCPA-013 as the runtime-discharge gate
and required a fixture corpus recorded from a real Claude Code
HTTPS session via M2.3 proxy.
v1.2.0 (this revision) drops the proxy/recording requirement.
Per project policy "we will not call api, we will assume claude
code", the canonical Claude Code reference is now AUTHORED into
`fixtures/canonical/<id>/teacher.ccpa-trace.jsonl` rather than
recorded over a live network. This is honest: parity is defined
against a CURATED specification of what Claude Code SHOULD do
for each scenario, not against a snapshot of one specific
Claude Code build's network behaviour.
Student traces in `fixtures/canonical/<id>/student.ccpa-trace.jsonl`
can be either:
- curated (representing the apr-code-equivalent we declare correct), or
- generated (recorded from an actual `apr code` run once the
aprender-orchestrate LlmDriver adapter exists)
The `student_source` field of measured_parity records which.
M2.3 (HTTPS proxy) is removed from the spec's path. M3.0
mock-replayer remains; its real-driver follow-up is now
independent of FALSIFY-CCPA-013 discharge.
semantic_change_log:
- { date: '2026-04-27', version_before: '1.0.0', version_after: '1.1.0',
change: 'Added FALSIFY-CCPA-013. Refined status enum: ACTIVE_ALGORITHM_LEVEL ⊂ ACTIVE_RUNTIME. v1.0.0 over-claimed by labeling the 12-gate state as ACTIVE without clarifying that runtime parity was unmeasured.' }
- { date: '2026-04-27', version_before: '1.1.0', version_after: '1.2.0',
change: 'Dropped HTTPS-proxy / live-Claude-Code-recording requirement from FALSIFY-CCPA-013 discharge path per project policy "we will not call api, we will assume claude code". Canonical reference is now AUTHORED in fixtures/canonical/. Per-fixture student traces can be curated or generated. M2.3 proxy removed from spec.' }
- { date: '2026-05-14', version_before: '1.26.0', version_after: '1.27.0',
change: 'Flipped CCPA-013 status from OPEN → ACTIVE_RUNTIME. The assertion has been satisfied since v1.1.0 (3 measured_parity blocks dating 2026-04-27 against fixtures/canonical/), but the gate-level status field was never flipped — stale prose corrected. Also extended fixture_corpus_path constraint to accept evidence/phase-3/captures/ in addition to fixtures/canonical/, recognizing M150 real-binary bilateral bench (claude 2.1.139 + apr 0.32.0 + Qwen2.5-Coder-1.5B-Instruct-Q4_K_M; agreement = 1.0000 on MultiPL-E-Rust HumanEval/0..4) as the strongest empirical discharge anchor. New 4th measured_parity block added below this change-log entry recording the M150 evidence.' }
measured_parity:
date: '2026-05-12'
fixture_corpus_path: 'evidence/phase-3/captures/'
fixture_count: 5
aggregate_score: 1.0000
thresholds: { aggregate_min: 0.95, individual_min: 0.80 }
passes_gate: true
per_fixture:
- { id: 'HumanEval_0_has_close_elements', score: 1.0, drift_count: 0, passes_individual: true }
- { id: 'HumanEval_1_separate_paren_groups', score: 1.0, drift_count: 0, passes_individual: true }
- { id: 'HumanEval_2_truncate_number', score: 1.0, drift_count: 0, passes_individual: true }
- { id: 'HumanEval_3_below_zero', score: 1.0, drift_count: 0, passes_individual: true }
- { id: 'HumanEval_4_mean_absolute_deviation', score: 1.0, drift_count: 0, passes_individual: true }
teacher_source: |
REAL claude binary 2.1.139 invoked on operator's noah-Lambda-Vector
host via /home/noah/.local/bin/claude. Per-fixture invocation:
`claude -p "<prompt from fixtures/multipl-e-rust/<id>/prompt.txt>"`.
Generated Rust written to evidence/phase-3/captures/<id>/teacher.src.rs
as audit-trail evidence. cargo test against fixture's reference
Cargo.toml = PASS for all 5.
student_source: |
REAL apr-cli 0.32.0 invoked on the same host with
--model qwen2.5-coder-1.5b-instruct-q4_k_m.gguf --max-turns 1.
Per-fixture invocation: `apr code -p "<same prompt>"`. Generated
Rust written to evidence/phase-3/captures/<id>/student.src.rs;
cargo test = PASS for all 5. Companion-repo evidence:
evidence/phase-3/multipl-e-rust-scores.json.
capture_milestone: M150
capture_pr: 'companion-repo #136 squash 47bed37'
orthogonal_metrics:
structural_similarity_lines_jaccard: 0.5201 # M153
test_survival_rate_cross_swap: 1.0000 # M154 (10/10 cross-swaps pass)
note: |
Strongest empirical CCPA-013 discharge anchor. Unlike the
2026-04-27 measured_parity blocks (AUTHORED canonical-corpus,
teacher_source = "Curated canonical reference; AUTHORED, not
recorded"), this block records REAL bilateral capture of an
operator-installed claude binary AND an operator-installed apr
code binary on a public benchmark (MultiPL-E-Rust, Cassano et
al. 2022 / arXiv:2208.08227). The M2.3 rescope walked away
from real-teacher recording; M150 walked back to it via outcome
parity (claude generates code → apr code generates code → both
pass the same test oracle). Honest caveats live in
companion-repo docs/specifications/outcome-parity-results.md
§ "What this does NOT prove" — 5-problem POC saturates at 1.0;
full 164-problem MultiPL-E-Rust expansion (M167+ future-work)
will produce a more honest pass@1 curve.
- id: FALSIFY-CCPA-014
name: os_event_parity_bound
status: ACTIVE_RUNTIME
assertion: |
OS-level event parity (axis-2-closure-plan idea (2): CLI
subprocess instrumentation). For every paired fixture in
`fixtures/os-canonical/`:
ccpa_differ::os_event_parity(teacher, student).score() >= 0.95
where the score is a macro-averaged Jaccard over four sets:
- file_open paths
- file_write fd projections
- file_unlink paths
- exec paths
Bidirectional sensitivity: every paired fixture in
`fixtures/os-regression/` MUST score `< 0.95` AND emit
non-empty drift records. A regression fixture that scores
`>= 0.95` is a meter false-negative and ship-blocking.
Trace schema: each fixture file is JSONL where each line is
a `ccpa_subproc::OsEvent` JSON object:
{ "pid": <u32>, "kind": <OsEventKind>, "seq": <u64> }
`OsEventKind` is a tagged enum with variants:
{ "kind": "file_open", "path": "...", "flags": "..." }
{ "kind": "file_write", "fd": "...", "bytes": <u64> }
{ "kind": "file_unlink", "path": "..." }
{ "kind": "exec", "path": "..." }
test_harness: |
`cargo test -p ccpa-differ --test falsify_ccpa_014_os_event_parity`
runs 3 functions:
- canonical_corpus_meets_os_parity_threshold (every fixture >= 0.95)
- regression_corpus_below_os_parity_threshold (every fixture < 0.95 + non-empty drifts)
- identical_traces_score_perfect (self-compare == 1.0)
All three GREEN on the M139 corpus (3 canonical + 1 regression fixtures).
Capture path: `ccpa-trace-subproc <cmd> [args...]` (binary in
crates/ccpa-subproc/, M136) wraps a subprocess under
`strace -f -e trace=open,openat,write,unlink,unlinkat,execve,execveat`
and emits OS-event JSONL to stdout. The differ
(`ccpa_differ::os_event_parity`, M137) consumes those records.
rationale: |
Axis 2 (real differential test against actual Claude Code) was
stuck at ~30% since M2.3 rescope. The completeness-assessment
identified 5 closure paths; idea (2) — CLI subprocess
instrumentation under `strace` — is the cheapest path to
real-input-system-under-test evidence (no Anthropic API
budget needed, no upstream `LlmDriver-public` dependency).
The OS-level differ works at a coarser granularity than the
API-level differ because libc / kernel / runtime layers emit
different ancillary syscalls in different orders between
systems. Set-based Jaccard sidesteps order-sensitivity while
preserving "did both systems touch roughly the same set of
files / spawn the same set of programs?" semantics.
The 0.95 threshold mirrors FALSIFY-CCPA-008's `0.80`
per-fixture floor plus a 0.15 margin: OS-level Jaccard is
set-based and intolerant to single-path divergence, so the
threshold can be tighter than the API-level position-aligned
score.
DRAFT → ACTIVE_RUNTIME at v1.25.0 (this revision) once the
M139 corpus + gate test land.
semantic_change_log:
- { date: '2026-05-11', version_before: '1.24.0', version_after: '1.25.0',
change: 'Added FALSIFY-CCPA-014 to gate registry. Companion-repo M139 ships the corpus + runtime gate; this revision (M140 / M115.5) flips the contract to recognize CCPA-014 as ACTIVE_RUNTIME from authoring.' }
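# Worked example of the macro-averaged Jaccard in the assertion above.
# Paths and fd values are hypothetical. How an empty-vs-empty set pair is
# scored is not stated in this contract, so every set here is non-empty.
#
#   set           teacher                     student                     Jaccard
#   file_open     {a.rs, b.rs, Cargo.toml}    {a.rs, b.rs, Cargo.toml}    3/3 = 1.0
#   file_write    {"1", "3"}                  {"1", "3"}                  2/2 = 1.0
#   file_unlink   {tmp.txt}                   {tmp.txt, tmp2.txt}         1/2 = 0.5
#   exec          {/usr/bin/cargo}            {/usr/bin/cargo}            1/1 = 1.0
#
#   score = (1.0 + 1.0 + 0.5 + 1.0) / 4 = 0.875    # < 0.95
#
# One extra unlinked path is enough to push a pair below the threshold, so a
# pair like this belongs in fixtures/os-regression/, not fixtures/os-canonical/.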
- id: FALSIFY-CCPA-015
name: ccpa_trace_subproc_output_purity
status: ACTIVE_RUNTIME
assertion: |
The `ccpa-trace-subproc` capture binary (crates/ccpa-subproc/, M136)
MUST emit OsEvent JSONL to its stdout WITHOUT contamination by the
wrapped subprocess's own stdout. Every line read from
`ccpa-trace-subproc <cmd> [args...]` stdout MUST decode as a valid
`ccpa_subproc::OsEvent` JSON object:
{ "pid": <u32>, "kind": <OsEventKind>, "seq": <u64> }
Implementation requirement: the spawned subprocess's stdout MUST
be redirected to `Stdio::null()`. Using `Stdio::inherit()` (the
pre-M147 bug) causes the wrapped subprocess's prose output to
interleave with the capture stream — every text line that isn't
valid OsEvent JSON corrupts the trace.
test_harness: |
`cargo test -p ccpa-subproc --test falsify_ccpa_015_output_purity`
spawns `ccpa-trace-subproc /bin/echo CHATTY_TEXT` and asserts that
every line of stdout decodes as `OsEvent`. The test verifies
bidirectional sensitivity: pre-M147 (with Stdio::inherit()) FAILS
because "CHATTY_TEXT" leaks into stdout; post-M147 (with
Stdio::null()) PASSES because only OsEvent records reach stdout.
36 tests green on the current ccpa-subproc workspace: 32 unit
tests + 3 binary smoke tests + 1 falsify_ccpa_015 test.
rationale: |
Phase 2 first-real-capture (M147) surfaced a class of "capture
stream contamination" bugs: the wrapped subprocess's prose output
can leak through the same file descriptor used for the JSONL
capture stream, producing a `teacher.ccpa-os-trace.jsonl` file
that mixes claude's natural-language response with the
OsEvent records the differ expects.
The output-purity invariant is a precondition for any meaningful
OS-level parity analysis: if the trace is corrupted, every
downstream differ produces noise.
Provable-contract design (operator directive "use provable-
contract based design"): the falsifying test was authored BEFORE
the fix, asserted to FAIL on the buggy `Stdio::inherit()` code,
then PASS on the fixed `Stdio::null()` code. The lint encoded by
this gate is now permanent.
PROPOSED at v1.25.0 (M147); promoted ACTIVE_RUNTIME at v1.26.0
(this revision) once Phase 3 work confirmed the invariant holds
across both teacher (claude) and student (apr code) captures on
the operator's host.
semantic_change_log:
- { date: '2026-05-13', version_before: '1.25.0', version_after: '1.26.0',
change: 'Added FALSIFY-CCPA-015 to gate registry. Companion-repo M147 ships the runtime test asserting Stdio::null() output purity for ccpa-trace-subproc; this revision flips the contract to recognize CCPA-015 as ACTIVE_RUNTIME from authoring.' }
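# Illustration of a pure versus a contaminated capture stream for
# `ccpa-trace-subproc /bin/echo CHATTY_TEXT`. The pid, seq and flags values
# are hypothetical, and the nesting follows the literal shape written in the
# assertion; the exact serde layout of OsEvent is an assumption.
#
#   pure (every stdout line decodes as OsEvent):
#     {"pid": 4242, "kind": {"kind": "exec", "path": "/bin/echo"}, "seq": 0}
#     {"pid": 4242, "kind": {"kind": "file_write", "fd": "1", "bytes": 12}, "seq": 1}
#
#   contaminated (wrapped stdout inherited):
#     {"pid": 4242, "kind": {"kind": "exec", "path": "/bin/echo"}, "seq": 0}
#     CHATTY_TEXT
#     {"pid": 4242, "kind": {"kind": "file_write", "fd": "1", "bytes": 12}, "seq": 1}
#
# The second stream fails the gate because the line CHATTY_TEXT is not JSON.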
- id: FALSIFY-CCPA-016
name: outcome_parity_bound
status: ACTIVE_RUNTIME
assertion: |
Outcome parity (Phase 3 P3.4). On a MultiPL-E-Rust-class corpus of
function-level code-generation tasks where both teacher (claude)
and student (apr code) are asked to produce Rust code:
aggregate `agreement` on the corpus MUST be >= 0.5
Where `agreement` is the fraction of prompts on which both systems reach the same oracle outcome (both pass or both fail):
agreement = (both_passed + both_failed) / corpus_size
Plus consistency invariants:
- `corpus_size >= 3` (minimum sample size for statistical meaning)
- `corpus_size == per_fixture.len()` (record-count match)
- `teacher_pass_rate >= 0.5` (validity: teacher must do better
than chance on the corpus for agreement to be informative)
- `both_passed + both_failed <= corpus_size` (count consistency)
Bidirectional sensitivity (mandatory):
- A synthetic regression fixture with `agreement: 0.4` MUST fail
the threshold check (catches false-negative meter bugs).
- A synthetic identity fixture with `agreement: 1.0` MUST pass
(catches false-positive meter bugs).
Source of truth: `evidence/phase-3/multipl-e-rust-scores.json`
produced by `scripts/phase-3-bench.sh` on the companion repo.
test_harness: |
`cargo test -p ccpa-differ --test falsify_ccpa_016_outcome_parity`
runs 4 assertions:
- live_evidence_meets_outcome_parity_threshold
- live_evidence_per_fixture_exit_codes_consistent_with_aggregate
- synthetic_regression_below_outcome_parity_threshold
- synthetic_identity_passes_outcome_parity_threshold
All four GREEN on the companion-repo M150 bench output:
agreement = 1.0000 over 5 fixtures (HumanEval/0..4, Cassano et
al. 2022 / arXiv:2208.08227).
rationale: |
The M149 operator reframe ("so we can ask apr code to generate
same code as claude code and 'it works'") elevated outcome parity
to the primary user-facing parity test, alongside the pre-existing
procedural-parity track (CCPA-014). Outcome parity asks "does the
generated code work?"; procedural parity asks "do both systems
make the same syscalls?" — these can disagree (different runtimes
produce different syscall sets even when generating equivalent
code) and CCPA-016 was designed to capture the user-facing claim
independently of CCPA-014.
The 0.5 threshold is POC-tier per the outcome-parity-plan.md
§ P3.4 design note ("probably 0.5 initially — both systems pass
on half the corpus is a reasonable bar for a POC"). Expanding the
corpus from the 5-fixture M150 POC to the full 164-problem
MultiPL-E-Rust (M164+) will justify raising the threshold to ~0.8.
M150 (PR #136 squash 47bed37) produced the empirical evidence:
teacher_pass_rate = 1.0000 (5/5), student_pass_rate = 1.0000 (5/5),
agreement = 1.0000, both_passed = 5, both_failed = 0. This was a
REAL bilateral bench using claude 2.1.139 + apr 0.32.0 on the
operator's noah-Lambda-Vector host with Qwen2.5-Coder-1.5B as the
apr-code backing model. Companion-repo M153 added the
orthogonal structural-similarity metric (line-set Jaccard =
0.5201; the systems generate functionally-equivalent but
stylistically-divergent Rust). Companion-repo M154 added the
test-survival metric (10/10 cross-swaps pass; the structural
divergence is purely stylistic, not semantic).
PROPOSED at v1.25.0 (M152); promoted ACTIVE_RUNTIME at v1.26.0
(this revision) once M157's consolidated outcome-parity-results
doc shipped and M162 (aprender#1638 MERGED) confirmed the upstream
LlmDriver-adapter discharge.
semantic_change_log:
- { date: '2026-05-13', version_before: '1.25.0', version_after: '1.26.0',
change: 'Added FALSIFY-CCPA-016 to gate registry. Companion-repo M152 ships the gate test against live evidence/phase-3/multipl-e-rust-scores.json; M150 produced the bilateral bench empirical evidence; this revision flips the contract to recognize CCPA-016 as ACTIVE_RUNTIME from authoring.' }
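# Worked example of the agreement arithmetic. The first case uses the M150
# figures quoted in the rationale; the second is a hypothetical corpus that
# reproduces the synthetic-regression value of 0.4.
#
#   M150:          both_passed = 5, both_failed = 0, corpus_size = 5
#                  agreement = (5 + 0) / 5 = 1.0          # >= 0.5: passes
#
#   hypothetical:  both_passed = 1, both_failed = 1,
#                  teacher-only pass = 3, corpus_size = 5
#                  agreement         = (1 + 1) / 5 = 0.4  # < 0.5: fails
#                  teacher_pass_rate = (1 + 3) / 5 = 0.8  # >= 0.5: corpus is valid
#
# The hypothetical corpus is a valid measurement (the teacher clears the
# validity floor) that still fails on agreement, the case the gate must catch.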
- id: FALSIFY-CCPA-017
name: project_scale_parity_bound
status: PROPOSED
assertion: |
Project-scale parity (Phase 4 P4.4). On a multi-file Cargo-workspace
task corpus where each task is drawn from a real open GitHub issue
(companion-repo M182: fixtures/project-scale/ initially 5 fixtures
across paiml/decy + paiml/bashrs + paiml/depyler), both teacher
(claude) and student (apr code) are dispatched in a clone of the
pinned pre_fix_commit SHA + given the issue body as their prompt
+ their final repo state is scored against the per-fixture
completion oracle_cmd. The aggregate project-scale parity report
MUST satisfy BOTH:
- aggregate `partial_agreement` >= 0.3
- aggregate `files_jaccard_corpus` >= 0.3
Where derived metrics are:
partial_agreement = mean over fixtures of
min(teacher.oracle_pass, student.oracle_pass)
files_jaccard_corpus = mean over fixtures of
|teacher.files_touched ∩ student.files_touched|
/ |teacher.files_touched ∪ student.files_touched|
Plus consistency invariants:
- `corpus_size >= 3` (minimum sample size for statistical meaning)
- `corpus_size == per_fixture.len()` (record-count match)
Bidirectional sensitivity (mandatory):
- A synthetic regression fixture (one side passes, other fails,
disjoint files-touched lists) MUST fail BOTH thresholds.
- A synthetic identity fixture (both sides pass and touch the same
files, so files_touched_jaccard = 1.0) MUST pass.
- An empty-corpus report MUST fail (prevents "no-data" from
being claimed as success).
Source of truth: `evidence/phase-4/project-scale-scores.json`
produced by `scripts/phase-4-bench.sh` on the companion repo.
test_harness: |
`cargo test -p ccpa-differ --test falsify_ccpa_017_project_scale_parity`
runs 7 active assertions + 1 `#[ignore]`'d live-evidence assertion:
- synthetic_identity_corpus_passes_gate
- synthetic_regression_corpus_fails_gate
- empty_corpus_vacuously_fails_threshold
- exactly_at_threshold_passes (verifies >= not >)
- just_below_partial_threshold_fails (single-gate sensitivity)
- just_below_files_threshold_fails (single-gate sensitivity)
- threshold_constants_match_plan (sentinel)
- live_evidence_meets_project_scale_threshold (#[ignore]'d
until operator dispatches `bash scripts/phase-4-bench.sh`)
All 7 active GREEN on the companion-repo M188 scaffold (synthetic
fixtures constructed in-test, no on-disk corpus dependency).
rationale: |
The M180 Phase 4 plan operationalizes the M159 ProgramBench
prior-art (arXiv:2605.03546) into companion-tier project-scale
parity testing. ProgramBench reports 0%/200 fully-resolved across
Claude Opus/Sonnet/Haiku + GPT + Gemini at the project-scale
layer; this evidence validates the M159 caveat "function-level
1.0 does not extrapolate to project-scale" and establishes the
Phase 4 SIGNAL regime: the user-facing parity question is
"do both systems make matching partial progress?" not "do both
systems fully succeed?".
The DUAL-threshold design (partial_agreement >= 0.3 AND
files_jaccard_corpus >= 0.3) is intentional: project-scale parity
has two orthogonal signal channels — pass-rate agreement AND
files-touched overlap. A system could match pass rate without
touching the same files (different solutions to same problem);
or touch the same files without matching pass rate (one fixes
the bug, the other breaks more). Both channels must show
agreement for "project-scale parity" to mean anything.
Threshold values (0.3/0.3) are tentative POC-tier floors. They
WILL be recalibrated after first operator-dispatched measurement
against the M182 corpus. A 0.5/0.5 threshold à la CCPA-016 would
assume saturation that ProgramBench evidence shows doesn't exist
at project-scale; 0.3 is "at least 30% of fixtures see matching
progress" — a plausible POC-tier floor that the M182 corpus
might actually meet.
Status PROPOSED (not ACTIVE_RUNTIME) because no
operator-dispatched measurement has produced
evidence/phase-4/project-scale-scores.json yet. The
live-evidence test is `#[ignore]`'d until that file exists.
Once the operator runs `bash scripts/phase-4-bench.sh` and the
gate passes against real data, a v1.29.0 bump will flip
PROPOSED → ACTIVE_RUNTIME.
Companion-repo Phase 4 sequence (M180-M188):
M180 (PR #167 squash c7107b9) — phase-4-project-scale-plan.md
authored; P4.1-P4.5 sub-deliverables defined.
M182 (PR #169 squash b36ceb6) — P4.1 corpus: 5 fixtures from
paiml/decy#40 + paiml/decy#39 + paiml/bashrs#209 +
paiml/depyler#223 + paiml/depyler#224. Operator directive
"why not use ../decy ../bashrs and ../depy corpus" steered
authoring toward real GitHub issues over synthetic stretch
goals.
M184 (PR #171 squash 0f8c451) — P4.2 runner: phase-4-bench.sh
(288 lines bash); clones at pre_fix_commit SHA + dispatches
+ snapshots + runs oracle + emits per-fixture and aggregate
JSON with files_touched_jaccard via jq set-arithmetic.
M186 (PR #173 squash c115966) — P4.3 scoring: project_scale_diff.rs
(~310 lines Rust) consumes runner JSON + adds 5 derived
metrics + passes_threshold predicate; 14 unit tests GREEN.
M188 (PR #175 squash a574655) — P4.4 gate test:
falsify_ccpa_017_project_scale_parity.rs (~260 lines); 7
synthetic-fixture tests verify bidirectional sensitivity
before any real measurement exists.
semantic_change_log:
- { date: '2026-05-15', version_before: '1.27.0', version_after: '1.28.0',
change: "Added FALSIFY-CCPA-017 to gate registry at status: PROPOSED. Companion-repo M188 ships the gate test scaffold (7 synthetic-fixture tests + 1 #[ignore]'d live-evidence test); thresholds (partial_agreement >= 0.3 AND files_jaccard_corpus >= 0.3) are tentative POC-tier floors awaiting first operator-dispatched measurement to calibrate. Phase 4 P4.5 contract bump." }
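# Worked example of the two CCPA-017 aggregates on a hypothetical 3-fixture
# corpus. File names are illustrative, and oracle_pass is assumed to be
# encoded as 0 or 1.
#
#   fixture  teacher  student  min  teacher files           student files          Jaccard
#   f1       1        1        1    {lib.rs, parse.rs}      {lib.rs, parse.rs}     2/2 = 1.0
#   f2       1        0        0    {lib.rs, tests/t.rs}    {lib.rs}               1/2 = 0.5
#   f3       0        0        0    {a.rs}                  {b.rs}                 0/2 = 0.0
#
#   partial_agreement    = (1 + 0 + 0) / 3       = 0.3333   # >= 0.3
#   files_jaccard_corpus = (1.0 + 0.5 + 0.0) / 3 = 0.5      # >= 0.3
#
# Both floors hold, so this corpus would pass. Note that f3 contributes 0 to
# partial_agreement even though both sides agree (both fail): the min() term
# counts only fixtures where both sides pass the oracle.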
- id: FALSIFY-CCPA-018
name: arena_recovery_rate_bound
status: PROPOSED
assertion: |
Arena recovery-rate (Phase 5 P5.4). On a multi-turn live Arena
bench against the M182 project-scale corpus (companion-repo
fixtures/project-scale/) where each task is driven through an
ArenaSession with up to max_turns=20 dialog turns and
bash/test execution feedback per turn, the aggregate Arena scores
MUST satisfy BOTH:
- aggregate `recovery_rate` >= 0.5
- aggregate `oracle_passed_rate` >= 0.3
Where derived metrics are:
recovery_rate = (teacher_recovered + student_recovered) /
(corpus_size * 2)
oracle_passed_rate = (teacher_passed + student_passed) /
(corpus_size * 2)
recovery_observed = OraclePassed AND any_bash_failure_in_history
(per side per fixture)
Plus consistency invariants:
- `corpus_size >= 3` (minimum sample size for statistical meaning)
- `corpus_size == per_fixture.len()` (record-count match)
Bidirectional sensitivity (mandatory):
- A synthetic identity fixture (all pass + all recovered) MUST
pass.
- A synthetic regression fixture (no pass, no recovery) MUST fail.
- A synthetic give-up-fast fixture (100% pass BUT zero recovery)
MUST fail on the recovery floor — this is the canonical R3
distinguishing test: a system that solves easy tasks zero-shot
but never recovers from a hard task's first failure is NOT
accepted by CCPA-018.
- An empty-corpus report MUST fail (prevents "no-data" from
being claimed as success).
Source of truth: `evidence/phase-5/arena-scores.json` produced
by `scripts/phase-5-arena-bench.sh` on the companion repo.
CCPA-018 measures AGENT QUALITY (does the agent recover?),
distinct from CCPA-016/017 which measure FUNCTIONAL OUTCOME
(does the code work?). Direct empirical answer to
design-audit.md §6 R3 "self-correction over zero-shot
determinism".
test_harness: |
`cargo test -p ccpa-arena --test falsify_ccpa_018_arena_recovery_rate`
runs 7 active assertions + 1 `#[ignore]`'d live-evidence assertion:
- synthetic_identity_corpus_passes_gate
- synthetic_regression_corpus_fails_gate
- synthetic_give_up_fast_fails_on_recovery_floor (THE canonical
R3 distinguishing test)
- empty_corpus_vacuously_fails_threshold
- exactly_at_thresholds_passes (verifies >= not >)
- just_below_recovery_threshold_fails (single-gate sensitivity)
- threshold_constants_match_plan (sentinel)
- live_evidence_meets_arena_recovery_threshold (#[ignore]'d
until operator dispatches `bash scripts/phase-5-arena-bench.sh`)
Plus the falsifier-of-falsifier comparator at
`cargo test -p ccpa-arena --test falsify_static_vs_arena`
(companion-repo M206 P5.5): 4 active synthetic tests + 1
`#[ignore]`'d live-evidence test that loads BOTH evidence files
(CCPA-016 + CCPA-018) and emits a `FalsifierVerdict` per
design-audit.md §5's Popperian test.
All 7 + 4 active GREEN on the companion-repo M206 scaffold
(synthetic fixtures constructed in-test, no on-disk corpus
dependency).
rationale: |