Falsification conditions (20 gates total)

Top spec: claude-code-parity-apr-poc.md

This spec defines 20 falsifiable gates: 4 source-of-truth invariants (CCPA-009..012, M0+) and 16 behavioral parity / process gates (CCPA-001..008 + CCPA-013, M1..M11; CCPA-014, M115.4 axis-2-closure-plan; CCPA-015 + CCPA-016, M147 + M152 Phase 3 outcome-parity; CCPA-017, M188 Phase 4 project-scale parity; CCPA-018, M204 Phase 5 Arena recovery-rate; CCPA-019, M236 calibration-required-before-verdict; CCPA-020, M258 Phase 6 contract-compliance-per-turn). All gates are asserted via pv validate contracts/claude-code-parity-apr-v1.yaml, per CLAUDE.md § Contract Validation: DOGFOOD pv, NEVER bash. See invariants.md for invariants 1-4.

Source-of-truth invariants (M0+)

| ID | Name | Phase | Mechanically asserted by |
| --- | --- | --- | --- |
| FALSIFY-CCPA-009 | ci_main_branch_green | M0+ | gh api repos/paiml/claude-code-parity-apr/branches/main/protection returns ci/gate ∈ required contexts |
| FALSIFY-CCPA-010 | pmat_comply_100pct | M0+ | pmat comply check --json returns compliance_pct == 100.0 ∧ total_violations == 0 |
| FALSIFY-CCPA-011 | line_coverage_100pct | M0+ | cargo llvm-cov --fail-under-lines 100 --fail-uncovered-lines 0 exits 0 |
| FALSIFY-CCPA-012 | pv_contract_gate_on_commit | M0+ | pre-commit hook + CI both run pv validate contracts/claude-code-parity-apr-v1.yaml, exit 0 |
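
As an illustration only, the predicates behind CCPA-009 and CCPA-010 can be sketched in Rust. The `ComplyReport` struct and both function names are hypothetical; the sketch assumes the JSON from pmat comply check and the required-contexts list from gh api have already been decoded into plain values. The enforced check remains pv validate, not this code.

```rust
/// Hypothetical, already-decoded view of the `pmat comply check --json` output.
/// Only the two fields named by FALSIFY-CCPA-010 are modeled (an assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct ComplyReport {
    pub compliance_pct: f64,
    pub total_violations: u64,
}

/// CCPA-010 predicate: compliance_pct == 100.0 AND total_violations == 0.
pub fn ccpa_010_holds(report: &ComplyReport) -> bool {
    report.compliance_pct == 100.0 && report.total_violations == 0
}

/// CCPA-009 predicate: "ci/gate" is among the required status-check contexts.
pub fn ccpa_009_holds(required_contexts: &[&str]) -> bool {
    required_contexts.contains(&"ci/gate")
}
```

Both conditions are conjunctive: a report with 100.0 percent compliance but one violation is rejected, as is a branch whose required contexts omit ci/gate.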

Behavioral parity gates (M1..M11)

| ID | Name | Phase | Assertion summary |
| --- | --- | --- | --- |
| FALSIFY-CCPA-001 | trace_schema_roundtrip | M1 | every fixture parses, re-serializes byte-identical, validates against trace_schema |
| FALSIFY-CCPA-002 | replay_determinism | M3 | replaying same fixture twice → byte-identical student traces (after normalization) |
| FALSIFY-CCPA-003 | mock_completeness | M3 | RecordedDriver consumes exactly len(teacher.assistant_turns) responses; no missing, no extras |
| FALSIFY-CCPA-004 | tool_call_equivalence | M4 | per turn, multiset of (tool_name, semantic_input) pairs in student matches teacher under per-tool equivalence rules (Edit: post-state sha256; Bash: normalized command; etc.) |
| FALSIFY-CCPA-005 | file_mutation_equivalence | M4 | union diff over CWD after apr code finishes equals union diff after Claude Code finished, modulo per-filetype canonicalization |
| FALSIFY-CCPA-006 | sovereignty_on_replay | M5 | zero outbound sockets to *.anthropic.com during replay; CI test container drops all egress except 127.0.0.1 |
| FALSIFY-CCPA-007 | corpus_coverage | M5 | ≥1 fixture per non-MISSING row of apr-code-parity-v1.yaml (currently 17 of 21) |
| FALSIFY-CCPA-008 | parity_score_bound | M6 | aggregate parity_score ≥ 0.95 and per-fixture ≥ 0.80. SOFT-DEPRECATED at M230 + status: ADVISORY at v1.30.0 / M232 (post-M224 Popperian StaticFalsified verdict): the gate still enforces the score threshold on the 30 AUTHORED canonical fixtures, but the 1.0000 result is now interpreted as meter validation (the differ correctly recognizes equivalent traces), NOT system-level parity validation. The user-facing parity claims move to CCPA-016 (function-scale outcome) + CCPA-017 (project-scale partial-progress) + CCPA-018 (Arena recovery-rate). See static-fixture-deprecation.md for the full reframe + audit trail. Contract status field is annotated ADVISORY in the CCPA-008 summary at upstream aprender v1.30.0 (aprender#1735). |
| FALSIFY-CCPA-013 | first_recorded_parity_score | M11 | fixtures/canonical/measured-parity.json exists with ≥5 fixtures; aggregate ≥ 0.95; flips contract DRAFT → ACTIVE_RUNTIME (DISCHARGED at 30 fixtures, aggregate 1.0000) |
| FALSIFY-CCPA-014 | os_event_parity_bound | M115.4 | OS-level event parity (axis-2-closure-plan idea (2)): ccpa_differ::os_event_parity(teacher, student).score() ≥ 0.95 per fixture in fixtures/os-canonical/; bidirectional sensitivity: every fixture in fixtures/os-regression/ scores < 0.95 with non-empty drift records. Consumes ccpa_subproc::OsEvent records captured via ccpa-trace-subproc strace wrapper. (DISCHARGED at v1.25.0 / companion-repo M141: 3 canonical + 1 regression fixtures, threshold 0.95) |
| FALSIFY-CCPA-015 | ccpa_trace_subproc_output_purity | M147 | Every line emitted to stdout by ccpa-trace-subproc MUST decode as a ccpa_subproc::OsEvent JSON object; subprocess stdout MUST be redirected to Stdio::null() (not Stdio::inherit()) to prevent the wrapped process's prose from corrupting the capture stream. Test: cargo test -p ccpa-subproc --test falsify_ccpa_015_output_purity. (PROPOSED at v1.25.0 / M147; ACTIVE_RUNTIME at v1.26.0 / M164.) |
| FALSIFY-CCPA-016 | outcome_parity_bound | M152 | Phase 3 P3.4 outcome parity: aggregate agreement on a MultiPL-E-Rust-class corpus ≥ 0.5 (POC-tier threshold); per-fixture exit-code consistency; bidirectional sensitivity via synthetic regression (< 0.5 fails) + synthetic identity (1.0 passes) fixtures. Source of truth: evidence/phase-3/multipl-e-rust-scores.json. Test: cargo test -p ccpa-differ --test falsify_ccpa_016_outcome_parity. (PROPOSED at v1.25.0 / M152; ACTIVE_RUNTIME at v1.26.0 / M164; current evidence: agreement = 1.0000 over 5 HumanEval/0..4 fixtures from companion-repo M150.) |
| FALSIFY-CCPA-017 | project_scale_parity_bound | M188 | Phase 4 P4.4 project-scale parity: aggregate partial_agreement >= 0.3 AND files_jaccard_corpus >= 0.3 on a multi-file Cargo-workspace task corpus drawn from real GitHub issues (companion-repo M182: fixtures/project-scale/ initially 5 fixtures across paiml/decy + paiml/bashrs + paiml/depyler). Bidirectional sensitivity via synthetic identity (passes) + synthetic regression (fails) + empty-corpus (fails by design) + threshold-boundary fixtures. Source of truth: evidence/phase-4/project-scale-scores.json. Test: cargo test -p ccpa-differ --test falsify_ccpa_017_project_scale_parity (7 active synthetic + 1 #[ignore]'d live-evidence). (PROPOSED at v1.28.0 / M188 + M190; ACTIVE_RUNTIME pending first operator-dispatched bench via bash scripts/phase-4-bench.sh.) |
| FALSIFY-CCPA-018 | arena_recovery_rate_bound | M204 | Phase 5 P5.4 Arena recovery-rate: aggregate recovery_rate >= 0.5 AND oracle_passed_rate >= 0.3 on the M182 project-scale fixture corpus driven via the live multi-turn Arena harness (crates/ccpa-arena/). recovery_rate := OraclePassed AND any_bash_failure_in_history, a direct signal for design-audit.md M192 R3 (recovery over zero-shot determinism). The asymmetric give-up-fast synthetic fixture (100% pass rate BUT zero recovery) FAILS the gate, distinguishing CCPA-018 (agent quality) from CCPA-017 (functional outcome). Source of truth: evidence/phase-5/arena-scores.json. Test: cargo test -p ccpa-arena --test falsify_ccpa_018_arena_recovery_rate (7 active synthetic + 1 #[ignore]'d live-evidence). (PROPOSED at v1.29.0 / M204 + M208; ACTIVE_RUNTIME pending first operator-dispatched Arena bench via bash scripts/phase-5-arena-bench.sh.) |
| FALSIFY-CCPA-019 | calibration_required_before_verdict | M236 | Calibration-required-before-verdict gate (Phase 5b harness hardening): any final outcome-parity verdict for CCPA-016/017/018 (when promoted PROPOSED → ACTIVE_RUNTIME, OR when an evidence file is treated as discharging the gate) MUST be preceded by a successful calibration run. A successful run = evidence/calibration/calibration-runs.json contains a record with identity_pass = true AND regression_fail = true AND passed_at within FRESHNESS_WINDOW_DAYS (30) of now. Codifies the M196-M224 root cause: 4-bug stack (apr-serve leak, claude permission denial, missing cwd, prose-vs-JSON parse mismatch) survived 14 milestones to M224 because every prior validation used MockDriver only. Bidirectional sensitivity (identity_pass AND regression_fail BOTH required) catches the degenerate "meter always passes" + "meter always fails" cases. Source of truth: evidence/calibration/calibration-runs.json. Test: cargo test -p ccpa-differ --test falsify_ccpa_019_calibration (7 active synthetic + 1 #[ignore]'d live-evidence). (PROPOSED at v1.31.0 / M236 companion-led; v1.32.0 / M270 aprender catch-up + mirror via aprender#1794; ACTIVE_RUNTIME when companion CI gate enforces this before any CCPA-016/017/018 ACTIVE_RUNTIME flip.) |
| FALSIFY-CCPA-020 | contract_compliance_per_turn | M258 | Contract compliance per-turn gate (Phase 6 P6.5 under-contract methodology): any session marked ArenaOutcome::OraclePassed under the Phase 6 under-contract regime (ArenaSession::with_compliance(N) set, the under-contract dispatch path active) MUST have compliance_check.pmat_ok == true on EVERY ToolResult::FileMutated turn that carried a Some(ComplianceCheck). Non-pass outcomes (ComplianceFailed, ComplianceTrap, OracleFailedAfterMaxTurns, WallTimeout, DriverError) trivially satisfy the invariant; it only constrains the pass case. Phase 5 sessions (compliance_check = None on every FileMutated record) vacuously satisfy. Bidirectional sensitivity (per CCPA-019): identity case (clean-history-with-pass MUST satisfy) + regression case (pass-with-failing-compliance-turn MUST be falsified, represents a future regression where the loop accidentally accepts a pass despite mid-session compliance failures). Source of truth: evidence/under-contract/scores.json. Test: cargo test -p ccpa-arena --test falsify_ccpa_020_contract_compliance (7 active synthetic + 1 #[ignore]'d live-evidence). (PROPOSED at v1.32.0 / M270 via aprender#1794 squash ea2048b89; ACTIVE_RUNTIME pending first operator-dispatched Phase 6 bench producing evidence/under-contract/scores.json AND a CCPA-019 calibration record within freshness window.) |
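
The CCPA-019 success condition lends itself to a short worked sketch. The Rust below is a minimal illustration under stated assumptions, not the code in ccpa-differ: the `CalibrationRun` struct, the `passed_at_day` day-count field, and both function names are hypothetical, the 30-day boundary is treated as inclusive, and records dated in the future are rejected. None of those details is specified by the gate text.

```rust
/// Freshness window named by FALSIFY-CCPA-019, in days.
pub const FRESHNESS_WINDOW_DAYS: i64 = 30;

/// Hypothetical, simplified record from evidence/calibration/calibration-runs.json.
/// `passed_at_day` is a day count since an arbitrary epoch (an assumption;
/// the real timestamp format is not described in the gate text).
#[derive(Debug, Clone, Copy)]
pub struct CalibrationRun {
    pub identity_pass: bool,
    pub regression_fail: bool,
    pub passed_at_day: i64,
}

/// A run counts as successful only if BOTH sensitivity directions hold
/// and the run is no older than the freshness window.
pub fn is_successful(run: &CalibrationRun, now_day: i64) -> bool {
    let age_days = now_day - run.passed_at_day;
    run.identity_pass
        && run.regression_fail
        // Assumption: boundary is inclusive and future-dated runs are rejected.
        && (0..=FRESHNESS_WINDOW_DAYS).contains(&age_days)
}

/// A verdict for CCPA-016/017/018 is allowed when at least one
/// successful calibration run exists.
pub fn verdict_allowed(runs: &[CalibrationRun], now_day: i64) -> bool {
    runs.iter().any(|run| is_successful(run, now_day))
}
```

Requiring both flags is what rules out the two degenerate meters: one that always passes has regression_fail = false, and one that always fails has identity_pass = false, so neither can unlock a verdict.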

Each gate maps to one falsification test in crates/ccpa-*/tests/falsify_ccpa_NNN_*.rs and is enforced via pv validate contracts/claude-code-parity-apr-v1.yaml per the harness policy in CLAUDE.md § Contract Validation: DOGFOOD pv, NEVER bash. No bash/yq/python re-implementation of these gates is permitted. If pv validate does not yet support a needed shape, extend aprender-contracts/src/schema/ — schema-extension ticket: PMAT-CONTRACTS-CCPA-001.
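
To give a sense of what such a falsification test exercises, here is a sketch of the CCPA-020 invariant in Rust. The type and variant names follow the gate text, but their definitions are simplified stand-ins and `invariant_holds` is a hypothetical helper; the real types in crates/ccpa-arena/ may carry more fields.

```rust
/// Simplified stand-in for the per-turn compliance result (assumption).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ComplianceCheck {
    pub pmat_ok: bool,
}

/// Simplified stand-in: only the variant relevant to CCPA-020 is modeled in detail.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ToolResult {
    FileMutated { compliance_check: Option<ComplianceCheck> },
    Other,
}

/// Outcome names as listed in the gate text.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ArenaOutcome {
    OraclePassed,
    ComplianceFailed,
    ComplianceTrap,
    OracleFailedAfterMaxTurns,
    WallTimeout,
    DriverError,
}

/// CCPA-020 invariant: a passing session must have pmat_ok == true on every
/// FileMutated turn that carried a Some(ComplianceCheck).
pub fn invariant_holds(outcome: ArenaOutcome, turns: &[ToolResult]) -> bool {
    // Non-pass outcomes trivially satisfy the invariant.
    if outcome != ArenaOutcome::OraclePassed {
        return true;
    }
    turns.iter().all(|turn| match turn {
        ToolResult::FileMutated { compliance_check: Some(check) } => check.pmat_ok,
        // Phase 5 turns (None) and non-mutating turns are vacuously fine.
        _ => true,
    })
}
```

A test built on this helper would pair an identity case (a passing session with clean compliance history satisfies the invariant) with a regression case (a passing session containing one failing compliance turn is falsified), mirroring the bidirectional sensitivity the gates require.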