Falsification conditions (20 gates total)

Top spec: claude-code-parity-apr-poc.md

20 falsifiable gates: 4 source-of-truth invariants (CCPA-009..012, M0+) and 16 behavioral parity / process gates. The 16 are: CCPA-001..008 and CCPA-013 (M1..M11); CCPA-014 (M115.4, axis-2-closure-plan); CCPA-015 (M147, trace-subproc output purity); CCPA-016 (M152, Phase 3 outcome parity); CCPA-017 (M188, Phase 4 project-scale parity); CCPA-018 (M204, Phase 5 Arena recovery rate); CCPA-019 (M236, calibration required before verdict); and CCPA-020 (M258, Phase 6 contract compliance per turn). All gates are asserted via pv validate contracts/claude-code-parity-apr-v1.yaml per CLAUDE.md § Contract Validation: DOGFOOD pv, NEVER bash. See invariants.md for invariants 1-4.

Source-of-truth invariants (M0+)

ID	Name	Phase	Mechanically asserted by
FALSIFY-CCPA-009	ci_main_branch_green	M0+	`gh api repos/paiml/claude-code-parity-apr/branches/main/protection` returns `ci/gate` ∈ required contexts
FALSIFY-CCPA-010	pmat_comply_100pct	M0+	`pmat comply check --json` returns `compliance_pct == 100.0 ∧ total_violations == 0`
FALSIFY-CCPA-011	line_coverage_100pct	M0+	`cargo llvm-cov --fail-under-lines 100 --fail-uncovered-lines 0` exits 0
FALSIFY-CCPA-012	pv_contract_gate_on_commit	M0+	pre-commit hook + CI both run `pv validate contracts/claude-code-parity-apr-v1.yaml`, exit 0

Behavioral parity / process gates (M1..M258)

ID	Name	Phase	Assertion summary
FALSIFY-CCPA-001	trace_schema_roundtrip	M1	every fixture parses, re-serializes byte-identical, validates against `trace_schema`
FALSIFY-CCPA-002	replay_determinism	M3	replaying the same fixture twice → byte-identical student traces (after normalization)
FALSIFY-CCPA-003	mock_completeness	M3	`RecordedDriver` consumes exactly `len(teacher.assistant_turns)` responses; none missing, none extra
FALSIFY-CCPA-004	tool_call_equivalence	M4	per turn, multiset of `(tool_name, semantic_input)` pairs in student matches teacher under per-tool equivalence rules (Edit: post-state sha256; Bash: normalized command; etc.)
FALSIFY-CCPA-005	file_mutation_equivalence	M4	union diff over CWD after `apr code` finishes equals union diff after Claude Code finished, modulo per-filetype canonicalization
FALSIFY-CCPA-006	sovereignty_on_replay	M5	zero outbound sockets to `*.anthropic.com` during replay; CI test container drops all egress except 127.0.0.1
FALSIFY-CCPA-007	corpus_coverage	M5	≥1 fixture per non-MISSING row of `apr-code-parity-v1.yaml` (currently 17 of 21)
FALSIFY-CCPA-008	parity_score_bound	M6	aggregate `parity_score ≥ 0.95` and per-fixture `≥ 0.80`. SOFT-DEPRECATED at M230 and set to status ADVISORY at v1.30.0 / M232, following the M224 Popperian StaticFalsified verdict. The gate still enforces the score thresholds on the 30 AUTHORED canonical fixtures, but the 1.0000 result is now interpreted as meter validation (the differ correctly recognizes equivalent traces), NOT as system-level parity validation. User-facing parity claims move to CCPA-016 (function-scale outcome), CCPA-017 (project-scale partial progress) and CCPA-018 (Arena recovery rate). See static-fixture-deprecation.md for the full reframe and audit trail. The contract status field is annotated `ADVISORY` in the CCPA-008 summary in upstream aprender v1.30.0 (aprender#1735).
FALSIFY-CCPA-013	first_recorded_parity_score	M11	`fixtures/canonical/measured-parity.json` exists with ≥5 fixtures; aggregate ≥ 0.95; flips contract DRAFT → ACTIVE_RUNTIME (DISCHARGED at 30 fixtures, aggregate 1.0000)
FALSIFY-CCPA-014	os_event_parity_bound	M115.4	OS-level event parity (axis-2-closure-plan idea (2)): `ccpa_differ::os_event_parity(teacher, student).score() ≥ 0.95` per fixture in `fixtures/os-canonical/`; bidirectional sensitivity: every fixture in `fixtures/os-regression/` scores `< 0.95` with non-empty drift records. Consumes `ccpa_subproc::OsEvent` records captured via `ccpa-trace-subproc` strace wrapper. (DISCHARGED at v1.25.0 / companion-repo M141 — 3 canonical + 1 regression fixtures, threshold 0.95)
FALSIFY-CCPA-015	ccpa_trace_subproc_output_purity	M147	Every line emitted to stdout by `ccpa-trace-subproc` MUST decode as a `ccpa_subproc::OsEvent` JSON object; subprocess stdout MUST be redirected to `Stdio::null()` (not `Stdio::inherit()`) to prevent the wrapped process's prose from corrupting the capture stream. Test: `cargo test -p ccpa-subproc --test falsify_ccpa_015_output_purity`. (PROPOSED at v1.25.0 / M147; ACTIVE_RUNTIME at v1.26.0 / M164.)
FALSIFY-CCPA-016	outcome_parity_bound	M152	Phase 3 P3.4 outcome parity: aggregate `agreement` on a MultiPL-E-Rust-class corpus ≥ 0.5 (POC-tier threshold); per-fixture exit-code consistency; bidirectional sensitivity via synthetic regression (`< 0.5` fails) + synthetic identity (`1.0` passes) fixtures. Source of truth: `evidence/phase-3/multipl-e-rust-scores.json`. Test: `cargo test -p ccpa-differ --test falsify_ccpa_016_outcome_parity`. (PROPOSED at v1.25.0 / M152; ACTIVE_RUNTIME at v1.26.0 / M164; current evidence: agreement = 1.0000 over 5 HumanEval/0..4 fixtures from companion-repo M150.)
FALSIFY-CCPA-017	project_scale_parity_bound	M188	Phase 4 P4.4 project-scale parity: aggregate `partial_agreement >= 0.3` AND `files_jaccard_corpus >= 0.3` on a multi-file Cargo-workspace task corpus drawn from real GitHub issues (companion-repo M182: `fixtures/project-scale/` initially 5 fixtures across paiml/decy + paiml/bashrs + paiml/depyler). Bidirectional sensitivity via synthetic identity (passes) + synthetic regression (fails) + empty-corpus (fails by design) + threshold-boundary fixtures. Source of truth: `evidence/phase-4/project-scale-scores.json`. Test: `cargo test -p ccpa-differ --test falsify_ccpa_017_project_scale_parity` (7 active synthetic + 1 `#[ignore]`'d live-evidence). (PROPOSED at v1.28.0 / M188 + M190; ACTIVE_RUNTIME pending first operator-dispatched bench via `bash scripts/phase-4-bench.sh`.)
FALSIFY-CCPA-018	arena_recovery_rate_bound	M204	Phase 5 P5.4 Arena recovery-rate: aggregate `recovery_rate >= 0.5` AND `oracle_passed_rate >= 0.3` on the M182 project-scale fixture corpus driven via the live multi-turn Arena harness (`crates/ccpa-arena/`). `recovery_rate := OraclePassed AND any_bash_failure_in_history` — direct signal for design-audit.md M192 R3 (recovery over zero-shot determinism). The asymmetric give-up-fast synthetic fixture (100% pass rate BUT zero recovery) FAILS the gate, distinguishing CCPA-018 (agent quality) from CCPA-017 (functional outcome). Source of truth: `evidence/phase-5/arena-scores.json`. Test: `cargo test -p ccpa-arena --test falsify_ccpa_018_arena_recovery_rate` (7 active synthetic + 1 `#[ignore]`'d live-evidence). (PROPOSED at v1.29.0 / M204 + M208; ACTIVE_RUNTIME pending first operator-dispatched Arena bench via `bash scripts/phase-5-arena-bench.sh`.)
FALSIFY-CCPA-019	calibration_required_before_verdict	M236	Calibration-required-before-verdict gate (Phase 5b harness hardening): any final outcome-parity verdict for CCPA-016/017/018 — when promoted PROPOSED → ACTIVE_RUNTIME, OR when an evidence file is treated as discharging the gate — MUST be preceded by a successful calibration run. A successful run = `evidence/calibration/calibration-runs.json` contains a record with `identity_pass = true` AND `regression_fail = true` AND `passed_at` within `FRESHNESS_WINDOW_DAYS` (30) of now. Codifies the M196-M224 root cause: 4-bug stack (apr-serve leak, claude permission denial, missing cwd, prose-vs-JSON parse mismatch) survived 14 milestones to M224 because every prior validation used MockDriver only. Bidirectional sensitivity (identity_pass AND regression_fail BOTH required) catches the degenerate "meter always passes" + "meter always fails" cases. Source of truth: `evidence/calibration/calibration-runs.json`. Test: `cargo test -p ccpa-differ --test falsify_ccpa_019_calibration` (7 active synthetic + 1 `#[ignore]`'d live-evidence). (PROPOSED at v1.31.0 / M236 companion-led; v1.32.0 / M270 aprender catch-up + mirror via aprender#1794; ACTIVE_RUNTIME when companion CI gate enforces this before any CCPA-016/017/018 ACTIVE_RUNTIME flip.)
FALSIFY-CCPA-020	contract_compliance_per_turn	M258	Contract compliance per-turn gate (Phase 6 P6.5 under-contract methodology): any session marked `ArenaOutcome::OraclePassed` under the Phase 6 under-contract regime (`ArenaSession::with_compliance(N)` set, the under-contract dispatch path active) MUST have `compliance_check.pmat_ok == true` on EVERY `ToolResult::FileMutated` turn that carried a `Some(ComplianceCheck)`. Non-pass outcomes (`ComplianceFailed`, `ComplianceTrap`, `OracleFailedAfterMaxTurns`, `WallTimeout`, `DriverError`) trivially satisfy the invariant — it only constrains the pass case. Phase 5 sessions (`compliance_check = None` on every `FileMutated` record) vacuously satisfy. Bidirectional sensitivity (per CCPA-019): identity case (clean-history-with-pass MUST satisfy) + regression case (pass-with-failing-compliance-turn MUST be falsified, represents a future regression where the loop accidentally accepts a pass despite mid-session compliance failures). Source of truth: `evidence/under-contract/scores.json`. Test: `cargo test -p ccpa-arena --test falsify_ccpa_020_contract_compliance` (7 active synthetic + 1 `#[ignore]`'d live-evidence). (PROPOSED at v1.32.0 / M270 via aprender#1794 squash `ea2048b89`; ACTIVE_RUNTIME pending first operator-dispatched Phase 6 bench producing `evidence/under-contract/scores.json` AND a CCPA-019 calibration record within freshness window.)
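CCPA-003 pins the mock to an exact response budget. The following is a minimal sketch. It assumes a driver that simply replays the teacher's assistant turns in order. `RecordedDriverSketch` and `CompletenessError` are hypothetical names, not the real `RecordedDriver` API. An extra request fails immediately, and a shortfall is reported when the run finishes.

```rust
/// Hypothetical error type for the two ways CCPA-003 can be falsified.
#[derive(Debug, PartialEq)]
pub enum CompletenessError {
    /// The student asked for more responses than the teacher recorded.
    Extra { requested: usize, recorded: usize },
    /// The student stopped before consuming every recorded response.
    Missing { consumed: usize, recorded: usize },
}

/// Hypothetical stand-in for `RecordedDriver`: replays recorded assistant
/// responses in order and counts how many the student consumed.
pub struct RecordedDriverSketch {
    responses: Vec<String>,
    consumed: usize,
}

impl RecordedDriverSketch {
    pub fn new(responses: Vec<String>) -> Self {
        Self { responses, consumed: 0 }
    }

    /// Hands out the next recorded response; asking past the end is an "extra".
    pub fn next_response(&mut self) -> Result<&str, CompletenessError> {
        let recorded = self.responses.len();
        if self.consumed < recorded {
            let i = self.consumed;
            self.consumed += 1;
            Ok(self.responses[i].as_str())
        } else {
            Err(CompletenessError::Extra { requested: self.consumed + 1, recorded })
        }
    }

    /// Called once the student run ends: every recorded response must be used.
    pub fn finish(&self) -> Result<(), CompletenessError> {
        let recorded = self.responses.len();
        if self.consumed == recorded {
            Ok(())
        } else {
            Err(CompletenessError::Missing { consumed: self.consumed, recorded })
        }
    }
}
```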
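CCPA-004 compares tool calls per turn as multisets. Order is ignored, but the number of times each call occurs must match. The following is a minimal sketch. It assumes each call has already been reduced to a `(tool_name, semantic_input)` pair. `normalize_bash` is a hypothetical whitespace-collapsing stand-in for the real Bash normalization rule, which this section does not spell out.

```rust
use std::collections::HashMap;

/// One tool call reduced to `(tool_name, semantic_input)`.
pub type ToolCall = (String, String);

/// Hypothetical Bash normalization: collapse runs of whitespace.
pub fn normalize_bash(cmd: &str) -> String {
    cmd.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Counts how many times each distinct call occurs in a turn.
fn counts(calls: &[ToolCall]) -> HashMap<&ToolCall, usize> {
    let mut m = HashMap::new();
    for c in calls {
        *m.entry(c).or_insert(0) += 1;
    }
    m
}

/// True when teacher and student made the same multiset of calls in a turn.
pub fn turn_equivalent(teacher: &[ToolCall], student: &[ToolCall]) -> bool {
    teacher.len() == student.len() && counts(teacher) == counts(student)
}
```

A multiset comparison rather than a set comparison means a student that reads the same file twice where the teacher read it once is flagged.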
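CCPA-008 combines a per-fixture floor with an aggregate bound. The following is a minimal sketch. It assumes the aggregate is the arithmetic mean and that an empty corpus fails; neither point is stated above. Under the ADVISORY reframe, passing this check shows that the differ works, not that the system has reached parity.

```rust
/// Thresholds quoted in the CCPA-008 row.
pub const AGGREGATE_MIN: f64 = 0.95;
pub const PER_FIXTURE_MIN: f64 = 0.80;

/// Returns the aggregate score if the gate holds, or a reason if it is falsified.
/// Assumptions: aggregate = arithmetic mean; an empty corpus is a failure.
pub fn parity_score_bound(scores: &[(&str, f64)]) -> Result<f64, String> {
    if scores.is_empty() {
        return Err("empty corpus".to_string());
    }
    // Per-fixture floor: a single weak fixture falsifies the gate on its own.
    if let Some((name, score)) = scores.iter().find(|(_, s)| *s < PER_FIXTURE_MIN) {
        return Err(format!("{} scored {:.4} < {}", name, score, PER_FIXTURE_MIN));
    }
    let aggregate = scores.iter().map(|(_, s)| *s).sum::<f64>() / scores.len() as f64;
    if aggregate < AGGREGATE_MIN {
        Err(format!("aggregate {:.4} < {}", aggregate, AGGREGATE_MIN))
    } else {
        Ok(aggregate)
    }
}
```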
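CCPA-014, -016, -017, -019 and -020 share a bidirectional-sensitivity requirement: the meter must accept the identity cases and reject the regression cases. The following sketch covers the CCPA-014 shape. Here a regression fixture counts as caught only if it scores below 0.95 and also emits drift records. `FixtureResult` is a hypothetical record, not a `ccpa_differ` type. The score itself would come from `ccpa_differ::os_event_parity(teacher, student).score()`.

```rust
/// Threshold quoted in the CCPA-014 row.
pub const OS_PARITY_MIN: f64 = 0.95;

/// Hypothetical per-fixture result: the parity score plus any drift records.
pub struct FixtureResult {
    pub name: String,
    pub score: f64,
    pub drift: Vec<String>,
}

/// Canonical fixtures must pass; every regression fixture must both score
/// below the threshold and explain the failure through drift records.
pub fn os_event_parity_gate(
    canonical: &[FixtureResult],
    regression: &[FixtureResult],
) -> Result<(), String> {
    // Assumption: each direction needs at least one fixture to be meaningful.
    if canonical.is_empty() || regression.is_empty() {
        return Err("both fixture sets must be non-empty".to_string());
    }
    for f in canonical {
        if f.score < OS_PARITY_MIN {
            return Err(format!("canonical {} scored {:.4}", f.name, f.score));
        }
    }
    for f in regression {
        if f.score >= OS_PARITY_MIN || f.drift.is_empty() {
            return Err(format!("regression {} was not detected", f.name));
        }
    }
    Ok(())
}
```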
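The CCPA-015 redirection rule maps directly onto `std::process::Command`. The following sketch shows only the stdout wiring. `run_wrapped` is a hypothetical helper, and the strace capture that `ccpa-trace-subproc` actually performs is omitted.

```rust
use std::io;
use std::process::{Command, Output, Stdio};

/// Hypothetical helper: run a wrapped program with its stdout discarded.
pub fn run_wrapped(program: &str, args: &[&str]) -> io::Result<Output> {
    Command::new(program)
        .args(args)
        // Stdio::inherit() here would interleave the child's prose with the
        // OsEvent JSON lines on our own stdout; Stdio::null() discards it.
        .stdout(Stdio::null())
        .output()
}
```

Because `stdout` is set explicitly, `Output::stdout` comes back empty, even though `output()` would otherwise capture it.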
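The following sketch implements the CCPA-017 conjunction under stated assumptions. Both corpus-level numbers are taken as plain means over fixtures, since the section does not say how they are aggregated. `files_jaccard_corpus` is assumed to be built from the Jaccard index of the files each side touched. `ProjectFixture` is hypothetical. The empty-corpus branch mirrors the row's fails-by-design fixture.

```rust
use std::collections::BTreeSet;

/// Thresholds quoted in the CCPA-017 row.
pub const PARTIAL_MIN: f64 = 0.3;
pub const JACCARD_MIN: f64 = 0.3;

/// Jaccard index |A ∩ B| / |A ∪ B| of two touched-file sets.
pub fn jaccard<'a>(a: &BTreeSet<&'a str>, b: &BTreeSet<&'a str>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 1.0; // assumption: two empty sets count as identical
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Hypothetical per-fixture record for one multi-file task.
pub struct ProjectFixture<'a> {
    pub partial_agreement: f64,
    pub teacher_files: BTreeSet<&'a str>,
    pub student_files: BTreeSet<&'a str>,
}

pub fn project_scale_gate(corpus: &[ProjectFixture<'_>]) -> bool {
    if corpus.is_empty() {
        return false; // the empty corpus fails by design
    }
    let n = corpus.len() as f64;
    let partial = corpus.iter().map(|f| f.partial_agreement).sum::<f64>() / n;
    let files = corpus
        .iter()
        .map(|f| jaccard(&f.teacher_files, &f.student_files))
        .sum::<f64>()
        / n;
    partial >= PARTIAL_MIN && files >= JACCARD_MIN
}
```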
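CCPA-018 counts a session as a recovery only when the oracle passed after at least one Bash failure. The following sketch assumes both rates are taken over all sessions. It shows why the give-up-fast fixture fails: it passes every oracle but never recovers. `Outcome` and `Session` are hypothetical local types.

```rust
/// Hypothetical subset of Arena outcomes, enough to express CCPA-018.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Outcome {
    OraclePassed,
    OracleFailedAfterMaxTurns,
}

/// Hypothetical per-session summary.
pub struct Session {
    pub outcome: Outcome,
    /// Number of failed Bash tool calls seen in the session history.
    pub bash_failures: usize,
}

/// Returns (recovery_rate, oracle_passed_rate).
/// Assumption: both rates use the full session count as the denominator.
pub fn arena_rates(sessions: &[Session]) -> (f64, f64) {
    let n = sessions.len() as f64;
    let passed = sessions
        .iter()
        .filter(|s| s.outcome == Outcome::OraclePassed)
        .count() as f64;
    let recovered = sessions
        .iter()
        .filter(|s| s.outcome == Outcome::OraclePassed && s.bash_failures > 0)
        .count() as f64;
    (recovered / n, passed / n)
}

pub fn arena_gate(sessions: &[Session]) -> bool {
    if sessions.is_empty() {
        return false; // assumption: no sessions means no evidence
    }
    let (recovery, passed) = arena_rates(sessions);
    recovery >= 0.5 && passed >= 0.3
}
```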
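CCPA-019's freshness rule can be written as a single predicate. The following sketch assumes `passed_at` is stored as Unix seconds and that the 30-day window is inclusive. `CalibrationRun` is a hypothetical mirror of one record in `evidence/calibration/calibration-runs.json`. Taking `now` as a parameter keeps the check deterministic under test; `now_unix()` can supply it at run time.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

pub const FRESHNESS_WINDOW_DAYS: u64 = 30;
const SECS_PER_DAY: u64 = 24 * 60 * 60;

/// Hypothetical shape of one calibration record (`passed_at` in Unix seconds).
pub struct CalibrationRun {
    pub identity_pass: bool,
    pub regression_fail: bool,
    pub passed_at: u64,
}

/// True if at least one run is sensitive in both directions and fresh.
pub fn calibrated(runs: &[CalibrationRun], now: u64) -> bool {
    let window = FRESHNESS_WINDOW_DAYS * SECS_PER_DAY;
    runs.iter().any(|r| {
        r.identity_pass
            && r.regression_fail
            // assumption: timestamps from the future are rejected
            && r.passed_at <= now
            && now - r.passed_at <= window
    })
}

pub fn now_unix() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}
```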
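CCPA-020 constrains only the pass case, so it reduces to a short match. The following sketch uses hypothetical local mirrors of `ArenaOutcome`, `ToolResult::FileMutated` and `ComplianceCheck`, modeling only the fields this gate reads. Non-pass outcomes and `None` checks satisfy the invariant trivially. A pass with any failing check is falsified.

```rust
/// Hypothetical mirror of the compliance result attached to a mutation turn.
pub struct ComplianceCheck {
    pub pmat_ok: bool,
}

/// Hypothetical mirror of per-turn tool results; `Other` stands for every
/// non-mutating result.
pub enum ToolResult {
    FileMutated { compliance_check: Option<ComplianceCheck> },
    Other,
}

/// Outcome names taken from the CCPA-020 row.
pub enum ArenaOutcome {
    OraclePassed,
    ComplianceFailed,
    ComplianceTrap,
    OracleFailedAfterMaxTurns,
    WallTimeout,
    DriverError,
}

pub fn satisfies_ccpa_020(outcome: &ArenaOutcome, turns: &[ToolResult]) -> bool {
    match outcome {
        // Only a pass is constrained: every checked mutation must be pmat_ok.
        ArenaOutcome::OraclePassed => turns.iter().all(|t| match t {
            ToolResult::FileMutated { compliance_check: Some(c) } => c.pmat_ok,
            // Unchecked mutations (Phase 5) and other results pass vacuously.
            _ => true,
        }),
        // Non-pass outcomes trivially satisfy the invariant.
        _ => true,
    }
}
```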

Each gate maps to one falsification test in crates/ccpa-*/tests/falsify_ccpa_NNN_*.rs and is enforced via pv validate contracts/claude-code-parity-apr-v1.yaml per the harness policy in CLAUDE.md § Contract Validation: DOGFOOD pv, NEVER bash. No bash/yq/python re-implementation of these gates is permitted. If pv validate does not yet support a needed shape, extend aprender-contracts/src/schema/ — schema-extension ticket: PMAT-CONTRACTS-CCPA-001.
