Outcome-parity plan (M149, 2026-05-12)

Top spec: claude-code-parity-apr-poc.md | Phase 2 execution plan | Axis-2 closure plan | Completeness assessment

Operator reframe (2026-05-12): "so we can ask apr code to generate same code as claude code and 'it works'".

This shifts the parity question from procedural parity (do both systems make the same syscalls? — measured by OS-level Jaccard at M148 = 0.3333) to outcome parity (do both systems produce code that does the same thing? — measured by "does it compile and pass a test?"). Outcome parity is closer to what a user actually cares about: "I asked for X; did I get something that does X?".

Re-positioning the 5 closure-plan ideas

The axis-2-closure-plan.md M113 brainstorm evaluated 5 paths:

| # | Idea | Measures | Status |
|---|------|----------|--------|
| 1 | HTTPS proxy | API-level message traces | Phase 1 OOS (M118 prior art validates feasibility) |
| 2 | CLI subprocess instrumentation (strace) | OS-level syscalls (procedural parity) | Phase 2 shipped M136-M148; first score 0.3333 |
| 3 | SWE-bench differential evaluation | Outcome parity on real GitHub issues | M149 reframe → primary direction |
| 4 | Metamorphic relations | Behavioral invariants | Backlog |
| 5 | Statistical fingerprint | Population-level skew | Cheap health check |

The operator's reframe elevates idea (3) — but simpler than SWE-bench: instead of 2,294 curated GitHub issues, start with a tiny self-contained code-gen prompt set, run both systems, run the generated code, check it works. This is idea (3.1) lightweight outcome parity — the cheapest path to a user-meaningful parity number.

Why outcome parity is the right primary measure

Procedural parity (M148 score = 0.3333) tells us what each system did at OS level. That's diagnostic. But the parity claim users actually want is:

"If I ask apr code to do X, will I get the same outcome as if I ask claude code to do X?"

That's outcome parity. Two systems can have wildly different procedural traces (different libc, different exec patterns, different tmp-file naming) and still produce identical outcomes — the user wouldn't see the difference. Conversely, two systems can have nearly-identical OS traces and still produce different outcomes if they make different decisions inside their respective LLMs.

The 0.3333 OS-event score is interesting but not load-bearing for the "does apr code work the same as claude code" claim. Outcome parity is load-bearing.

What "outcome parity" means concretely

For a code-generation prompt (the bread-and-butter agent task):

  1. Functional outcome: does the generated code compile + pass a basic test?
  2. Semantic outcome: does it do what the prompt asked (verified by hand or by an oracle)?
  3. Diff outcome: are the generated diffs byte-identical or semantically equivalent (per arXiv:2310.06770 SWE-bench methodology)?

The cheapest first test: (1) functional outcome. Both systems generate code → both outputs are run through a compile + test harness → score = fraction of prompts where both pass.
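As a minimal sketch of that score, assuming the harness yields one exit code per system per prompt (the `PromptOutcome` type and `both_pass_score` name are illustrative, not taken from the repository):

```rust
/// Exit codes from the compile-and-test harness for one prompt.
/// Field names are hypothetical; 0 means "compiled and tests passed".
#[derive(Debug, Clone, Copy)]
pub struct PromptOutcome {
    pub claude_exit: i32,
    pub apr_exit: i32,
}

/// Fraction of prompts where BOTH systems exit 0.
/// Returns None for an empty corpus so "no data" is never read as a score.
pub fn both_pass_score(outcomes: &[PromptOutcome]) -> Option<f64> {
    if outcomes.is_empty() {
        return None;
    }
    let both = outcomes
        .iter()
        .filter(|o| o.claude_exit == 0 && o.apr_exit == 0)
        .count();
    Some(both as f64 / outcomes.len() as f64)
}
```

Under this definition a prompt where only one system passes counts against the score exactly as much as a prompt where both fail; separating those two cases would need an additional metric.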

P3 plan — Outcome-parity sub-deliverables

(Numbered P3.x to distinguish from P2.x OS-level work.)

P3.1 — Outcome prompt corpus (M150)

Distinct from M144's OS-event prompts. Each fixture asks for VERIFIABLE code generation:

  • 0001-fib — "Write a Rust function fib(n: u32) -> u64 returning the nth Fibonacci number. Include #[test] asserting fib(10) == 55."
  • 0002-palindrome — "Write a function is_palindrome(s: &str) -> bool. Tests: 'racecar' → true, 'hello' → false."
  • 0003-fizzbuzz — "Write a function fizzbuzz(n: u32) -> Vec<String> returning lines 1..=n with the standard FizzBuzz rules."
  • 0004-binary-search — "Write bsearch(haystack: &[i32], needle: i32) -> Option<usize> using binary search."
  • 0005-multi-step — composite: "Write a small CLI in Rust that reads numbers from stdin, sorts them, and prints them. Include tests."
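For illustration, an output that would satisfy fixture 0001-fib might look like the following. This is a hand-written sketch, not output captured from either system, and the indexing convention (fib(0) = 0, fib(1) = 1) is an assumption that the prompt's fib(10) == 55 assertion happens to agree with.

```rust
/// Returns the nth Fibonacci number, with fib(0) = 0 and fib(1) = 1.
/// Iterative to avoid exponential recursion. The running pair also holds
/// fib(n + 1), so this form stays within u64 only up to n = 92.
pub fn fib(n: u32) -> u64 {
    let (mut a, mut b) = (0u64, 1u64);
    for _ in 0..n {
        let next = a + b;
        a = b;
        b = next;
    }
    a
}

#[cfg(test)]
mod tests {
    use super::fib;

    #[test]
    fn fib_10_is_55() {
        assert_eq!(fib(10), 55);
    }
}
```

The embedded #[test] is what the functional-outcome check exercises: `cargo test` exiting 0 is the pass signal.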

Each fixtures/outcome-prompts/<id>/:

  • prompt.txt — natural-language ask
  • oracle/Cargo.toml + oracle/src/lib.rs — REFERENCE solution that the test harness uses as a CHECK (separate from what the systems generate)
  • cwd-tree/ — starting state (empty for greenfield prompts)
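The layout above is easy to check mechanically. A minimal sketch, assuming the four entries listed are all required (the `missing_entries` helper is hypothetical and is not the repository's validator):

```rust
use std::path::Path;

/// Entries the plan lists for each fixture directory.
const REQUIRED: [&str; 4] = [
    "prompt.txt",
    "oracle/Cargo.toml",
    "oracle/src/lib.rs",
    "cwd-tree",
];

/// Return the required entries that are missing under `fixture_dir`.
/// An empty result means the layout is complete.
pub fn missing_entries(fixture_dir: &Path) -> Vec<&'static str> {
    REQUIRED
        .iter()
        .copied()
        .filter(|rel| !fixture_dir.join(rel).exists())
        .collect()
}
```

This only checks presence. It does not verify that the oracle crate builds or that cwd-tree/ is empty for greenfield prompts.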

P3.2 — Outcome runner (M151, operator-dispatched)

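root=$(pwd)  # fixture globs are relative; keep an absolute anchor for use after cd
for p in fixtures/outcome-prompts/*/; do
    id=$(basename "$p")
    fixture="$root/${p%/}"
    for system in claude apr; do
        cwd="/tmp/outcome-$id-$system"
        rm -rf "$cwd"   # start each run from a clean working directory
        mkdir -p "$cwd"
        cp -r "$fixture/cwd-tree/." "$cwd/"
        (
            # Subshell so the cd does not leak into later iterations.
            cd "$cwd" || exit 1
            # Invocation kept as written in the plan; adjust if the two CLIs differ.
            "$system" code -p "$(cat "$fixture/prompt.txt")"
            # Redirect stdout first, then send stderr to the same file.
            cargo test --quiet > "$fixture/$system.outcome.txt" 2>&1
            echo $? > "$fixture/$system.outcome.exit_code"
        )
    done
done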

For each prompt × system: collect the compile-and-test exit code. Score = fraction of prompts where both systems' generated code passes its own embedded tests.

P3.3 — Cross-output equivalence (M153 SHIPPED — line-set Jaccard only)

Partially SHIPPED at M153 as crates/ccpa-differ/src/outcome_diff.rs (cross_output_equivalence()) + crates/ccpa-differ/tests/phase_3_cross_output_equivalence.rs (integration test against live captures). It ships line-set Jaccard over trimmed non-empty lines: the simplest semantically meaningful similarity metric, robust to whitespace and comment-only churn. Live M150 evidence aggregate: 0.5201 over 5 fixtures (range 0.33–0.83); evidence checked in at evidence/phase-3/cross-output-equivalence.json.

Honest interpretation: the README's "nearly-byte-identical" claim about HumanEval_3 was a 1-fixture cherry-pick; aggregate similarity sits barely above the 0.5 POC threshold because HumanEval_1 (0.38), HumanEval_2 (0.45), and HumanEval_4 (0.33) show meaningful structural divergence between claude's and apr code's solutions even when both pass.
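For readers unfamiliar with the metric, line-set Jaccard over trimmed non-empty lines can be sketched as follows. This is an illustration of the technique only: `line_set_jaccard` is a hypothetical name, not the body of cross_output_equivalence(), and treating two empty inputs as identical is an assumption.

```rust
use std::collections::HashSet;

/// Collect the set of trimmed, non-empty lines in a source text.
fn line_set(src: &str) -> HashSet<&str> {
    src.lines().map(str::trim).filter(|l| !l.is_empty()).collect()
}

/// Jaccard similarity |A ∩ B| / |A ∪ B| over the two line sets.
/// Two empty inputs are treated as identical (1.0); that convention is an assumption.
pub fn line_set_jaccard(a: &str, b: &str) -> f64 {
    let (sa, sb) = (line_set(a), line_set(b));
    let union = sa.union(&sb).count();
    if union == 0 {
        return 1.0;
    }
    sa.intersection(&sb).count() as f64 / union as f64
}
```

Because the inputs are sets, line order and duplicate lines do not affect the score. Two solutions that differ in a single line each still share every other line in the intersection.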

Future P3.3 sub-deliverables (NOT in M153):

  • Files-touched Jaccard: which paths got written? (needs P2.3 OS-event captures, not just generated .src.rs)
  • Test-survival rate: if we swap the test files between the two outputs, do both still pass? SHIPPED at M154: scripts/phase-3-test-survival.sh + crates/ccpa-differ/tests/phase_3_test_survival_gate.rs + evidence/phase-3/test-survival.json. Live result: survival_rate = 1.0000 (10/10 swaps pass on the 5-fixture M150 corpus). On this corpus, that shows the M153 structural divergence is purely STYLISTIC: every test from either system passes against either implementation.
  • Diff similarity: text-level Levenshtein or AST-level diff between the two outputs (heavier weight; line-set Jaccard is the M153 stand-in)
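The test-survival count can be sketched as two cross-swaps per fixture, which is where 10 swaps on a 5-fixture corpus comes from. The `run` callback below is a caller-supplied stand-in for the real harness, and all names are hypothetical rather than taken from scripts/phase-3-test-survival.sh.

```rust
/// Which system an artifact came from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum System {
    Claude,
    Apr,
}

/// Run every cross-swap (tests from one system, implementation from the other)
/// and return (passed, total). `run` receives (fixture, tests_from, impl_from)
/// and reports whether the swapped combination passed.
pub fn test_survival<F>(fixtures: &[&str], mut run: F) -> (usize, usize)
where
    F: FnMut(&str, System, System) -> bool,
{
    let swaps = [(System::Claude, System::Apr), (System::Apr, System::Claude)];
    let mut passed = 0;
    let mut total = 0;
    for &fixture in fixtures {
        for (tests_from, impl_from) in swaps {
            total += 1;
            if run(fixture, tests_from, impl_from) {
                passed += 1;
            }
        }
    }
    (passed, total)
}
```

The survival rate is then passed / total. A rate of 1.0 says each system's tests accept the other's implementation; it says nothing about how thorough those tests are.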

P3.4 — FALSIFY-CCPA-016 outcome parity gate (M152 SHIPPED)

SHIPPED at M152 as crates/ccpa-differ/tests/falsify_ccpa_016_outcome_parity.rs. Threshold = 0.5 (POC-tier per the original P3.4 design note); 4 test assertions cover live-evidence pass + per-fixture exit-code consistency + synthetic-regression-below-threshold (bidirectional sensitivity) + synthetic-identity-passes (false-positive guard). Gate is DRAFT until P3.5 / M153+ contract bump registers CCPA-016 in claude-code-parity-apr-v1.yaml.
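The gate logic described above can be sketched as a single threshold check. The 0.5 threshold is from the plan; the function name, the inclusive comparison (score ≥ threshold), and the range check are assumptions made for illustration.

```rust
/// POC-tier threshold quoted in the plan.
pub const OUTCOME_PARITY_THRESHOLD: f64 = 0.5;

/// Pass when the outcome-parity score meets the threshold.
/// Scores outside [0, 1] (including NaN) are rejected rather than compared.
pub fn outcome_parity_gate(score: f64) -> Result<(), String> {
    if !(0.0..=1.0).contains(&score) {
        return Err(format!("score {} is outside [0, 1]", score));
    }
    if score >= OUTCOME_PARITY_THRESHOLD {
        Ok(())
    } else {
        Err(format!(
            "score {:.4} is below threshold {}",
            score, OUTCOME_PARITY_THRESHOLD
        ))
    }
}
```

The two synthetic cases map onto this directly: an identity run (score 1.0) must pass, and a regression below the threshold (for example 0.2) must fail. A gate that cannot fail on the second case is not measuring anything.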

P3.5 — Contract bump (M164 SHIPPED — v1.25.0 → v1.26.0)

SHIPPED at M164 (2026-05-13) via the M22 5-step ritual mirroring aprender#1665 (squash 9cbac28b5). v1.26.0 promotes both CCPA-015 (output purity, was PROPOSED at M147) and CCPA-016 (outcome parity bound, was PROPOSED at M152) from PROPOSED → ACTIVE_RUNTIME in the canonical gate registry. Gate count: 14 → 16.

P3.6 — Project-scale outcome parity (M161+ future-work)

Prior art: ProgramBench (Yang et al., Meta FAIR + Stanford + Harvard, 2026 — added at M159). 200 real-world programs (FFmpeg / SQLite / PHP interpreter / etc.) where LMs reconstruct a codebase from only the executable + documentation, scored via coverage-guided fuzzing-generated behavioral tests. Headline result: 0% of 200 tasks fully resolved; best model 95%-test-pass on 3% of tasks.

Relevance to CCPA: ProgramBench extends our M154 test-survival pattern from function-level (HumanEval/0..4) to project-level (rebuild SQLite). The methodology is:

  1. Reference: compile gold executable from a real GitHub repo.
  2. Test generation: agent-driven systematic exploration + coverage-guided iteration produces a behavioral-equivalence test suite from the executable.
  3. Reconstruction task: hand the candidate LM only the executable + docs; require a working codebase.
  4. Evaluation: candidate codebase passes/fails the hidden test suite.

Why this is the natural M161+ direction: M150-M154 demonstrated that outcome parity is real and measurable on function-level tasks (5 HumanEval problems, agreement 1.0000, test-survival 1.0000). The user-facing parity question scales up: "does apr code rebuild SQLite as well as claude code does?" That question is exactly ProgramBench's question for the multi-model case. M161+ would adapt their pipeline to a bilateral CCPA bench (teacher = claude, student = apr code, both run on the same ProgramBench task and scored against the same hidden test suite).

Future gate: hypothetical FALSIFY-CCPA-017 — project-scale outcome-parity bound on a ProgramBench-class corpus. Threshold TBD; if M161+ surfaces ProgramBench's 0% saturation pattern, CCPA-017 would gate at "any non-zero pass" (POC) or "match within 1pp of claude" (production-tier).
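The two candidate tiers can be written down as predicates. Since CCPA-017 is hypothetical and its threshold is TBD, everything here is one possible reading: in particular, "within 1pp of claude" is interpreted as "no more than one percentage point below claude", so apr doing better than claude also passes.

```rust
/// Tier names follow the plan; the comparison rules are one reading of its wording.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Poc,
    Production,
}

/// `apr_rate` and `claude_rate` are pass rates in [0, 1] on the same corpus.
pub fn project_scale_gate(tier: Tier, apr_rate: f64, claude_rate: f64) -> bool {
    match tier {
        // "any non-zero pass"
        Tier::Poc => apr_rate > 0.0,
        // "match within 1pp of claude", read as a one-sided lower bound
        Tier::Production => apr_rate >= claude_rate - 0.01,
    }
}
```

A symmetric reading (absolute difference at most 1pp) would fail apr for outperforming claude, which is why the one-sided form is used in this sketch.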

Caveats inherited from ProgramBench:

  • Project-scale tasks are operator-dispatched (~6h wall-clock per fixture per system in ProgramBench).
  • Test-suite quality matters: ProgramBench reports 3.7% dummy-pass-rate with linting vs 18.5% without — naïve test suites give false positives.
  • Architectural divergence is the rule, not exception: ProgramBench found 67% of models prefer shallower directory depths than reference code; single-file monolithic implementations dominate. The M153 line-set Jaccard (0.5201 on functions) would likely score even lower at project scale.

Comparing the two parity tracks

| Track | Measures | First number | What it tells us |
|-------|----------|--------------|------------------|
| P2 OS-level Jaccard | Syscall set overlap | 0.3333 (M148) | Procedural divergence; diagnostic |
| P3 outcome parity | Does generated code work? | TBD (M150-M153) | User-facing equivalence; load-bearing for "apr code works like claude code" |

Both have value. P2 catches procedural drift (e.g. "apr writes to a different tmp dir") — useful for debugging. P3 catches outcome drift (e.g. "apr's code doesn't compile") — useful for shipping. P3 is what matters for the operator's question.

Implementation blocker (M150-M154 update — RESOLVED via aprender#1638)

Original M149 framing (preserved as audit trail): "Both P2.3 capture AND P3.2 outcome runner need apr code to actually exist as an invokable CLI. That's M3.1 / PMAT-CODE-LLM-DRIVER-PUBLIC-001 — pending in aprender. Until that lands, P3 sub-deliverables M151+ stay in 'designed, not run' state."

M150 finding (2026-05-12): this framing was WRONG. M3.1 / PMAT-CODE-LLM-DRIVER-PUBLIC-001 is about LlmDriver trait visibility (pub(crate) → pub), which was already satisfied. The real blocker was a feature-flag config in apr-cli/Cargo.toml: code = ["dep:batuta"] gated the Code { ... } subcommand variant behind --features code, so the default cargo install apr-cli produced a binary WITHOUT apr code. aprender#1638 removes the gate (code = [] + batuta becomes non-optional); the default build now ships apr code. M150-M154 SHIPPED the full Phase 3 outcome-parity sequence using a locally built apr with this gate removed; the upstream PR is the formalization, not a prerequisite.
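The mechanism behind this blocker is ordinary Cargo feature gating: an item marked #[cfg(feature = "code")] is compiled out unless the feature is enabled. A simplified sketch, where the `Command` enum and `available_subcommands` helper are hypothetical stand-ins and not apr-cli's actual types:

```rust
/// Hypothetical, simplified stand-in for a CLI subcommand enum.
#[derive(Debug)]
pub enum Command {
    Run,
    /// Only compiled when the crate is built with `--features code`.
    #[cfg(feature = "code")]
    Code,
}

/// Names of the subcommands present in this build.
pub fn available_subcommands() -> Vec<&'static str> {
    let mut names = vec!["run"];
    // cfg! evaluates at compile time to true or false for this build.
    if cfg!(feature = "code") {
        names.push("code");
    }
    names
}
```

In a build without the feature, the Code variant does not exist at all, so the binary has no such subcommand to dispatch to. That matches the symptom described above: the install succeeds, but apr code is absent.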

Phase 3 sub-deliverable status post-M210:

  • M150 P3.1 outcome-prompt corpus: SHIPPED at fixtures/multipl-e-rust/ (initially 5 HumanEval fixtures; M168 extended to 21 fixtures HumanEval/0..20).
  • M150 P3.2 outcome runner: SHIPPED at scripts/phase-3-bench.sh — real bilateral bench produced evidence/phase-3/multipl-e-rust-scores.json with agreement = 1.0000 on the original 5 fixtures; recalibrated curve awaits next operator dispatch on the 21-fixture corpus.
  • M153 P3.3 line-set Jaccard: SHIPPED at crates/ccpa-differ/src/outcome_diff.rs + evidence/phase-3/cross-output-equivalence.json (aggregate 0.5201).
  • M154 P3.3 test-survival: SHIPPED at scripts/phase-3-test-survival.sh + evidence/phase-3/test-survival.json (1.0000 — 10/10 cross-swaps pass).
  • M152 P3.4 FALSIFY-CCPA-016 gate: SHIPPED at crates/ccpa-differ/tests/falsify_ccpa_016_outcome_parity.rs (threshold 0.5, bidirectional sensitivity).
  • M164 P3.5 contract bump v1.25.0 → v1.26.0: SHIPPED via M22 5-step ritual (aprender#1665 squash 9cbac28b5); CCPA-015 + CCPA-016 promoted PROPOSED → ACTIVE_RUNTIME.
  • M167 contract bump v1.26.0 → v1.27.0: SHIPPED (aprender#1666 squash e4b673336); CCPA-013 OPEN → ACTIVE_RUNTIME + fixture_corpus_path extended to accept evidence/phase-3/captures/.
  • M172 fixture corpus structural validation: SHIPPED at crates/ccpa-differ/tests/fixture_corpus_structure.rs (4 tests; <1s default cargo test).
  • M174 fixture deep-correctness validator: SHIPPED at scripts/validate-fixtures.sh (21/21 PASS in ~30s).
  • M176 pre-commit hook integration: SHIPPED in scripts/install-hooks.sh (conditional fire on fixture-touching commits).

Remaining Phase 3 future work (M178+):

  • Operator-dispatched recalibrated bench: bash scripts/phase-3-bench.sh on the 21-fixture corpus produces the calibrated pass@1 + agreement curve (M168 corpus now ready).
  • Bench expansion 21 → full 164 MultiPL-E-Rust problems (operator-dispatched ~5-10 hour wall on a fast host).
  • AST-level diff sub-metric (line-set Jaccard is the M153 stand-in; syn-based AST diff is heavier-signal future work).
  • P3.6 project-scale parity gate (CCPA-017 candidate) — ProgramBench prior-art (M159) at arXiv:2605.03546 reports 0%/200 fully-resolved across Claude Opus/Sonnet/Haiku + GPT + Gemini, validating the "function-level 1.0 does not extrapolate to project-scale" caveat.

Cross-refs

  • phase-2-execution-plan.md — P2.x OS-level (procedural)
  • axis-2-closure-plan.md — original M113 5-idea brainstorm
  • completeness-assessment.md — § "Are we at parity with Claude Code?"
  • evidence/phase-2/measured-os-parity.json — first procedural parity number (M148)
  • aprender PMAT-CODE-LLM-DRIVER-PUBLIC-001 — the blocker for both P2.3 student capture AND P3.2 outcome run (M150 finding: WAS NOT the actual blocker — see § Implementation blocker above)
  • aprender#1638 — feature-flag-removal PR that ships apr code in default cargo install apr-cli build (the real blocker for upstream ergonomics; locally workaroundable)