The claude-code-parity-apr (CCPA) project has successfully implemented a rigorous, falsifiable record-replay-distill harness (M0-M190). The pivot from "procedural parity" (OS-level traces) to "outcome parity" (M149) correctly identified that users value functional outcomes over identical system calls. However, as the project scales towards multi-file, project-scale outcome parity (M188+), the existing static-fixture infrastructure risks becoming an engineering bottleneck rather than an accelerator.
- Hinton et al. 2015 (arXiv:1503.02531) - Distilling the Knowledge in a Neural Network:
  - Application: The project's foundational distillation framing.
  - Critique: While intellectually satisfying, distilling orchestration traces via static fixtures lacks the dynamic feedback of true model distillation.
- Cassano et al. 2022 (arXiv:2208.08227) - MultiPL-E: A Scalable and Polyglot Approach to Evaluating Neural Code Generation:
  - Application: Successfully utilized in M150 to prove 1.0000 function-level outcome parity.
- Jimenez et al. 2023 (arXiv:2310.06770) - SWE-bench: Can Language Models Resolve Real-World GitHub Issues?:
  - Application: Used as justification for differential evaluation, but highlighted that real tasks are non-deterministic and require execution feedback.
- Yang et al. 2026 (arXiv:2605.03546) - ProgramBench: Evaluating Language Models at the Project Scale:
  - Application: Crucial prior art demonstrating the cliff between function-level parity and project-scale reality (0% full resolution for top models). This suggests project-scale parity requires dynamic, multi-turn exploration, not just single-turn determinism.
The project relies on a RecordedDriver to artificially force deterministic replay of a "teacher's" steps.
Code Example: crates/ccpa-replayer/src/driver.rs

    impl LlmDriver for RecordedDriver {
        fn next_turn(&mut self) -> Result<NextTurn, ReplayError> {
            match self.turns.get(self.cursor) {
                Some(turn) => {
                    let out = turn.clone();
                    self.cursor = self.cursor.saturating_add(1);
                    Ok(out)
                }
                None => Err(ReplayError::DriverExhausted {
                    position: self.cursor,
                    total: self.turns.len(),
                }),
            }
        }
        // ...
    }

Five-Whys Root Cause Analysis:
- Why does RecordedDriver fail with DriverExhausted? Because the apr code agent made an unexpected tool call or didn't follow the exact same reasoning trajectory as the teacher.
- Why did it make an unexpected tool call? Because real LLM execution is non-deterministic, and different models (e.g., Qwen vs Claude) solve problems differently even if both are successful.
- Why does the test fail when this happens? Because FALSIFY-CCPA-003 (mock completeness) strictly mandates that the student consumes exactly the recorded teacher turns.
- Why does the project mandate exact consumption? Because the original M0 distillation framing (Hinton) assumed we needed to isolate "orchestration drift" by holding the LLM constant.
- Why is isolating orchestration drift a problem now? Because it overfits the agent to a single "golden path", penalizing valid self-correction and alternative valid strategies, which are essential for project-scale convergence (Yang et al. 2026).
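The failure chain above can be reproduced in isolation. The following is a minimal sketch, not the project's actual code: `NextTurn`, `ReplayError`, and `RecordedDriver` are simplified stand-ins for the real `ccpa-replayer` types, and the `replay` helper is hypothetical. It shows that a student needing one extra turn (for example, a self-correction step) exhausts the recording even if its final outcome would have been correct.

```rust
// Simplified, hypothetical stand-ins for the real ccpa-replayer types.
#[derive(Debug, Clone, PartialEq)]
pub struct NextTurn {
    pub tool_call: String,
}

#[derive(Debug, PartialEq)]
pub enum ReplayError {
    DriverExhausted { position: usize, total: usize },
}

pub struct RecordedDriver {
    turns: Vec<NextTurn>,
    cursor: usize,
}

impl RecordedDriver {
    pub fn new(turns: Vec<NextTurn>) -> Self {
        Self { turns, cursor: 0 }
    }

    /// Returns the next recorded teacher turn, or `DriverExhausted`
    /// once the student asks for more turns than were recorded.
    pub fn next_turn(&mut self) -> Result<NextTurn, ReplayError> {
        match self.turns.get(self.cursor) {
            Some(turn) => {
                let out = turn.clone();
                self.cursor = self.cursor.saturating_add(1);
                Ok(out)
            }
            None => Err(ReplayError::DriverExhausted {
                position: self.cursor,
                total: self.turns.len(),
            }),
        }
    }
}

/// Hypothetical helper: drives a student that needs `student_steps` turns
/// against a recording and returns the number of turns consumed, or the
/// first replay error.
pub fn replay(recorded: &[&str], student_steps: usize) -> Result<usize, ReplayError> {
    let turns: Vec<NextTurn> = recorded
        .iter()
        .map(|t| NextTurn { tool_call: t.to_string() })
        .collect();
    let mut driver = RecordedDriver::new(turns);
    for _ in 0..student_steps {
        driver.next_turn()?;
    }
    Ok(student_steps)
}
```

With a three-turn recording, `replay(&["read", "edit", "test"], 3)` succeeds, while a student that takes a fourth turn gets `DriverExhausted { position: 3, total: 3 }`, regardless of whether that fourth turn would have fixed a failing test.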
The project measures "structural equivalence" as a proxy for parity, but these metrics often miss the semantic outcome.
Code Example: crates/ccpa-differ/src/outcome_diff.rs

    pub fn cross_output_equivalence(teacher: &str, student: &str) -> CrossOutputReport {
        let teacher_set: std::collections::BTreeSet<&str> = teacher.lines().map(str::trim).filter(|l| !l.is_empty()).collect();
        let student_set: std::collections::BTreeSet<&str> = student.lines().map(str::trim).filter(|l| !l.is_empty()).collect();
        let common = teacher_set.intersection(&student_set).count();
        let union = teacher_set.union(&student_set).count();
        let lines_jaccard = if union == 0 { 1.0 } else { (common as f64) / (union as f64) };
        // ...
    }

Five-Whys Root Cause Analysis:
- Why does line-set Jaccard score 0.33 even when both tests pass (e.g. M153)? Because the student and teacher wrote different implementations (different variable names, different logic structure).
- Why does the Jaccard score penalize this? Because it is a purely structural metric comparing exact text lines rather than semantic behavior.
- Why use a structural metric for generated code? Because calculating true semantic equivalence (e.g., via formal verification or exhaustive fuzzing) is computationally expensive and difficult to build.
- Why not just rely on the tests passing? Because early phases (M148) focused on "procedural parity" (how it was done) rather than "outcome parity" (did it work).
- Why is this metric distracting? Because users care about functional code (Cassano et al. 2022). Engineering effort spent optimizing Jaccard similarity or OS-event Jaccard is wasted if the fundamental goal is simply to have a working, test-verified output.
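A small worked example makes the gap between structural and functional equivalence concrete. The sketch below reimplements only the line-set Jaccard step shown in the excerpt; `line_set`, `lines_jaccard`, and the `TEACHER` and `STUDENT` snippets are illustrative names and data, not taken from the project. Both snippets define an `add` function with identical behavior, yet they share only two of five distinct lines.

```rust
use std::collections::BTreeSet;

/// Collects the trimmed, non-empty lines of `text` into a set.
fn line_set(text: &str) -> BTreeSet<&str> {
    text.lines().map(str::trim).filter(|l| !l.is_empty()).collect()
}

/// Line-set Jaccard similarity: |intersection| / |union|,
/// defined as 1.0 when both inputs have no non-empty lines.
pub fn lines_jaccard(teacher: &str, student: &str) -> f64 {
    let teacher_set = line_set(teacher);
    let student_set = line_set(student);
    let common = teacher_set.intersection(&student_set).count();
    let union = teacher_set.union(&student_set).count();
    if union == 0 {
        1.0
    } else {
        common as f64 / union as f64
    }
}

// Illustrative data: two implementations with the same behavior.
pub const TEACHER: &str = "fn add(a: i32, b: i32) -> i32 {\n    a + b\n}\n";
pub const STUDENT: &str =
    "fn add(a: i32, b: i32) -> i32 {\n    let sum = a + b;\n    sum\n}\n";
```

Here the teacher contributes 3 distinct lines and the student 4; the signature and the closing brace are shared, so the score is 2 / 5 = 0.4 even though any test of `add` would pass for both.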
- Extreme Rigor: The 17 mechanical gates (pv validate) enforce a highly disciplined engineering culture.
- Clear Lexicon: The distinct vocabulary around "procedural parity", "outcome parity", and "structural equivalence" provides excellent analytical clarity.
- Overfitting to Golden Paths: The M2.3 rescope away from a live HTTPS proxy to hand-authored canonical fixtures risks overfitting.
- Diminishing Returns on Structural Parity: Metrics like OS-event Jaccard and line-set Jaccard are interesting but not load-bearing for user success.
Hypothesis: The static, mock-driven RecordedDriver replay infrastructure accurately predicts the live, end-to-end success rates of apr code on real multi-step engineering tasks.
Falsification Condition: If apr code achieves > 0.95 parity scores on the static canonical fixtures (FALSIFY-CCPA-008), but simultaneously scores near 0.0 on live, project-scale outcome benchmarks (e.g., ProgramBench tasks per FALSIFY-CCPA-017), the hypothesis is falsified. The static fixture approach would be proven inadequate as a predictive convergence metric, exposing that dynamic recovery and live LLM feedback loops are the true drivers of project-scale parity.
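The falsification condition can be written down as a mechanical check. This is a sketch under stated assumptions: the 0.95 static threshold comes from the text, but the text only says "near 0.0" for the live score, so the 0.05 cutoff below is an assumed value, and `ParityScores`, `Verdict`, and `evaluate_hypothesis` are hypothetical names rather than existing project APIs.

```rust
/// Scores for one evaluation cycle; both are fractions in [0.0, 1.0].
#[derive(Debug, Clone, Copy)]
pub struct ParityScores {
    /// Parity on the static canonical fixtures (FALSIFY-CCPA-008).
    pub static_fixture_parity: f64,
    /// Success rate on live, project-scale tasks (FALSIFY-CCPA-017).
    pub live_outcome_rate: f64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Falsified,
    NotFalsified,
}

/// Threshold stated in the falsification condition.
pub const STATIC_PARITY_THRESHOLD: f64 = 0.95;
/// ASSUMPTION: the text says "near 0.0"; 0.05 is an illustrative cutoff.
pub const LIVE_NEAR_ZERO_CUTOFF: f64 = 0.05;

/// The hypothesis is falsified when static parity is high while the
/// live outcome rate is near zero.
pub fn evaluate_hypothesis(scores: ParityScores) -> Verdict {
    let static_high = scores.static_fixture_parity > STATIC_PARITY_THRESHOLD;
    let live_near_zero = scores.live_outcome_rate <= LIVE_NEAR_ZERO_CUTOFF;
    if static_high && live_near_zero {
        Verdict::Falsified
    } else {
        Verdict::NotFalsified
    }
}
```

Fixing the "near 0.0" cutoff before running the live benchmark keeps the test falsifiable in the sense the project already applies to its other gates.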
To more quickly converge on a working, project-scale solution, we recommend the following tactical shifts:
- Deprecate Procedural/Structural Gates: Soft-deprecate FALSIFY-CCPA-014 (OS-event parity). Accept that apr code solves problems structurally differently than Claude Code. Focus exclusively on functional test survival.
- Pivot to Live Continuous Evaluation: Stop hand-authoring canonical JSONL traces. Reallocate engineering cycles from ccpa-replayer maintenance to a live "Arena" runner (SWE-Bench / ProgramBench style). End-to-end execution, even if non-deterministic, provides higher-fidelity signal for convergence.
- Prioritize Error Recovery over Zero-Shot Determinism: Shift the evaluation focus from zero-shot trajectory matching to the agent's ability to recover from failed bash commands or test runs. Real-world convergence depends on self-correction, which static traces cannot evaluate.
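One possible way to score the error-recovery recommendation is sketched below. The metric definition is an assumption made for illustration, not something the project specifies: a "failure episode" is a run of consecutive failed steps, and it counts as recovered if a successful step follows it. `StepOutcome` and `recovery_rate` are hypothetical names.

```rust
/// Outcome of one agent step (e.g., a bash command or a test run).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StepOutcome {
    Success,
    Failure,
}

/// ASSUMED metric: fraction of failure episodes (runs of consecutive
/// failures) that are followed by a success. Returns `None` when the
/// trajectory contains no failures, since there is nothing to recover from.
pub fn recovery_rate(steps: &[StepOutcome]) -> Option<f64> {
    let mut episodes = 0usize;
    let mut recovered = 0usize;
    let mut in_failure = false;

    for step in steps {
        match step {
            StepOutcome::Failure => {
                // A new episode starts only on the first failure of a run.
                if !in_failure {
                    episodes += 1;
                    in_failure = true;
                }
            }
            StepOutcome::Success => {
                // A success right after a failure run closes it as recovered.
                if in_failure {
                    recovered += 1;
                    in_failure = false;
                }
            }
        }
    }

    if episodes == 0 {
        None
    } else {
        Some(recovered as f64 / episodes as f64)
    }
}
```

For the trajectory success, failure, failure, success, failure, there are two failure episodes and only the first is recovered, so the rate is 0.5. Unlike trajectory matching, this score does not depend on the student taking the same number of turns as the teacher.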