Design Audit: Tactical Convergence & Project-Scale Parity

1. Background & Motivation

The claude-code-parity-apr (CCPA) project has successfully implemented a rigorous, falsifiable record-replay-distill harness (M0-M190). The pivot from "procedural parity" (OS-level traces) to "outcome parity" (M149) correctly identified that users value functional outcomes over identical system calls. However, as the project scales towards multi-file, project-scale outcome parity (M188+), the existing static-fixture infrastructure risks becoming an engineering bottleneck rather than an accelerator.

2. Citations & Academic Basis

  • Hinton et al. 2015 (arXiv:1503.02531) - Distilling the Knowledge in a Neural Network:
    • Application: The project's foundational distillation framing.
    • Critique: While intellectually satisfying, distilling orchestration traces via static fixtures lacks the dynamic feedback of true model distillation.
  • Cassano et al. 2022 (arXiv:2208.08227) - MultiPL-E: A Scalable and Polyglot Approach to Evaluating Neural Code Generation:
    • Application: Successfully utilized in M150 to prove 1.0000 function-level outcome parity.
  • Jimenez et al. 2023 (arXiv:2310.06770) - SWE-bench: Can Language Models Resolve Real-World GitHub Issues?:
    • Application: Cited as justification for differential evaluation. The benchmark also highlights that real tasks are non-deterministic and require execution feedback.
  • Yang et al. 2026 (arXiv:2605.03546) - ProgramBench: Evaluating Language Models at the Project Scale:
    • Application: Crucial prior art demonstrating the cliff between function-level parity and project-scale reality (0% full resolution for top models). This suggests project-scale parity requires dynamic, multi-turn exploration, not just single-turn determinism.

3. Implementation Critiques & Code Examples

3.1. The Heavy Mock Infrastructure (ccpa-replayer)

The project relies on a RecordedDriver to artificially force deterministic replay of a "teacher's" steps.

Code Example: crates/ccpa-replayer/src/driver.rs

impl LlmDriver for RecordedDriver {
    fn next_turn(&mut self) -> Result<NextTurn, ReplayError> {
        match self.turns.get(self.cursor) {
            Some(turn) => {
                let out = turn.clone();
                self.cursor = self.cursor.saturating_add(1);
                Ok(out)
            }
            None => Err(ReplayError::DriverExhausted {
                position: self.cursor,
                total: self.turns.len(),
            }),
        }
    }
    // ...
}

Five-Whys Root Cause Analysis:

  1. Why does RecordedDriver fail with DriverExhausted? Because the apr code agent requested more turns than the teacher recorded, typically after an unexpected tool call or a reasoning trajectory that diverged from the teacher's.
  2. Why did it make an unexpected tool call? Because real LLM execution is non-deterministic, and different models (e.g., Qwen vs Claude) solve problems differently even if both are successful.
  3. Why does the test fail when this happens? Because FALSIFY-CCPA-003 (mock completeness) strictly mandates that the student consumes exactly the recorded teacher turns.
  4. Why does the project mandate exact consumption? Because the original M0 distillation framing (Hinton) assumed we needed to isolate "orchestration drift" by holding the LLM constant.
  5. Why is isolating orchestration drift a problem now? Because it overfits the agent to a single "golden path", penalizing valid self-correction and alternative valid strategies, which are essential for project-scale convergence (Yang et al. 2026).
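
The failure chain above can be reproduced with a minimal, self-contained sketch. The types below are simplified stand-ins for illustration (NextTurn as a plain string wrapper, ReplayError with a single variant), not the actual definitions in ccpa-replayer, and run_student is a hypothetical helper that models a student needing a given number of turns.

```rust
// Simplified stand-ins for the ccpa-replayer types (assumptions, not the real definitions).
#[derive(Debug, Clone, PartialEq)]
pub struct NextTurn(pub String);

#[derive(Debug, PartialEq)]
pub enum ReplayError {
    DriverExhausted { position: usize, total: usize },
}

pub struct RecordedDriver {
    turns: Vec<NextTurn>,
    cursor: usize,
}

impl RecordedDriver {
    pub fn new(turns: Vec<NextTurn>) -> Self {
        Self { turns, cursor: 0 }
    }

    // Same cursor logic as the driver.rs excerpt: hand out recorded turns in
    // order, and fail once the student asks for more than was recorded.
    pub fn next_turn(&mut self) -> Result<NextTurn, ReplayError> {
        match self.turns.get(self.cursor) {
            Some(turn) => {
                let out = turn.clone();
                self.cursor = self.cursor.saturating_add(1);
                Ok(out)
            }
            None => Err(ReplayError::DriverExhausted {
                position: self.cursor,
                total: self.turns.len(),
            }),
        }
    }
}

// Hypothetical helper: a student that needs `turns_needed` turns to finish.
// Returns the number of turns consumed, or the first replay error.
pub fn run_student(
    driver: &mut RecordedDriver,
    turns_needed: usize,
) -> Result<usize, ReplayError> {
    for _ in 0..turns_needed {
        driver.next_turn()?;
    }
    Ok(turns_needed)
}
```

With two recorded turns, a student that follows the golden path (two turns) succeeds, while a student that spends one extra turn on self-correction fails with DriverExhausted { position: 2, total: 2 }, regardless of whether its final output would have been correct.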

3.2. Structural Equivalence Scoring (ccpa-differ)

The project measures "structural equivalence" as a proxy for parity, but these metrics often miss the semantic outcome.

Code Example: crates/ccpa-differ/src/outcome_diff.rs

pub fn cross_output_equivalence(teacher: &str, student: &str) -> CrossOutputReport {
    let teacher_set: std::collections::BTreeSet<&str> = teacher
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .collect();
    let student_set: std::collections::BTreeSet<&str> = student
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .collect();

    let common = teacher_set.intersection(&student_set).count();
    let union = teacher_set.union(&student_set).count();
    let lines_jaccard = if union == 0 { 1.0 } else { (common as f64) / (union as f64) };
    // ...
}

Five-Whys Root Cause Analysis:

  1. Why does line-set Jaccard score 0.33 even when both the teacher's and the student's outputs pass their tests (e.g. M153)? Because the student and teacher wrote different implementations (different variable names, different logic structure).
  2. Why does the Jaccard score penalize this? Because it is a purely structural metric comparing exact text lines rather than semantic behavior.
  3. Why use a structural metric for generated code? Because calculating true semantic equivalence (e.g., via formal verification or exhaustive fuzzing) is computationally expensive and difficult to build.
  4. Why not just rely on the tests passing? Because early phases (M148) focused on "procedural parity" (how it was done) rather than "outcome parity" (did it work).
  5. Why is this metric distracting? Because users care about functional code (Cassano et al. 2022). Engineering effort spent optimizing line-set Jaccard or OS-event Jaccard is wasted if the fundamental goal is simply to have a working, test-verified output.
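
A small worked example makes the gap between structural and functional equivalence concrete. The sketch below reimplements only the line-set Jaccard step from the excerpt above as a standalone function (lines_jaccard and line_set are illustrative names, not the project's API) and applies it to two hypothetical snippets that differ only in parameter names. It is not a reproduction of the M153 measurement.

```rust
use std::collections::BTreeSet;

// Collect the trimmed, non-empty lines of a text into a set.
fn line_set(text: &str) -> BTreeSet<&str> {
    text.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .collect()
}

// Line-set Jaccard: |intersection| / |union|, defined as 1.0 when both sets are empty.
pub fn lines_jaccard(teacher: &str, student: &str) -> f64 {
    let teacher_set = line_set(teacher);
    let student_set = line_set(student);
    let common = teacher_set.intersection(&student_set).count();
    let union = teacher_set.union(&student_set).count();
    if union == 0 {
        1.0
    } else {
        common as f64 / union as f64
    }
}

// Hypothetical teacher and student outputs: same behavior, different names.
pub const TEACHER: &str = "fn add(a: i32, b: i32) -> i32 {\n    a + b\n}\n";
pub const STUDENT: &str = "fn add(x: i32, y: i32) -> i32 {\n    x + y\n}\n";
```

Here the only shared line is the closing brace, so the score is 1 common line out of 5 distinct lines, or 0.2, even though any test of add would pass identically against both versions.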

4. Pro/Con of Current Approach

Pros

  • Extreme Rigor: The 17 mechanical gates (pv validate) enforce a highly disciplined engineering culture.
  • Clear Lexicon: The distinct vocabulary around "procedural parity", "outcome parity", and "structural equivalence" provides excellent analytical clarity.

Cons

  • Overfitting to Golden Paths: The M2.3 rescope away from a live HTTPS proxy to hand-authored canonical fixtures risks overfitting the agent to a single recorded trajectory.
  • Diminishing Returns on Structural Parity: Metrics like OS-event Jaccard and line-set Jaccard are interesting but not load-bearing for user success.

5. Popperian Falsification of the Current Approach

Hypothesis: The static, mock-driven RecordedDriver replay infrastructure accurately predicts the live, end-to-end success rates of apr code on real multi-step engineering tasks.

Falsification Condition: If apr code achieves > 0.95 parity scores on the static canonical fixtures (FALSIFY-CCPA-008), but simultaneously scores near 0.0 on live, project-scale outcome benchmarks (e.g., ProgramBench tasks per FALSIFY-CCPA-017), the hypothesis is falsified. The static fixture approach would be proven inadequate as a predictive convergence metric, exposing that dynamic recovery and live LLM feedback loops are the true drivers of project-scale parity.
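
The falsification condition can be written as a mechanical check. This is a sketch under stated assumptions: the 0.95 threshold comes from the condition above, while NEAR_ZERO is an assumed tolerance for "near 0.0" that the project would need to fix in advance. The names Verdict and evaluate_hypothesis are hypothetical.

```rust
// Static-fixture parity threshold, taken from the falsification condition.
pub const STATIC_PARITY_THRESHOLD: f64 = 0.95;
// Assumed tolerance for "near 0.0" on live benchmarks (not a project constant).
pub const NEAR_ZERO: f64 = 0.05;

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Falsified,
    NotFalsified,
}

// The hypothesis is falsified only when high static parity coincides with
// a near-zero live, project-scale outcome score.
pub fn evaluate_hypothesis(static_parity: f64, live_score: f64) -> Verdict {
    if static_parity > STATIC_PARITY_THRESHOLD && live_score <= NEAR_ZERO {
        Verdict::Falsified
    } else {
        Verdict::NotFalsified
    }
}
```

Fixing both thresholds before running the live benchmark keeps the test falsifiable: a low static score or a healthy live score leaves the hypothesis standing, and only the combination named in the condition refutes it.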

6. Recommendations for Faster Convergence

To more quickly converge on a working, project-scale solution, we recommend the following tactical shifts:

  1. Deprecate Procedural/Structural Gates: Soft-deprecate FALSIFY-CCPA-014 (OS-event parity). Accept that apr code solves problems structurally differently than Claude Code. Focus exclusively on functional test survival.
  2. Pivot to Live Continuous Evaluation: Stop hand-authoring canonical JSONL traces. Reallocate engineering cycles from ccpa-replayer maintenance to a live "Arena" runner (SWE-bench / ProgramBench style). End-to-end execution, even if non-deterministic, provides higher-fidelity signal for convergence.
  3. Prioritize Error Recovery over Zero-Shot Determinism: Shift the evaluation focus from zero-shot trajectory matching to the agent's ability to recover from failed bash commands or test runs. Real-world convergence depends on self-correction, which static traces cannot evaluate.
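
One possible way to quantify the error-recovery focus in recommendation 3 is a recovery rate over a live episode's step outcomes. The sketch below is illustrative only: StepOutcome and recovery_rate are hypothetical names, and the definition (a run of consecutive failures counts as recovered if a later step succeeds) is an assumption, not an existing project metric.

```rust
// Outcome of one agent step, e.g. a bash command or a test run.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StepOutcome {
    Success,
    Failure,
}

// Fraction of failure runs that are followed by a success.
// Consecutive failures are grouped into a single run.
// Returns None when the episode contains no failures at all.
pub fn recovery_rate(steps: &[StepOutcome]) -> Option<f64> {
    let mut failure_runs = 0usize;
    let mut recovered = 0usize;
    let mut in_failure = false;

    for step in steps {
        match step {
            StepOutcome::Failure => {
                if !in_failure {
                    failure_runs += 1;
                    in_failure = true;
                }
            }
            StepOutcome::Success => {
                if in_failure {
                    recovered += 1;
                    in_failure = false;
                }
            }
        }
    }

    if failure_runs == 0 {
        None
    } else {
        Some(recovered as f64 / failure_runs as f64)
    }
}
```

For an episode of success, failure, failure, success, failure, there are two failure runs and one recovery, giving a rate of 0.5. A static trace cannot produce this number, since it never lets the agent fail and try again.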