Design Audit: Phase 6 Under-Contract Bench Plan

1. Background & Motivation

The Phase 6 plan (phase-6-under-contract-bench-plan.md) proposes shifting the measurement from raw code generation to an "under-contract" regime by integrating pmat comply and pv validate into the agent's per-turn feedback loop. This addresses the operator directive that code failing compliance cannot ship. Under that directive, raw pass rates (such as the M150 HumanEval 1.0000) are misleading for production environments.

2. Citations & Academic Basis

  • Shinn et al. 2023 (arXiv:2303.11366) - Reflexion: Language Agents with Verbal Reinforcement Learning:
    • Application: Provides the theoretical basis for compliance_recovery_rate. Exposing pmat comply output to the agent as a failed tool use forces iterative self-correction, converting a zero-shot failure into a multi-turn recovery process.
  • Yang et al. 2026 (arXiv:2605.03546) - ProgramBench: Evaluating Language Models at the Project Scale:
    • Application: Highlights the necessity of evaluating agents under strict, real-world constraints rather than idealized function-scale tests. Phase 6 operationalizes this by enforcing paiml organizational standards (pmat and pv).
  • Jimenez et al. 2023 (arXiv:2310.06770) - SWE-bench: Can Language Models Resolve Real-World GitHub Issues?:
    • Application: Demonstrates that execution feedback is essential for resolving complex bugs. Phase 6 extends execution feedback beyond cargo test to include static analysis and contract validation.

3. Implementation Critiques & Code Examples

3.1. The "Compliance Trap" in the Dispatch Loop

The proposed dispatch loop intercepts file mutations and forces a compliance check, converting successes into failures if constraints are violated.

Code Example: crates/ccpa-arena/src/dispatch.rs (Proposed)

match tool {
    // `..` ignores the fields this arm does not need.
    ToolInvocation::Edit { path, .. } | ToolInvocation::Write { path, .. } => {
        apply_edit_or_write(...)?;
        let sha = sha256_of_file(path);

        // Run the compliance check only when the session enforces it.
        let compliance = if session.compliance_enforced {
            Some(run_pmat_comply_check(cwd))
        } else { None };

        // Borrow immutably: the check only reads the result.
        if let Some(c) = compliance.as_ref() {
            if !c.pmat_ok {
                // Intercept: the write landed on disk, but the turn is reported as a failure.
                return ToolResult::FileMutated {
                    sha256: sha, success: false,
                    compliance_check: Some(c.clone()),
                };
            }
        }
        ToolResult::FileMutated { sha256: sha, success: true, compliance_check: compliance }
    }
}

Five-Whys Root Cause Analysis:

  1. Why does the dispatch loop intercept a successful file write and mark it as success: false? Because the written code violated pmat comply (e.g., it contained a disallowed unwrap()).
  2. Why is pmat comply treated as a tool failure instead of an oracle failure at the end? Because we want to measure compliance_recovery_rate. If we wait until the oracle runs, the agent never gets a chance to see the compliance error and fix it.
  3. Why does the agent need to fix it? Because in the paiml org, code that fails pmat comply cannot be merged, regardless of whether cargo test passes.
  4. Why is there a risk of a "Compliance Trap" here? Because an LLM that does not understand the pmat rules might repeatedly try to write the same unwrap(), repeating the same failure until the wall-clock timeout expires.
  5. Why is this loop penalty an acceptable design? Because evaluating an agent's ability to ingest a novel static analysis error and alter its behavior (Reflexion) is the core metric of Phase 6. The compliance_cost_ratio directly measures this penalty.
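
The trap described in item 4 could be bounded with a per-session counter of consecutive compliance failures. The following is a minimal sketch under stated assumptions: the names ComplianceTrapGuard and TrapVerdict are hypothetical, and the plan does not specify a budget, so the limit is a caller-supplied parameter.

```rust
/// Decision returned after each file-mutation turn (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrapVerdict {
    /// The agent may take another turn.
    Continue,
    /// The consecutive-failure budget is exhausted; end the session.
    Abort,
}

/// Counts consecutive compliance-failure turns (hypothetical helper).
#[derive(Debug)]
pub struct ComplianceTrapGuard {
    /// Assumed budget; the plan does not state a value.
    max_consecutive: u32,
    consecutive: u32,
}

impl ComplianceTrapGuard {
    pub fn new(max_consecutive: u32) -> Self {
        Self { max_consecutive, consecutive: 0 }
    }

    /// Record one turn. `pmat_ok` mirrors the field used in the dispatch loop.
    /// A compliant turn resets the streak; a failing turn extends it.
    pub fn record(&mut self, pmat_ok: bool) -> TrapVerdict {
        if pmat_ok {
            self.consecutive = 0;
            return TrapVerdict::Continue;
        }
        self.consecutive += 1;
        if self.consecutive >= self.max_consecutive {
            TrapVerdict::Abort
        } else {
            TrapVerdict::Continue
        }
    }
}
```

Resetting the streak on any compliant turn keeps the guard from penalizing an agent that recovers and later introduces an unrelated violation. Only an unbroken run of failures ends the session early.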

3.2. The Compound Oracle

Phase 6 redefines success from "tests pass" to "tests pass AND policies pass."

Code Example: crates/ccpa-arena/src/dispatch.rs (Proposed)

fn run_oracle(cwd: &Path, session: &ArenaSession) -> OracleOutcome {
    // `Command::output` returns io::Result<Output>; a shell that fails to spawn counts as a failed test run.
    let cargo_test = Command::new("sh").arg("-c").arg(&session.oracle_cmd).current_dir(cwd).output();
    let compliance = if session.compliance_enforced { Some(run_pmat_comply_check(cwd)) } else { None };

    let cargo_ok = cargo_test
        .as_ref()
        .map_or(false, |out| out.status.success() && session.oracle_pattern_matches(&out.stdout));
    // With enforcement off there is no compliance result, which counts as compliant.
    let comply_ok = compliance.as_ref().map_or(true, |c| c.pmat_ok);

    if cargo_ok && comply_ok {
        OracleOutcome::Passed
    } else {
        OracleOutcome::Failed { cargo_ok, comply_ok, stderr: ... }
    }
}

Five-Whys Root Cause Analysis:

  1. Why does the oracle compound cargo_test and compliance checks? To simulate the exact CI gate used by the paiml repository.
  2. Why re-run compliance in the oracle if it was already checked in the dispatch loop? Because an agent might make a valid edit, followed by a bash command that mutates state (e.g., generating code via a script) that introduces a compliance violation without triggering the FileMutated tool intercept.
  3. Why is this distinction important? Because FALSIFY-CCPA-020 requires absolute proof that the final workspace state is compliant before scoring a pass.
  4. Why penalize agents for stylistic violations if the tests pass? Because in a highly structured monorepo, technical debt and invariant violations are as catastrophic as failing tests.
  5. Why does this matter for the benchmark? It grounds the benchmark in practical utility. A model that scores 100% on HumanEval but 0% on the Compound Oracle has little value in this organization's engineering context.
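
The requirement in item 3 can be checked mechanically over the terminal records. The sketch below is illustrative only: TerminalRecord and its field names are hypothetical. It also assumes that a record scored as a pass without any compliance result counts as a violation, which the plan may or may not intend.

```rust
/// Hypothetical terminal record; field names are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct TerminalRecord {
    pub session_id: String,
    /// Whether the session was scored as a pass.
    pub passed: bool,
    /// `None` when no compliance check ran for the final workspace state.
    pub pmat_ok: Option<bool>,
}

/// Returns the ids of records scored as passes without a positive
/// compliance result. An empty result means the gate holds.
pub fn ccpa_020_violations(records: &[TerminalRecord]) -> Vec<&str> {
    records
        .iter()
        // Assumption: a pass with a missing compliance result is a violation.
        .filter(|r| r.passed && r.pmat_ok != Some(true))
        .map(|r| r.session_id.as_str())
        .collect()
}
```

Failed sessions are ignored by design: the gate constrains what may be scored as a pass, not what the agent produced along the way.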

4. Popperian Falsification of Phase 6

Hypothesis: An agent's raw performance on standard benchmarks (e.g., HumanEval) is an inaccurate predictor of its success rate in a strict, contract-bound engineering environment.

Falsification Condition (via CCPA-020): If under_contract_pass_rate proves to be statistically indistinguishable from phase_3_pass_rate (i.e., compliance_cost_ratio ≈ 1.0) for both Claude and apr code, then the hypothesis is falsified. This would imply that the organizational contract constraints do not meaningfully impede standard code generation, rendering the Phase 6 machinery unnecessary. If the hypothesis holds, the expected outcome is a compliance_cost_ratio significantly below 1.0.

Falsification Condition (Recovery Value): If compliance_recovery_rate is 0.0 for all models (meaning no model ever successfully uses the per-turn pmat feedback to fix a violation), the dispatch loop interception is falsified as a useful mechanism. If models cannot learn from the feedback, the compliance check should simply be moved to the final oracle to save token costs.
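
Both falsification conditions reduce to simple ratios over per-session outcomes. The sketch below shows one plausible way to compute them. The formulas are assumptions inferred from the text above rather than definitions taken from the plan: compliance_cost_ratio is taken as under_contract_pass_rate divided by phase_3_pass_rate, and compliance_recovery_rate as the share of sessions with at least one per-turn compliance failure that end in a compliant workspace. SessionOutcome and its fields are hypothetical.

```rust
/// Hypothetical per-session summary; field names are assumptions.
#[derive(Debug, Clone, Copy)]
pub struct SessionOutcome {
    /// Compound oracle passed (tests and compliance).
    pub passed_under_contract: bool,
    /// At least one per-turn compliance failure occurred.
    pub hit_compliance_failure: bool,
    /// The final workspace state was compliant.
    pub final_pmat_ok: bool,
}

/// Fraction of sessions that passed the compound oracle.
pub fn under_contract_pass_rate(sessions: &[SessionOutcome]) -> Option<f64> {
    if sessions.is_empty() {
        return None;
    }
    let passes = sessions.iter().filter(|s| s.passed_under_contract).count();
    Some(passes as f64 / sessions.len() as f64)
}

/// Assumed definition: under-contract pass rate over the Phase 3 pass rate.
/// Returns `None` when the Phase 3 rate is zero, since the ratio is undefined.
pub fn compliance_cost_ratio(under_contract: f64, phase_3: f64) -> Option<f64> {
    if phase_3 > 0.0 {
        Some(under_contract / phase_3)
    } else {
        None
    }
}

/// Assumed definition: among sessions that hit a compliance failure,
/// the fraction that ended compliant. `None` if no session hit one.
pub fn compliance_recovery_rate(sessions: &[SessionOutcome]) -> Option<f64> {
    let tripped: Vec<&SessionOutcome> =
        sessions.iter().filter(|s| s.hit_compliance_failure).collect();
    if tripped.is_empty() {
        return None;
    }
    let recovered = tripped.iter().filter(|s| s.final_pmat_ok).count();
    Some(recovered as f64 / tripped.len() as f64)
}
```

Returning Option rather than 0.0 for empty denominators keeps "no session ever hit a compliance failure" distinct from "no session ever recovered". Only the latter bears on the second falsification condition. Point estimates alone do not establish statistical indistinguishability, so a real analysis would still need a significance test over the n=20 corpus.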

5. Tactical Recommendations

  1. Prioritize CCPA-020 Implementation: The gate ensuring pmat_ok == true on terminal records is critical. It must be implemented before any live bench runs are treated as canonical evidence.
  2. Standardize the stderr_excerpt: Ensure the stderr_excerpt passed back to the LLM (P6.2) is dense with actionable signal. Truncating standard pmat output blindly might strip the exact line number needed for recovery.
  3. Formalize Turn Limits: Update P6.3 to explicitly track and limit the number of consecutive compliance-failure turns to avoid runaway token costs (the "Compliance Trap").
  4. Corpus Bias Acknowledgment: The proposed n=20 corpus spans 4 categories (LeetCode, OO, Transpile, Unix) but remains scoped to <80 LOC. Acknowledge that this hardens the measurement machinery but does not reach true project-scale parity (e.g., rebuilding SQLite).
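
For recommendation 2, one alternative to blind head truncation is to spend the line budget on location-bearing lines first. The sketch below is a hedged illustration: it assumes diagnostics carry a `.rs:<line>` marker, which is a guess and not pmat's documented output format, and the function names are hypothetical.

```rust
/// True if the line contains a `.rs:<digit>` location marker.
/// Assumption: diagnostics reference Rust sources in `path.rs:LINE` form.
fn has_location(line: &str) -> bool {
    line.match_indices(".rs:")
        .any(|(i, m)| line[i + m.len()..].starts_with(|c: char| c.is_ascii_digit()))
}

/// Build an excerpt of at most `max_lines` lines. Lines with a location
/// marker are selected first; any remaining budget is filled from the top.
/// Selected lines are emitted in their original order.
pub fn stderr_excerpt(stderr: &str, max_lines: usize) -> String {
    let lines: Vec<&str> = stderr.lines().collect();
    let mut keep = vec![false; lines.len()];
    let mut budget = max_lines;

    // First pass: reserve budget for lines that carry a source location.
    for (i, line) in lines.iter().enumerate() {
        if budget == 0 {
            break;
        }
        if has_location(line) {
            keep[i] = true;
            budget -= 1;
        }
    }
    // Second pass: spend what is left on the remaining lines, top down.
    for flag in keep.iter_mut() {
        if budget == 0 {
            break;
        }
        if !*flag {
            *flag = true;
            budget -= 1;
        }
    }

    lines
        .iter()
        .zip(keep)
        .filter(|(_, k)| *k)
        .map(|(line, _)| *line)
        .collect::<Vec<_>>()
        .join("\n")
}
```

With a budget of one line, this keeps the first diagnostic that names a file and line rather than a banner or summary line, which is the information the agent needs to attempt a recovery.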