The Phase 6 plan (phase-6-under-contract-bench-plan.md) proposes shifting the measurement from raw code generation to an "under-contract" regime. It integrates pmat comply and pv validate into the agent's per-turn feedback loop. This directly addresses the operator directive that code failing compliance cannot ship, a rule that makes raw pass rates (like the M150 HumanEval 1.0000) misleading for production environments.
- Shinn et al. 2023 (arXiv:2303.11366) - Reflexion: Language Agents with Verbal Reinforcement Learning:
  - Application: Provides the theoretical basis for compliance_recovery_rate. Exposing pmat comply output to the agent as a failed tool use forces iterative self-correction, converting a zero-shot failure into a multi-turn recovery process.
- Yang et al. 2026 (arXiv:2605.03546) - ProgramBench: Evaluating Language Models at the Project Scale:
  - Application: Highlights the necessity of evaluating agents under strict, real-world constraints rather than idealized function-scale tests. Phase 6 operationalizes this by enforcing paiml organizational standards (pmat and pv).
- Jimenez et al. 2023 (arXiv:2310.06770) - SWE-bench: Can Language Models Resolve Real-World GitHub Issues?:
  - Application: Demonstrates that execution feedback is essential for resolving complex bugs. Phase 6 extends execution feedback from just cargo test to include static analysis and contract validation.
The proposed dispatch loop intercepts file mutations and forces a compliance check, converting successes into failures if constraints are violated.
Code Example: crates/ccpa-arena/src/dispatch.rs (Proposed)
    match tool {
        // The `..` rest pattern ignores the remaining fields of each variant.
        ToolInvocation::Edit { path, .. } | ToolInvocation::Write { path, .. } => {
            apply_edit_or_write(...)?; // arguments elided in the plan
            let sha = sha256_of_file(path);
            // Run the compliance check only when the session enforces it.
            let compliance = if session.compliance_enforced {
                Some(run_pmat_comply_check(cwd))
            } else {
                None
            };
            // Borrow immutably: the check result is only read, never modified.
            if let Some(c) = compliance.as_ref() {
                if !c.pmat_ok {
                    // Intercept and report as failure
                    return ToolResult::FileMutated {
                        sha256: sha,
                        success: false,
                        compliance_check: Some(c.clone()),
                    };
                }
            }
            ToolResult::FileMutated { sha256: sha, success: true, compliance_check: compliance }
        }
    }

Five-Whys Root Cause Analysis:
- Why does the dispatch loop intercept a successful file write and mark it as success: false? Because the written code violated pmat comply (e.g., it contained a disallowed unwrap()).
- Why is pmat comply treated as a tool failure instead of an oracle failure at the end? Because we want to measure compliance_recovery_rate. If we wait until the oracle runs, the agent never gets a chance to see the compliance error and fix it.
- Why does the agent need to fix it? Because in the paiml org, code that fails pmat comply cannot be merged, regardless of whether cargo test passes.
- Why is there a risk of a "Compliance Trap" here? Because an LLM that does not understand the pmat rules might repeatedly try to write the same unwrap(), looping until the wall timeout.
- Why is this loop penalty an acceptable design? Because evaluating an agent's ability to ingest a novel static analysis error and alter its behavior (Reflexion) is the core metric of Phase 6. The compliance_cost_ratio directly measures this penalty.
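
The "Compliance Trap" described above can be bounded with a counter of consecutive compliance failures. The following is a minimal sketch, not the plan's implementation: ComplianceTrapGuard, GuardDecision, and the max_consecutive budget are hypothetical names introduced here for illustration, and the plan does not specify a limit value.

```rust
/// Hypothetical guard that caps consecutive compliance-failure turns.
/// None of these names come from the Phase 6 plan.
#[derive(Debug)]
pub struct ComplianceTrapGuard {
    max_consecutive: u32, // assumed budget; the plan does not specify one
    consecutive: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GuardDecision {
    /// The session may keep going.
    Continue,
    /// The agent is stuck repeating violations; stop before the wall timeout.
    Abort { consecutive_failures: u32 },
}

impl ComplianceTrapGuard {
    pub fn new(max_consecutive: u32) -> Self {
        Self { max_consecutive, consecutive: 0 }
    }

    /// Record the compliance result of one file-mutating turn.
    pub fn record(&mut self, pmat_ok: bool) -> GuardDecision {
        if pmat_ok {
            // A compliant turn resets the streak.
            self.consecutive = 0;
            return GuardDecision::Continue;
        }
        self.consecutive += 1;
        if self.consecutive >= self.max_consecutive {
            GuardDecision::Abort { consecutive_failures: self.consecutive }
        } else {
            GuardDecision::Continue
        }
    }
}
```

Resetting the streak on any compliant turn means an agent that alternates between failing and recovering is never aborted; only an unbroken run of failures trips the guard.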
Phase 6 redefines success from "tests pass" to "tests pass AND policies pass."
Code Example: crates/ccpa-arena/src/dispatch.rs (Proposed)
    fn run_oracle(cwd: &Path, session: &ArenaSession) -> OracleOutcome {
        // output() returns io::Result<Output>, so it cannot be read as an Output directly.
        let cargo_test = Command::new("sh").arg("-c").arg(&session.oracle_cmd).current_dir(cwd).output();
        let compliance = if session.compliance_enforced { Some(run_pmat_comply_check(cwd)) } else { None };
        // A failure to spawn the oracle command is counted as a failed test run.
        let cargo_ok = cargo_test
            .as_ref()
            .map_or(false, |out| out.status.success() && session.oracle_pattern_matches(&out.stdout));
        // With enforcement off, compliance is vacuously satisfied.
        let comply_ok = compliance.as_ref().map_or(true, |c| c.pmat_ok);
        if cargo_ok && comply_ok {
            OracleOutcome::Passed
        } else {
            OracleOutcome::Failed { cargo_ok, comply_ok, stderr: ... }
        }
    }

Five-Whys Root Cause Analysis:
- Why does the oracle compound cargo_test and compliance checks? To simulate the exact CI gate used by the paiml repository.
- Why re-run compliance in the oracle if it was already checked in the dispatch loop? Because an agent might make a valid edit, followed by a bash command that mutates state (e.g., generating code via a script) that introduces a compliance violation without triggering the FileMutated tool intercept.
- Why is this distinction important? Because FALSIFY-CCPA-020 requires absolute proof that the final workspace state is compliant before scoring a pass.
- Why penalize agents for stylistic violations if the tests pass? Because in a highly structured monorepo, technical debt and invariant violations are as catastrophic as failing tests.
- Why does this matter for the benchmark? It grounds the benchmark in practical utility. A model that scores 100% on HumanEval but 0% on Compound Oracle is useless to the specific engineering context of this organization.
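
The FALSIFY-CCPA-020 requirement mentioned above (no pass without proof of final compliance) can be checked after the fact over recorded results. This is a hedged sketch: TerminalRecord and every field name except pmat_ok are assumptions made for illustration, not the plan's schema.

```rust
/// Hypothetical terminal record. Only `pmat_ok` is named in the plan;
/// the other fields are assumptions for this sketch.
#[derive(Debug, Clone)]
pub struct TerminalRecord {
    pub task_id: String,
    pub scored_pass: bool,
    pub compliance_enforced: bool,
    /// None means no compliance result was recorded at all.
    pub pmat_ok: Option<bool>,
}

/// Returns the task IDs of records that claim a pass under enforcement
/// without a recorded `pmat_ok == true`.
pub fn ccpa_020_violations(records: &[TerminalRecord]) -> Vec<&str> {
    records
        .iter()
        .filter(|r| r.scored_pass && r.compliance_enforced && r.pmat_ok != Some(true))
        .map(|r| r.task_id.as_str())
        .collect()
}
```

Treating a missing result (None) as a violation is a deliberate choice in this sketch: absence of evidence is not accepted as proof of compliance.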
Hypothesis: An agent's raw performance on standard benchmarks (e.g., HumanEval) is an inaccurate predictor of its success rate in a strict, contract-bound engineering environment.
Falsification Condition (via CCPA-020):
If under_contract_pass_rate proves to be statistically indistinguishable from phase_3_pass_rate (i.e., compliance_cost_ratio ≈ 1.0) for both Claude and apr code, then the hypothesis is falsified. This would imply that the organizational contract constraints do not meaningfully impede standard code generation, rendering the Phase 6 machinery unnecessary. The expected true outcome is that compliance_cost_ratio is significantly < 1.0.
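
To make the falsification condition concrete, the sketch below computes compliance_cost_ratio under an assumed definition: under_contract_pass_rate divided by phase_3_pass_rate. That definition is inferred from the text (a ratio near 1.0 means no penalty, below 1.0 means a penalty) and is not confirmed by the plan; the function names are illustrative.

```rust
/// Fraction of tasks passed, or None when there are no tasks.
pub fn pass_rate(passed: u32, total: u32) -> Option<f64> {
    if total == 0 {
        None
    } else {
        Some(f64::from(passed) / f64::from(total))
    }
}

/// Assumed definition: under-contract pass rate relative to the Phase 3 baseline.
/// Returns None when the baseline is zero, since the ratio is undefined there.
pub fn compliance_cost_ratio(under_contract_pass_rate: f64, phase_3_pass_rate: f64) -> Option<f64> {
    if phase_3_pass_rate > 0.0 {
        Some(under_contract_pass_rate / phase_3_pass_rate)
    } else {
        None
    }
}
```

For example, 12 of 20 tasks passing under contract against a baseline of 20 of 20 gives a ratio of 0.6. A point estimate like this does not by itself establish statistical indistinguishability; with a small corpus, an interval or significance test on the two pass rates would still be needed.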
Falsification Condition (Recovery Value):
If compliance_recovery_rate is 0.0 for all models (meaning no model ever successfully uses the per-turn pmat feedback to fix a violation), the dispatch loop interception is falsified as a useful mechanism. If models cannot learn from the feedback, the compliance check should simply be moved to the final oracle to save token costs.
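
The recovery metric can be sketched the same way. The definition used here is an assumption: among sessions that hit at least one per-turn compliance failure, the fraction that still ended with a passing oracle. SessionSummary and its fields are hypothetical.

```rust
/// Hypothetical per-session summary; field names are assumptions for this sketch.
#[derive(Debug, Clone, Copy)]
pub struct SessionSummary {
    /// Number of per-turn pmat failures the agent was shown.
    pub compliance_failures: u32,
    /// Whether the final compound oracle passed.
    pub oracle_passed: bool,
}

/// Assumed definition: recovered sessions / sessions with at least one failure.
/// Returns None when no session ever hit a compliance failure.
pub fn compliance_recovery_rate(sessions: &[SessionSummary]) -> Option<f64> {
    let hit: Vec<&SessionSummary> = sessions
        .iter()
        .filter(|s| s.compliance_failures > 0)
        .collect();
    if hit.is_empty() {
        return None;
    }
    let recovered = hit.iter().filter(|s| s.oracle_passed).count();
    Some(recovered as f64 / hit.len() as f64)
}
```

Returning None rather than 0.0 when no session hit a violation keeps "no model ever recovered" distinct from "no model ever needed to recover", which matters for the falsification condition above.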
- Prioritize CCPA-020 Implementation: The gate ensuring pmat_ok == true on terminal records is critical. It must be implemented before any live bench runs are treated as canonical evidence.
- Standardize the stderr_excerpt: Ensure the stderr_excerpt passed back to the LLM (P6.2) is dense with actionable signal. Truncating standard pmat output blindly might strip the exact line number needed for recovery.
- Formalize Turn Limits: Update P6.3 to explicitly track and limit the number of consecutive compliance-failure turns to avoid runaway token costs (the "Compliance Trap").
- Corpus Bias Acknowledgment: The proposed n=20 corpus spans 4 categories (LeetCode, OO, Transpile, Unix) but remains scoped to <80 LOC. Acknowledge that this hardens the measurement machinery but does not reach true project-scale parity (e.g., rebuilding SQLite).
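
One way to keep stderr_excerpt actionable under a line budget is to prioritize lines that carry a source location. The sketch below is a heuristic under a loud assumption: it presumes diagnostics contain locations of the form path:line (as rustc-style output does). The actual pmat output format is not specified here, and both function names are hypothetical.

```rust
/// Heuristic: does the line contain a `:<digit>` marker such as `src/lib.rs:42`?
/// ASSUMPTION: diagnostics use a `path:line` location format.
fn has_line_location(line: &str) -> bool {
    line.as_bytes()
        .windows(2)
        .any(|w| w[0] == b':' && w[1].is_ascii_digit())
}

/// Build an excerpt of at most `max_lines` lines, keeping location-bearing
/// lines first and filling any remaining budget from the top of the output.
/// Kept lines are emitted in their original order.
pub fn stderr_excerpt(raw: &str, max_lines: usize) -> String {
    let lines: Vec<&str> = raw.lines().collect();
    let mut keep = vec![false; lines.len()];
    let mut used = 0;

    // Pass 1: reserve budget for lines that carry a location.
    for (i, line) in lines.iter().enumerate() {
        if used == max_lines {
            break;
        }
        if has_line_location(line) {
            keep[i] = true;
            used += 1;
        }
    }
    // Pass 2: spend whatever is left on the remaining lines, top down.
    for flag in keep.iter_mut() {
        if used == max_lines {
            break;
        }
        if !*flag {
            *flag = true;
            used += 1;
        }
    }

    lines
        .iter()
        .zip(&keep)
        .filter(|(_, k)| **k)
        .map(|(l, _)| *l)
        .collect::<Vec<_>>()
        .join("\n")
}
```

Compared with taking the first N lines, this keeps a diagnostic that appears late in the output, at the cost of occasionally favoring unrelated lines that happen to match the heuristic (for example, timestamps).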