Why this matters
The 2026-05-22 cross-repo audit (docs) found that consumers use ~25% of the substrate surface. The audit fix (issues #75, tangle-network/agent-builder#191, tax-agent#81, legal-agent#85, creative-agent#124, gtm-agent#136) closes the integration gap — consumers wire what's already there.
This issue closes the closure gap. The higher-order runtime abstractions (TraceEmitter, runEvalCampaign, AnalystRegistry) capture data — they don't auto-invoke the statistical primitives. Each is a pure function the consumer has to call explicitly. So consumers stop at "we ran the campaign" and never run the analysis that would tell them whether their judge ensemble is real, whether they're seeing reward-hacking, or what their failure clusters look like.
Fix: ship one substrate primitive that runs the 5 closure-loop primitives in one call and emits a structured analysis.json artifact. Scaffold-template wires it into every new agent's canonical eval so the loop comes for free.
In scope — 5 primitives, sharp
| Member |
Source primitive |
Why |
irr |
corpusInterRaterAgreementFromJudgeScores(records, ensemble, dims) |
3-judge ensembles across all 4 verticals are unmeasured for IRR. ZERO consumers run it today. |
cost |
{ totalUsd, p50, p95, breaches: budgetBreachView(store, ceilingUsd) } |
Every consumer hard-codes costUsd: 0. Cost as an evaluation dim doesn't exist outside of agent-builder. |
pipelines |
{ failures: failureClusterView(store), regressions: regressionView(store, baseline), judgeAgreement: judgeAgreementView(store), toolWaste: toolWasteView(store), stuckLoops: stuckLoopView(store) } |
5 pure functions over TraceStore. ZERO consumer use. Every consumer would action on a failure-cluster dashboard if they realized it was free. |
rewardHacking |
detectRewardHacking(records) |
Already wired in 3 of 4 verticals — formalize as a suite member with consistent output shape. |
predictiveValidity |
rubricPredictiveValidity(records, outcomeStore) — gated |
Only fires when outcomeStore is non-null. Without persisted deployment outcomes the primitive can't run. Suite degrades cleanly to null for this member, never throws. |
Signature
// src/reporting-suite.ts
export interface EvalReportingSuiteOpts {
/** Judge ids in the ensemble. */
ensemble: string[]
/** Dimensions scored by every judge. */
dims: string[]
/** USD budget ceiling for budgetBreachView. Optional; defaults to skip. */
budgetCeilingUsd?: number
/** Baseline run id for regression diff. Optional; defaults to skip. */
baselineRunId?: string
/** Outcome store for predictiveValidity. Null skips that member cleanly. */
outcomeStore?: OutcomeStore | null
}
export interface EvalReportingSuiteReport {
runId: string
generatedAt: string
irr: { perDim: Record<string, { icc: number; kappaWeighted: number; bootstrapCi: [number, number]; n: number }>; pooled: { icc: number; kappaWeighted: number } } | null
cost: { totalUsd: number; p50: number; p95: number; breaches: BudgetBreachView } | null
pipelines: {
failures: FailureClusterView
regressions: RegressionView | null
judgeAgreement: JudgeAgreementView
toolWaste: ToolWasteView
stuckLoops: StuckLoopView
}
rewardHacking: RewardHackingReport | null
predictiveValidity: PredictiveValidityReport | null
}
export async function evalReportingSuite(
records: RunRecord[],
traceStore: TraceStore,
opts: EvalReportingSuiteOpts,
): Promise<EvalReportingSuiteReport>
Each member is null when its input requirement isn't met (no IRR if records[].outcome.judgeScores is missing; no predictiveValidity if outcomeStore is omitted). Fail-loud doctrine: never silently zero a member — null means "not computable from these inputs," and the caller can see why.
Output
<runDir>/analysis.json alongside records.jsonl / scores.json. Read by:
- agent-builder's
/app/admin/findings UI to surface IRR + failure-cluster trends
- Consumer scaffold's
pnpm eval:report script (rendered by agent-builder scaffold)
- CI gates (e.g. fail if
irr.pooled.icc < 0.5 — ensemble has collapsed to one opinion)
Scaffold change (cross-repo)
Extends agent-builder#191 Half B (scaffold expansion). The rendered canonical-eval.ts template emits a --report flag that fires evalReportingSuite post-run. The rendered eval-report.ts script reads analysis.json and prints a human-readable summary. Every newly-scaffolded agent inherits the closure loop without thinking about it.
Explicit non-goals (deferred to #8 — substrate triage)
These primitives exist in the substrate but NOT in the reporting suite. They go through #8's triage instead:
- PRM module (
PrmGrader, prmBestOfN, prmEnsembleBestOfN) — no consumer step-grades or trains a process reward model
- Training-data exporters (
toDpoRows, toGrpoRows, toSftRows, toPrmRows) — no consumer fine-tunes against this output
createAntiSlopJudge — consumers run text-keyword anti-slop heuristics that don't match the substrate primitive's shape
calibrationCurves, full correlationStudy — useful in theory, no consumer persists outcomes to feed them
runBehavioralCanaries — consumers use ad-hoc canary tests; substrate primitive shape needs validation
These are NOT in the suite because firing them without a consumer use case adds noise to analysis.json that nobody actions on. Better outcome: triage them via #8 (keep behind @experimental, find a real use case, or delete).
Completion checklist
Coordination
Companion docs
Why this matters
The 2026-05-22 cross-repo audit (docs) found that consumers use ~25% of the substrate surface. The audit fix (issues #75, tangle-network/agent-builder#191, tax-agent#81, legal-agent#85, creative-agent#124, gtm-agent#136) closes the integration gap — consumers wire what's already there.
This issue closes the closure gap. The higher-order runtime abstractions (
TraceEmitter,runEvalCampaign,AnalystRegistry) capture data — they don't auto-invoke the statistical primitives. Each is a pure function the consumer has to call explicitly. So consumers stop at "we ran the campaign" and never run the analysis that would tell them whether their judge ensemble is real, whether they're seeing reward-hacking, or what their failure clusters look like.Fix: ship one substrate primitive that runs the 5 closure-loop primitives in one call and emits a structured
analysis.jsonartifact. Scaffold-template wires it into every new agent's canonical eval so the loop comes for free.In scope — 5 primitives, sharp
irrcorpusInterRaterAgreementFromJudgeScores(records, ensemble, dims)cost{ totalUsd, p50, p95, breaches: budgetBreachView(store, ceilingUsd) }costUsd: 0. Cost as an evaluation dim doesn't exist outside of agent-builder.pipelines{ failures: failureClusterView(store), regressions: regressionView(store, baseline), judgeAgreement: judgeAgreementView(store), toolWaste: toolWasteView(store), stuckLoops: stuckLoopView(store) }TraceStore. ZERO consumer use. Every consumer would action on a failure-cluster dashboard if they realized it was free.rewardHackingdetectRewardHacking(records)predictiveValidityrubricPredictiveValidity(records, outcomeStore)— gatedoutcomeStoreis non-null. Without persisted deployment outcomes the primitive can't run. Suite degrades cleanly tonullfor this member, never throws.Signature
Each member is
nullwhen its input requirement isn't met (no IRR ifrecords[].outcome.judgeScoresis missing; nopredictiveValidityifoutcomeStoreis omitted). Fail-loud doctrine: never silently zero a member — null means "not computable from these inputs," and the caller can see why.Output
<runDir>/analysis.jsonalongsiderecords.jsonl/scores.json. Read by:/app/admin/findingsUI to surface IRR + failure-cluster trendspnpm eval:reportscript (rendered by agent-builder scaffold)irr.pooled.icc < 0.5— ensemble has collapsed to one opinion)Scaffold change (cross-repo)
Extends agent-builder#191 Half B (scaffold expansion). The rendered
canonical-eval.tstemplate emits a--reportflag that firesevalReportingSuitepost-run. The renderedeval-report.tsscript readsanalysis.jsonand prints a human-readable summary. Every newly-scaffolded agent inherits the closure loop without thinking about it.Explicit non-goals (deferred to #8 — substrate triage)
These primitives exist in the substrate but NOT in the reporting suite. They go through #8's triage instead:
PrmGrader,prmBestOfN,prmEnsembleBestOfN) — no consumer step-grades or trains a process reward modeltoDpoRows,toGrpoRows,toSftRows,toPrmRows) — no consumer fine-tunes against this outputcreateAntiSlopJudge— consumers run text-keyword anti-slop heuristics that don't match the substrate primitive's shapecalibrationCurves, fullcorrelationStudy— useful in theory, no consumer persists outcomes to feed themrunBehavioralCanaries— consumers use ad-hoc canary tests; substrate primitive shape needs validationThese are NOT in the suite because firing them without a consumer use case adds noise to
analysis.jsonthat nobody actions on. Better outcome: triage them via #8 (keep behind@experimental, find a real use case, or delete).Completion checklist
src/reporting-suite.tscreated withevalReportingSuite+EvalReportingSuiteOpts+EvalReportingSuiteReporttypesnullwith no throwtests/reporting-suite.test.ts— every member has both happy-path and null-path coverageRunRecord[]corpus, assertanalysis.jsonshape; assert all members non-null with full inputs; assert specific members null with partial inputs (no outcomeStore → predictiveValidity null; no budgetCeilingUsd → cost.breaches null)src/index.ts(root) + capability-area "Run record + outcome shape"tests/consumer-contract.test.tsupdated to pin the new exportstemplates/eval-report.tsrenders thepnpm eval:reportconsumer surfaceCoordination
pnpm eval:reportscriptRunRecord.outcome.judgeScores: JudgeScoresRecord(0.31.0 — shipped) and the consumer specs migrating off the side-channel cast (in flight via the 4 consumer issues)Companion docs