This issue body carries the executive summary + completion checklist + cross-spec coordination.
The full spec contains all migration tasks with file:line / current code / target code / why / test impact.
1. Executive summary
The five-vertical audit found that all four product consumers (tax, legal, creative, gtm) hand-roll the same five patterns, that legal carries two additional patterns that should be universal, and that agent-builder ships one effect-size helper (cliffsDelta) the substrate is missing. Eight new primitives absorb every one of these patterns and unblock the deletion of ~500 lines of duplicated, drifted code across consumers.
Effort estimate (calendar): 2-3 engineer days end-to-end.
T01, T03, T05, T08: pure-function helpers, ~30-60 LOC each + unit tests. Half a day for all four.
T02: ~120 LOC + integration tests against a stub backend. Half a day.
T04: ~80 LOC + golden-file tests round-tripping a real OtlpExport. Half a day.
T06 + T07: ~150 LOC + a crash-resume integration test that simulates the legal scenario. One day.
Wire up consumer-contract.test.ts updates, package.json exports, CHANGELOG.md, dist/openapi.json regenerate. Half a day.
Impact:
500+ lines of duplicated, drifted code deleted across tax/legal/creative/gtm/agent-builder once consumer specs land (full counts in §2).
Cross-family judge enforcement consolidated to one regex map. No more drift between lib/judge-ensemble.ts versions.
Fetch capture moves from four hand-rolled implementations (legal carries two copies of its own) to one substrate primitive that already understands the redactor + provider-derivation chain.
Three OTLP flatteners replaced by one canonical projection that round-trips through OtlpFileTraceStore without consumers re-deriving the line shape.
runDurableEval lifts legal's stale-lease-reclaim + per-persona checkpoint pattern so tax/creative/gtm can adopt durability without copying 200+ lines of legal's canonical.ts.
cliffsDelta becomes a substrate primitive alongside pairedWilcoxon + pairedBootstrap, removing agent-builder's "the substrate doesn't ship one — it's small enough to keep here" comment at differential-eval.ts:83.
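As a rough illustration of the T08 effect-size helper, the sketch below computes Cliff's delta by comparing every (after, before) pair and maps the magnitude to the thresholds commonly attributed to Romano et al. (0.147 / 0.33 / 0.474). The function names and argument order follow this spec; the field names on CliffsDeltaResult and the sign convention (positive when `after` tends to exceed `before`) are assumptions, not the substrate's confirmed contract.

```typescript
// Hypothetical shapes: the spec names these types but does not show their fields.
type CliffsMagnitude = "negligible" | "small" | "medium" | "large";

interface CliffsDeltaResult {
  delta: number; // in [-1, 1]
  magnitude: CliffsMagnitude;
}

// Maps |d| to a magnitude label using the 0.147 / 0.33 / 0.474 cut-offs.
function interpretCliffs(d: number): CliffsMagnitude {
  const abs = Math.abs(d);
  if (abs < 0.147) return "negligible";
  if (abs < 0.33) return "small";
  if (abs < 0.474) return "medium";
  return "large";
}

// Cliff's delta: (#pairs where after > before, minus #pairs where after < before) / (n * m).
// Assumed sign convention: positive means `after` tends to be larger.
function cliffsDelta(
  before: readonly number[],
  after: readonly number[],
): CliffsDeltaResult {
  if (before.length === 0 || after.length === 0) {
    throw new RangeError("cliffsDelta needs two non-empty samples");
  }
  let greater = 0;
  let less = 0;
  for (const a of after) {
    for (const b of before) {
      if (a > b) greater++;
      else if (a < b) less++;
    }
  }
  const delta = (greater - less) / (before.length * after.length);
  return { delta, magnitude: interpretCliffs(delta) };
}

// Usage: every `after` score beats every `before` score, so delta is 1.
const improved = cliffsDelta([1, 2, 3], [4, 5, 6]);
console.log(improved.delta, improved.magnitude); // 1 large
```

The all-pairs loop is O(n·m), which is usually acceptable for eval-sized samples; a sort-based variant would be the natural optimisation if that stops being true.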
Non-goals (explicit, deferred to 0.33 or later — see §8):
MultiTurnScenarioPayload<TBehavior> generic (agent-builder local). Deferred to 0.33 per audit synthesis line 38.
C18 tests/paired-stats.test.ts extended: ≥6 cases including Romano-threshold boundaries + parity check vs agent-builder's local impl (T08).
Export verification (root + subpaths)
C19 src/index.ts re-exports added for T01 (assertCrossFamily, judgeFamily, JudgeFamilyError, JudgeFamily).
C20 src/index.ts re-exports added for T02 (captureFetchToRawSink, CaptureFetchContext, CaptureFetchOptions).
C21 src/index.ts re-exports added for T03 (weightedComposite, WeightedCompositeInput, WeightedCompositeResult, CompositeError).
C22 src/index.ts re-exports added for T04 (flattenOtlpExportToNdjson, FlattenOtlpOptions, OtlpFlatLine), via the existing export * from './trace-analyst' at line 292.
C23 src/index.ts re-exports added for T05 (assertSingleBackend, BackendDescriptor, SingleBackendReport, AssertSingleBackendOptions, SingleBackendError).
C30 tests/consumer-contract.test.ts: ROOT_ERROR_CLASSES extended with JudgeFamilyError, CompositeError, SingleBackendError.
C31 Build artifacts regenerate: pnpm build && pnpm openapi — verify dist/index.d.ts contains every new export with the correct stability tag (@stable for T01/T02/T03/T04/T05/T08; @experimental for T06/T07).
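For orientation on what the T03 exports listed above (weightedComposite, WeightedCompositeInput, WeightedCompositeResult, CompositeError) might look like, here is a minimal sketch. Only the call shape `weightedComposite({ dims, weights, threshold? })` and the tested behaviours (NaN rejection, weight normalisation, threshold pass/fail) come from this spec; the field names, the error messages, and the exact normalisation rule are assumptions.

```typescript
// Hypothetical error class and field names; only the symbol names come from the spec.
class CompositeError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "CompositeError";
  }
}

interface WeightedCompositeInput {
  dims: Record<string, number>; // per-dimension scores
  weights: Record<string, number>; // non-negative weights, need not sum to 1
  threshold?: number;
}

interface WeightedCompositeResult {
  score: number;
  passed?: boolean; // only set when a threshold is supplied
}

function weightedComposite(input: WeightedCompositeInput): WeightedCompositeResult {
  const { dims, weights, threshold } = input;
  let weightSum = 0;
  let total = 0;
  for (const [name, weight] of Object.entries(weights)) {
    const value = dims[name];
    // Reject missing, NaN, and infinite scores instead of letting them poison the mean.
    if (value === undefined || !Number.isFinite(value)) {
      throw new CompositeError(`dimension "${name}" is missing or not finite`);
    }
    if (!Number.isFinite(weight) || weight < 0) {
      throw new CompositeError(`weight for "${name}" must be a finite, non-negative number`);
    }
    weightSum += weight;
    total += weight * value;
  }
  if (weightSum === 0) {
    throw new CompositeError("weights sum to zero");
  }
  // Dividing by the weight sum normalises the weights implicitly.
  const score = total / weightSum;
  return threshold === undefined ? { score } : { score, passed: score >= threshold };
}

// Usage: (3 * 1 + 1 * 0.5) / 4 = 0.875, which clears a 0.8 threshold.
const result = weightedComposite({
  dims: { accuracy: 1, tone: 0.5 },
  weights: { accuracy: 3, tone: 1 },
  threshold: 0.8,
});
console.log(result.score, result.passed); // 0.875 true
```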
Documentation + release ops
C32 CHANGELOG.md: 0.32.0 section authored, every primitive listed with one-line description + audit-driven motivation.
C33 docs/concepts.md: update if (and only if) the conceptual mental-model changed. T01-T05 + T08 fit existing capability areas; T06-T07 add a new "Durable eval orchestration" callout in the campaign section.
C34 .claude/skills/agent-eval/SKILL.md: add directives for the new primitives so consumer migrations land with the right shape from the first prompt. One directive per primitive, citing the consumer file that motivated it.
C35 No docs/wire-protocol.md change required; none of the new primitives sit on the wire surface.
C40 /tmp/audit/spec-agent-builder.md migration section authored, cross-referencing T03/T08 (agent-builder doesn't need T01/T02/T04, which are already substrate-native; T05/T06 are open enhancements).
Release
C41 Release branch release/0.32.0 cut from main; PR opened with all eight primitives + tests + exports + changelog.
C42 All tests green in CI (pnpm test && pnpm typecheck && pnpm lint); consumer-contract test confirms the new symbols are exported.
C43 @tangle-network/agent-eval@0.32.0 published to npm; consumer specs unblocked.
10. Downstream coordination
This is the first spec to ship from the cross-repo audit because the four consumer specs and the agent-builder spec all reference substrate primitives that don't exist yet.
Release ordering
┌─────────────────────────────────────────────────────────┐
│ THIS SPEC: agent-eval 0.32.0                            │
│ ships T01-T08, all 8 primitives                         │
│ unblocks every consumer migration                       │
└─────────────────────────────────────────────────────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
  spec-tax-agent     spec-legal-agent   spec-creative-agent
  spec-gtm-agent    spec-agent-builder
  (5 specs; can ship in parallel after 0.32.0 lands)
Cross-link verification
Each consumer spec must cite the substrate primitives it depends on, by primitive id (T01-T08). The substrate spec (this doc) does not need to know the consumer details; it only promises that the primitives ship at 0.32.0 and pins their signatures.
The audit author's deliverable:
After this spec is filed as a GitHub issue + 0.32.0 lands, each consumer spec gets a "References substrate primitives: T0X, T0Y, T0Z (delivered in @tangle-network/agent-eval@0.32.0)" pin at the top.
No retro-changes to this spec after 0.32.0 publishes — any post-publish iteration is a 0.33.0 spec.
Communication
PR description must include:
Link to /tmp/audit/SYNTHESIS.md (motivation: five-vertical drift).
Link to each consumer audit (tax-agent-integration.md, legal-agent-integration.md, creative-agent-integration.md, gtm-agent-integration.md, agent-builder-integration.md).
Pin to this spec doc.
Reviewer hint: every new export must be in consumer-contract.test.ts:ROOT_RUNTIME_SYMBOLS (or ROOT_ERROR_CLASSES for the three new error classes).
Post-publish:
Each consumer repo's CLAUDE.md / SKILL.md updates to reference the new primitives in the migration spec — same PR that does the migration.
A "substrate 0.32.0 adoption" tracking issue in tangle-network/tangle-ops (Drew's ops board) lists the four consumer specs and closes each one as it lands.
The eight primitives (T01-T08):

| ID | Primitive | Source file | Subpath | Stability |
| --- | --- | --- | --- | --- |
| T01 | assertCrossFamily(judges, opts) + judgeFamily(modelId) | src/judge-families.ts | | @stable |
| T02 | captureFetchToRawSink(fetch, sink, opts) | src/trace/capture-fetch.ts | ./traces | @stable |
| T03 | weightedComposite({ dims, weights, threshold? }) | src/composite.ts | | @stable |
| T04 | flattenOtlpExportToNdjson(export, opts) | src/trace-analyst/otlp-flatten.ts | ./traces | @stable |
| T05 | assertSingleBackend(agent, judge, opts) | src/integrity/single-backend.ts | | @stable |
| T06 | runDurableEval<TPersona, TResult>(opts) | src/optimization/durable-eval.ts | ./optimization | @experimental |
| T07 | buildPersonaErrorResult(personaId, error, opts) | src/optimization/durable-eval.ts | ./optimization | @experimental |
| T08 | cliffsDelta(before, after) + interpretCliffs(d) | src/paired-stats.ts (extend) | ./reporting | @stable |
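T01 pairs judgeFamily(modelId) with assertCrossFamily(judges, opts) behind a single regex map. The sketch below shows one way that pairing could work. The family names, the regex patterns, the `minFamilies` option, and the rule that a judge ensemble must span at least two known families are all illustrative assumptions; the spec does not state the actual enforcement rule or the contents of the map.

```typescript
// Hypothetical family set; the real JudgeFamily union is not shown in this spec.
type JudgeFamily = "openai" | "anthropic" | "google" | "unknown";

class JudgeFamilyError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "JudgeFamilyError";
  }
}

// One regex map as the single source of truth (patterns are illustrative only).
const FAMILY_PATTERNS: ReadonlyArray<readonly [JudgeFamily, RegExp]> = [
  ["openai", /(^|\/)(gpt-|o[0-9])/i],
  ["anthropic", /(^|\/)claude/i],
  ["google", /(^|\/)gemini/i],
];

function judgeFamily(modelId: string): JudgeFamily {
  for (const [family, pattern] of FAMILY_PATTERNS) {
    if (pattern.test(modelId)) return family;
  }
  return "unknown";
}

// Assumed rule: the ensemble must cover at least `minFamilies` distinct known families.
function assertCrossFamily(
  judges: readonly string[],
  opts: { minFamilies?: number } = {},
): void {
  const minFamilies = opts.minFamilies ?? 2;
  const families = new Set<JudgeFamily>(judges.map(judgeFamily));
  families.delete("unknown"); // unrecognised ids do not count toward coverage
  if (families.size < minFamilies) {
    throw new JudgeFamilyError(
      `judges span ${families.size} known families; at least ${minFamilies} required`,
    );
  }
}

// Usage: two families pass; two models from one family would throw.
assertCrossFamily(["gpt-4o", "anthropic/claude-3-5-sonnet"]);
```

Keeping the patterns in one exported table is what lets every consumer share the same classification instead of maintaining a local copy.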
Non-goals (explicit, deferred to 0.33 or later; see §8), remaining items:
MetricsRollup / BackendIntegrityReport extensions. Deferred.
[…] agent-builder, not substrate.
5. Completion checklist (43 boxes)
Primitive ship + test
C01 src/judge-families.ts created, assertCrossFamily + judgeFamily + JudgeFamilyError + JudgeFamily exported (T01).
C02 tests/judge-families.test.ts created, ≥10 cases passing including regex coverage of all 30+ consumer model ids (T01).
C03 src/trace/capture-fetch.ts created, captureFetchToRawSink exported, redactor and provider-derivation inherited from ./raw-provider-sink (T02).
C04 tests/trace/capture-fetch.test.ts created: happy path, error path, redaction, truncation, capture-failure handling (T02).
C05 […] captureFetchToRawSink into a real createOpenAICompatibleBackend against a local stub server (T02).
C06 src/composite.ts created, weightedComposite + CompositeError + types exported (T03).
C07 tests/composite.test.ts created: ≥8 cases including NaN rejection, weight normalisation, threshold pass/fail (T03).
C08 src/trace-analyst/otlp-flatten.ts created, flattenOtlpExportToNdjson + types exported (T04).
C09 tests/trace-analyst/otlp-flatten.test.ts created, includes round-trip through OtlpFileTraceStore (T04).
C10 src/integrity/single-backend.ts created, assertSingleBackend + SingleBackendError + types exported (T05).
C11 tests/integrity/single-backend.test.ts created: ≥7 cases including the legal failure scenario regression (T05).
C12 src/optimization/durable-eval.ts created, runDurableEval + buildPersonaErrorResult + types exported (T06 + T07).
C13 tests/optimization/durable-eval.test.ts created: unit cases against InMemoryDurableRunStore (T06).
C14 […] FileSystemDurableRunStore (process.exit mid-loop, re-invoke, assert resume); replicates legal's LEGAL_EVAL_CRASH_AFTER_PERSONA scenario (T06).
C15 […] lease.json, invoke with retryOnStaleLease: true, assert recovery (T06).
C16 buildPersonaErrorResult test: returns a validateRunRecord-passing record that assertRealBackend classifies as stub (T07).
C17 src/paired-stats.ts extended with cliffsDelta + interpretCliffs + CliffsDeltaResult + CliffsMagnitude (T08).
Export verification (root + subpaths)
C24 src/index.ts re-exports added for T06 + T07 (runDurableEval, buildPersonaErrorResult, RunDurableEvalOptions, RunDurableEvalResult, BuildPersonaErrorResultOptions).
C25 src/index.ts paired-stats re-export block (lines 988-993) extended for T08 (cliffsDelta, interpretCliffs, CliffsDeltaResult, CliffsMagnitude).
C26 src/optimization.ts re-exports added for T06 + T07 (durable-eval subpath surface).
C27 src/traces.ts propagation verified: T02 + T04 reachable via ./traces (no edit needed; verify in build).
C28 src/reporting.ts re-exports added for T08 (cliffsDelta, interpretCliffs, CliffsDeltaResult, CliffsMagnitude).
C29 tests/consumer-contract.test.ts:ROOT_RUNTIME_SYMBOLS extended with assertCrossFamily, judgeFamily, captureFetchToRawSink, weightedComposite, flattenOtlpExportToNdjson, assertSingleBackend, runDurableEval, buildPersonaErrorResult, cliffsDelta, interpretCliffs.
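To picture what T04's flattenOtlpExportToNdjson does conceptually, the sketch below walks the standard OTLP/JSON nesting (resourceSpans, then scopeSpans, then spans) and emits one JSON object per span, one per line. The input interfaces are a minimal subset of the OTLP trace JSON shape, and the OtlpFlatLine fields are hypothetical: the spec names the type but does not show the canonical line shape that OtlpFileTraceStore expects. The `opts` parameter from the spec's signature is omitted here.

```typescript
// Minimal subset of the OTLP/JSON trace export; the library's own OtlpExport may carry more.
interface OtlpAnyValue {
  stringValue?: string;
  intValue?: string | number; // OTLP/JSON encodes int64 as a string
  doubleValue?: number;
  boolValue?: boolean;
}
interface OtlpKeyValue { key: string; value: OtlpAnyValue }
interface OtlpSpan {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  name: string;
  startTimeUnixNano: string;
  endTimeUnixNano: string;
  attributes?: OtlpKeyValue[];
}
interface OtlpExport {
  resourceSpans: Array<{
    resource?: { attributes?: OtlpKeyValue[] };
    scopeSpans: Array<{ scope?: { name?: string }; spans: OtlpSpan[] }>;
  }>;
}

type Scalar = string | number | boolean;

// Hypothetical flat line shape, for illustration only.
interface OtlpFlatLine {
  traceId: string;
  spanId: string;
  parentSpanId: string | null;
  name: string;
  scope: string | null;
  startTimeUnixNano: string;
  endTimeUnixNano: string;
  resource: Record<string, Scalar>;
  attributes: Record<string, Scalar>;
}

// Collapses OTLP key/value pairs into a plain record, keeping scalar values only.
function toRecord(attrs: OtlpKeyValue[] = []): Record<string, Scalar> {
  const out: Record<string, Scalar> = {};
  for (const { key, value } of attrs) {
    const scalar =
      value.stringValue ?? value.intValue ?? value.doubleValue ?? value.boolValue;
    if (scalar !== undefined) out[key] = scalar;
  }
  return out;
}

function flattenOtlpExportToNdjson(otlp: OtlpExport): string {
  const lines: string[] = [];
  for (const rs of otlp.resourceSpans) {
    const resource = toRecord(rs.resource?.attributes);
    for (const ss of rs.scopeSpans) {
      for (const span of ss.spans) {
        const line: OtlpFlatLine = {
          traceId: span.traceId,
          spanId: span.spanId,
          parentSpanId: span.parentSpanId ?? null,
          name: span.name,
          scope: ss.scope?.name ?? null,
          startTimeUnixNano: span.startTimeUnixNano,
          endTimeUnixNano: span.endTimeUnixNano,
          resource,
          attributes: toRecord(span.attributes),
        };
        lines.push(JSON.stringify(line));
      }
    }
  }
  return lines.join("\n");
}
```

Repeating the resource attributes on every line costs some bytes but keeps each NDJSON line self-describing, which is what makes line-by-line round-tripping possible.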
Consumer migration preparation
C36 /tmp/audit/spec-tax-agent.md migration section authored, cross-referencing T01/T02/T03/T05/T08.
C37 /tmp/audit/spec-legal-agent.md migration section authored, cross-referencing T01/T02/T04/T05/T06/T07.
C38 /tmp/audit/spec-creative-agent.md migration section authored, cross-referencing T01/T02/T03/T04/T05/T08.
C39 /tmp/audit/spec-gtm-agent.md migration section authored, cross-referencing T01/T03/T04/T05/T08.
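The durable-eval primitive (T06) that the legal migration depends on combines two ideas named in this spec: a per-persona checkpoint and stale-lease reclaim. The sketch below is a deliberately simplified, synchronous, in-memory illustration of both. The store class, the option names other than retryOnStaleLease, and the injected clock are hypothetical; the real runDurableEval is presumably asynchronous and backed by the spec's InMemoryDurableRunStore or FileSystemDurableRunStore.

```typescript
interface Lease { owner: string; expiresAt: number }

// Hypothetical stand-in for a durable run store (not the substrate's class).
class SketchRunStore<TResult> {
  readonly checkpoints = new Map<string, TResult>();
  lease: Lease | undefined;
}

interface SketchOptions<TPersona extends { id: string }, TResult> {
  personas: readonly TPersona[];
  store: SketchRunStore<TResult>;
  evaluate: (persona: TPersona) => TResult;
  owner: string; // identifies the worker taking the lease
  now: number; // injected clock, in ms, to keep the sketch deterministic
  leaseMs: number;
  retryOnStaleLease?: boolean;
}

function runDurableEvalSketch<TPersona extends { id: string }, TResult>(
  opts: SketchOptions<TPersona, TResult>,
): { results: TResult[]; resumed: string[] } {
  const { store, now } = opts;
  const held = store.lease;
  if (held !== undefined && held.owner !== opts.owner) {
    const stale = held.expiresAt <= now;
    // A live lease always blocks; a stale one is reclaimed only on request.
    if (!stale || opts.retryOnStaleLease !== true) {
      throw new Error(`run is leased by ${held.owner}`);
    }
  }
  store.lease = { owner: opts.owner, expiresAt: now + opts.leaseMs };

  const results: TResult[] = [];
  const resumed: string[] = [];
  for (const persona of opts.personas) {
    if (store.checkpoints.has(persona.id)) {
      // Completed in an earlier attempt: reuse the checkpoint, skip the work.
      results.push(store.checkpoints.get(persona.id) as TResult);
      resumed.push(persona.id);
      continue;
    }
    const result = opts.evaluate(persona); // may throw; earlier checkpoints survive
    store.checkpoints.set(persona.id, result);
    results.push(result);
  }
  store.lease = undefined; // released only after a clean finish
  return { results, resumed };
}
```

Because the lease is released only on clean completion, a crashed worker leaves it behind; the next worker can proceed only once the lease has expired and it opts in with retryOnStaleLease, at which point it re-runs only the personas that never checkpointed.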
The audit author's deliverable:
[…] (/tmp/audit/spec-agent-builder.md) cites T03 + T08 + (optionally) T05/T06.
Companion docs
Filed automatically from the 2026-05-22 cross-repo audit.