[0.32.0] Absorb 8 hand-rolled patterns from 5 consumers (assertCrossFamily / captureFetchToRawSink / weightedComposite / flattenOtlpExportToNdjson / assertSingleBackend / runDurableEval / buildPersonaErrorResult / cliffsDelta) #75

@tangletools

Canonical full spec (1200-2000 lines): https://github.com/tangle-network/agent-eval/blob/chore/cross-repo-eval-audit-2026q2/docs/audits/2026-05-22-cross-repo/spec-agent-eval-substrate.md

This issue body carries the executive summary + completion checklist + cross-spec coordination.
The full spec contains all migration tasks with file:line / current code / target code / why / test impact.


1. Executive summary

The five-vertical audit found that all four product consumers (tax, legal, creative, gtm) hand-roll the same five patterns, that legal carries two additional patterns that should be universal, and that agent-builder ships one effect-size helper (cliffsDelta) the substrate is missing. Eight new primitives absorb every one of these patterns and unblock the deletion of ~500 lines of duplicated, drifting code across consumers.

| #   | Primitive | File | Surface | Stability |
|-----|-----------|------|---------|-----------|
| T01 | assertCrossFamily(judges, opts) + judgeFamily(modelId) | src/judge-families.ts | root | @stable |
| T02 | captureFetchToRawSink(fetch, sink, opts) | src/trace/capture-fetch.ts | root + ./traces | @stable |
| T03 | weightedComposite({ dims, weights, threshold? }) | src/composite.ts | root | @stable |
| T04 | flattenOtlpExportToNdjson(export, opts) | src/trace-analyst/otlp-flatten.ts | root + ./traces | @stable |
| T05 | assertSingleBackend(agent, judge, opts) | src/integrity/single-backend.ts | root | @stable |
| T06 | runDurableEval<TPersona, TResult>(opts) | src/optimization/durable-eval.ts | root + ./optimization | @experimental |
| T07 | buildPersonaErrorResult(personaId, error, opts) | src/optimization/durable-eval.ts | root + ./optimization | @experimental |
| T08 | cliffsDelta(before, after) + interpretCliffs(d) | src/paired-stats.ts (extend) | root + ./reporting | @stable |
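
To make the T01 row concrete, here is a minimal sketch of a regex-map family check. Only the function names and the argument lists come from the table; the family names, the patterns, the `minFamilies` option, and the use of a plain `Error` (the spec exports a `JudgeFamilyError`) are assumptions made for illustration.

```typescript
// Hypothetical family map: these patterns are illustrative, not the list the spec pins.
type JudgeFamily = "openai" | "anthropic" | "google" | "unknown";

const FAMILY_PATTERNS: Array<{ family: JudgeFamily; pattern: RegExp }> = [
  { family: "openai", pattern: /^(openai\/)?(gpt-|o\d)/i },
  { family: "anthropic", pattern: /^(anthropic\/)?claude/i },
  { family: "google", pattern: /^(google\/)?gemini/i },
];

// Map a model id to its family by the first matching pattern.
function judgeFamily(modelId: string): JudgeFamily {
  for (const entry of FAMILY_PATTERNS) {
    if (entry.pattern.test(modelId)) return entry.family;
  }
  return "unknown";
}

// Assumed option: the minimum number of distinct families the judges must span.
function assertCrossFamily(
  judges: string[],
  opts: { minFamilies?: number } = {},
): JudgeFamily[] {
  const minFamilies = opts.minFamilies === undefined ? 2 : opts.minFamilies;
  const families: JudgeFamily[] = [];
  for (const modelId of judges) {
    const family = judgeFamily(modelId);
    if (family === "unknown") {
      throw new Error(`unrecognised judge model id: ${modelId}`);
    }
    if (families.indexOf(family) === -1) families.push(family);
  }
  if (families.length < minFamilies) {
    throw new Error(
      `judges span ${families.length} family(ies) (${families.join(", ")}); need at least ${minFamilies}`,
    );
  }
  return families;
}
```

Keeping the patterns in one ordered array is what lets a single substrate file replace the per-consumer copies: adding a model family is a one-line change that every consumer picks up on upgrade.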

Effort estimate (calendar): 2-3 engineer days end-to-end.

  • T01, T03, T05, T08: pure-function helpers, ~30-60 LOC each + unit tests. Half a day for all four.
  • T02: ~120 LOC + integration tests against a stub backend. Half a day.
  • T04: ~80 LOC + golden-file tests round-tripping a real OtlpExport. Half a day.
  • T06 + T07: ~150 LOC + a crash-resume integration test that simulates the legal scenario. One day.
  • Wiring: update consumer-contract.test.ts, package.json exports, and CHANGELOG.md, and regenerate dist/openapi.json. Half a day.
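
As a sense of scale for the pure-function helpers estimated above, the following is a rough sketch of what T03 could look like. The argument keys (`dims`, `weights`, `threshold`) come from the spec's signature; the field types, the result shape, and the use of a plain `Error` instead of the spec's `CompositeError` are assumptions.

```typescript
// Hypothetical shapes: only the key names dims / weights / threshold are from the spec.
interface WeightedCompositeInput {
  dims: Record<string, number>; // per-dimension scores
  weights: Record<string, number>; // non-negative weights, normalised internally
  threshold?: number; // optional pass/fail cut-off on the composite
}

interface WeightedCompositeResult {
  score: number;
  passed: boolean | undefined; // undefined when no threshold was supplied
}

function weightedComposite(input: WeightedCompositeInput): WeightedCompositeResult {
  let totalWeight = 0;
  let weightedSum = 0;
  for (const name of Object.keys(input.weights)) {
    const weight = input.weights[name];
    const value = input.dims[name];
    // Reject NaN, Infinity and missing dimensions instead of averaging them in.
    if (typeof value !== "number" || !isFinite(value)) {
      throw new Error(`weightedComposite: dimension "${name}" is missing or not finite`);
    }
    if (typeof weight !== "number" || !isFinite(weight) || weight < 0) {
      throw new Error(`weightedComposite: weight for "${name}" must be finite and non-negative`);
    }
    totalWeight += weight;
    weightedSum += weight * value;
  }
  if (totalWeight === 0) {
    throw new Error("weightedComposite: weights sum to zero");
  }
  // Dividing by the total weight normalises the weights to sum to 1.
  const score = weightedSum / totalWeight;
  const passed = input.threshold === undefined ? undefined : score >= input.threshold;
  return { score, passed };
}
```

Rejecting non-finite inputs up front is the design choice that matters here: a silent NaN in one dimension would otherwise propagate into the composite and make every threshold comparison false.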

Impact:

  • ~500+ lines of duplicated drift deleted across tax/legal/creative/gtm/agent-builder once consumer specs land (full counts in §2).
  • Cross-family judge enforcement consolidated to one regex map. No more drift between lib/judge-ensemble.ts versions.
  • Fetch capture moves from four hand-rolled implementations (legal carries two copies of one of them) to one substrate primitive that already understands the redactor + provider-derivation chain.
  • Three OTLP flatteners replaced by one canonical projection that round-trips through OtlpFileTraceStore without consumers re-deriving the line shape.
  • runDurableEval lifts legal's stale-lease-reclaim + per-persona checkpoint pattern so tax/creative/gtm can adopt durability without copying 200+ lines of legal's canonical.ts.
  • cliffsDelta becomes a substrate primitive alongside pairedWilcoxon + pairedBootstrap, removing agent-builder's "the substrate doesn't ship one — it's small enough to keep here" comment at differential-eval.ts:83.
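
The per-persona checkpoint pattern behind runDurableEval can be pictured with the sketch below. It is deliberately simplified and everything in it is an assumption: the `CheckpointStore` interface, the in-memory store, the option names, and the synchronous `runPersona` callback are stand-ins (a real eval loop would presumably be async), and stale-lease reclaim is not modelled at all.

```typescript
// Hypothetical checkpoint store; the real durable run stores are not shown in this summary.
interface CheckpointStore<TResult> {
  load(personaId: string): TResult | undefined;
  save(personaId: string, result: TResult): void;
}

function createMemoryStore<TResult>(): CheckpointStore<TResult> {
  const saved: Record<string, TResult> = {};
  return {
    load(personaId) {
      return Object.prototype.hasOwnProperty.call(saved, personaId) ? saved[personaId] : undefined;
    },
    save(personaId, result) {
      saved[personaId] = result;
    },
  };
}

interface DurableEvalSummary<TResult> {
  results: TResult[];
  resumed: string[]; // personas served from a checkpoint
  executed: string[]; // personas actually run in this invocation
}

function runDurableEvalSketch<TPersona extends { id: string }, TResult>(opts: {
  personas: TPersona[];
  store: CheckpointStore<TResult>;
  runPersona: (persona: TPersona) => TResult;
  // Assumed hook: converts a failure into a result record (the role T07 plays).
  onError?: (personaId: string, error: unknown) => TResult;
}): DurableEvalSummary<TResult> {
  const summary: DurableEvalSummary<TResult> = { results: [], resumed: [], executed: [] };
  for (const persona of opts.personas) {
    const checkpoint = opts.store.load(persona.id);
    if (checkpoint !== undefined) {
      // Already completed in an earlier invocation: reuse it instead of re-running.
      summary.results.push(checkpoint);
      summary.resumed.push(persona.id);
      continue;
    }
    let result: TResult;
    try {
      result = opts.runPersona(persona);
    } catch (error) {
      if (!opts.onError) throw error; // no handler: the run crashes here
      result = opts.onError(persona.id, error);
    }
    // Checkpoint before moving on, so a later crash cannot lose this persona.
    opts.store.save(persona.id, result);
    summary.results.push(result);
    summary.executed.push(persona.id);
  }
  return summary;
}
```

Re-invoking the function with the same store after a crash skips every persona that already has a checkpoint, which is the behaviour a crash-resume test would assert.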

Non-goals (explicit, deferred to 0.33 or later — see §8):

  • MultiTurnScenarioPayload<TBehavior> generic (agent-builder local). Deferred to 0.33 per audit synthesis line 38.
  • MetricsRollup / BackendIntegrityReport extensions. Deferred.
  • Scaffold-template propagation (the "biggest lever" finding). That's a separate spec in agent-builder, not substrate.

5. Completion checklist (43 boxes)

Primitive ship + test

  • C01 src/judge-families.ts created, assertCrossFamily + judgeFamily + JudgeFamilyError + JudgeFamily exported (T01).
  • C02 tests/judge-families.test.ts created, ≥10 cases passing including regex coverage of all 30+ consumer model ids (T01).
  • C03 src/trace/capture-fetch.ts created, captureFetchToRawSink exported, redactor and provider-derivation inherited from ./raw-provider-sink (T02).
  • C04 tests/trace/capture-fetch.test.ts created — happy path, error path, redaction, truncation, capture-failure handling (T02).
  • C05 Integration test wires captureFetchToRawSink into a real createOpenAICompatibleBackend against a local stub server (T02).
  • C06 src/composite.ts created, weightedComposite + CompositeError + types exported (T03).
  • C07 tests/composite.test.ts created — ≥8 cases including NaN rejection, weight normalisation, threshold pass/fail (T03).
  • C08 src/trace-analyst/otlp-flatten.ts created, flattenOtlpExportToNdjson + types exported (T04).
  • C09 tests/trace-analyst/otlp-flatten.test.ts created, includes round-trip through OtlpFileTraceStore (T04).
  • C10 src/integrity/single-backend.ts created, assertSingleBackend + SingleBackendError + types exported (T05).
  • C11 tests/integrity/single-backend.test.ts created — ≥7 cases including the legal failure scenario regression (T05).
  • C12 src/optimization/durable-eval.ts created, runDurableEval + buildPersonaErrorResult + types exported (T06 + T07).
  • C13 tests/optimization/durable-eval.test.ts created — unit cases against InMemoryDurableRunStore (T06).
  • C14 Crash-resume integration test against FileSystemDurableRunStore (process.exit mid-loop, re-invoke, assert resume) — replicates legal's LEGAL_EVAL_CRASH_AFTER_PERSONA scenario (T06).
  • C15 Stale-lease reclaim test — manually write lease.json, invoke with retryOnStaleLease: true, assert recovery (T06).
  • C16 buildPersonaErrorResult test — returns a validateRunRecord-passing record that assertRealBackend classifies as stub (T07).
  • C17 src/paired-stats.ts extended with cliffsDelta + interpretCliffs + CliffsDeltaResult + CliffsMagnitude (T08).
  • C18 tests/paired-stats.test.ts extended — ≥6 cases including Romano-threshold boundaries + parity check vs agent-builder's local impl (T08).
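
For reference when writing the C18 boundary cases, here is a sketch of Cliff's delta with the commonly cited Romano et al. magnitude thresholds (0.147, 0.33, 0.474). The sign convention (positive when `after` tends to exceed `before`) and the plain `number` return value are assumptions; the spec's `CliffsDeltaResult` shape is not shown in this summary, and agent-builder's local implementation may differ in either respect.

```typescript
type CliffsMagnitude = "negligible" | "small" | "medium" | "large";

// Cliff's delta: (#(after > before) - #(after < before)) / (n * m) over all cross pairs.
// Assumed sign convention: positive means `after` tends to be larger.
function cliffsDelta(before: number[], after: number[]): number {
  if (before.length === 0 || after.length === 0) {
    throw new Error("cliffsDelta: both samples must be non-empty");
  }
  let greater = 0;
  let less = 0;
  for (const a of after) {
    for (const b of before) {
      if (a > b) greater++;
      else if (a < b) less++;
      // ties contribute to neither count
    }
  }
  return (greater - less) / (before.length * after.length);
}

// Romano-style thresholds on |d|; each boundary value belongs to the larger bucket.
function interpretCliffs(d: number): CliffsMagnitude {
  const magnitude = Math.abs(d);
  if (magnitude < 0.147) return "negligible";
  if (magnitude < 0.33) return "small";
  if (magnitude < 0.474) return "medium";
  return "large";
}
```

The pairwise loop is O(n × m), which is fine for eval-sized samples; a parity test against an existing implementation would mainly need to pin the sign convention and how boundary values are bucketed.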

Export verification (root + subpaths)

  • C19 src/index.ts re-exports added for T01 (assertCrossFamily, judgeFamily, JudgeFamilyError, JudgeFamily).
  • C20 src/index.ts re-exports added for T02 (captureFetchToRawSink, CaptureFetchContext, CaptureFetchOptions).
  • C21 src/index.ts re-exports added for T03 (weightedComposite, WeightedCompositeInput, WeightedCompositeResult, CompositeError).
  • C22 src/index.ts re-exports added for T04 (flattenOtlpExportToNdjson, FlattenOtlpOptions, OtlpFlatLine) — via the existing export * from './trace-analyst' at line 292.
  • C23 src/index.ts re-exports added for T05 (assertSingleBackend, BackendDescriptor, SingleBackendReport, AssertSingleBackendOptions, SingleBackendError).
  • C24 src/index.ts re-exports added for T06 + T07 (runDurableEval, buildPersonaErrorResult, RunDurableEvalOptions, RunDurableEvalResult, BuildPersonaErrorResultOptions).
  • C25 src/index.ts paired-stats re-export block (lines 988-993) extended for T08 (cliffsDelta, interpretCliffs, CliffsDeltaResult, CliffsMagnitude).
  • C26 src/optimization.ts re-exports added for T06 + T07 (durable-eval subpath surface).
  • C27 src/traces.ts propagation verified — T02 + T04 reachable via ./traces (no edit needed; verify in build).
  • C28 src/reporting.ts re-exports added for T08 (cliffsDelta, interpretCliffs, CliffsDeltaResult, CliffsMagnitude).
  • C29 tests/consumer-contract.test.ts: ROOT_RUNTIME_SYMBOLS extended with assertCrossFamily, judgeFamily, captureFetchToRawSink, weightedComposite, flattenOtlpExportToNdjson, assertSingleBackend, runDurableEval, buildPersonaErrorResult, cliffsDelta, interpretCliffs.
  • C30 tests/consumer-contract.test.ts: ROOT_ERROR_CLASSES extended with JudgeFamilyError, CompositeError, SingleBackendError.
  • C31 Build artifacts regenerate: pnpm build && pnpm openapi — verify dist/index.d.ts contains every new export with the correct stability tag (@stable for T01/T02/T03/T04/T05/T08; @experimental for T06/T07).
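
The export checks in C29 and C30 boil down to "every listed name resolves to a callable on the root module". The sketch below shows one way to express that. The symbol lists are copied from the checklist, but the `missingExports` helper and the stand-in namespace are hypothetical; how consumer-contract.test.ts actually consumes `ROOT_RUNTIME_SYMBOLS` and `ROOT_ERROR_CLASSES` is not shown in this summary.

```typescript
// Names copied from C29 and C30; only the new 0.32.0 additions are listed here.
const ROOT_RUNTIME_SYMBOLS: string[] = [
  "assertCrossFamily",
  "judgeFamily",
  "captureFetchToRawSink",
  "weightedComposite",
  "flattenOtlpExportToNdjson",
  "assertSingleBackend",
  "runDurableEval",
  "buildPersonaErrorResult",
  "cliffsDelta",
  "interpretCliffs",
];

const ROOT_ERROR_CLASSES: string[] = ["JudgeFamilyError", "CompositeError", "SingleBackendError"];

// Hypothetical helper: returns the names that are absent or not callable.
// Classes are functions at runtime, so the same check covers error classes.
function missingExports(moduleNamespace: Record<string, unknown>, names: string[]): string[] {
  return names.filter((name) => typeof moduleNamespace[name] !== "function");
}

// Stand-in for the real root namespace, with one symbol deliberately left out.
const standInRoot: Record<string, unknown> = {};
for (const name of ROOT_RUNTIME_SYMBOLS) {
  if (name !== "interpretCliffs") standInRoot[name] = () => undefined;
}
for (const name of ROOT_ERROR_CLASSES) {
  standInRoot[name] = class extends Error {};
}

const missingRuntime = missingExports(standInRoot, ROOT_RUNTIME_SYMBOLS);
const missingErrors = missingExports(standInRoot, ROOT_ERROR_CLASSES);
```

In a real test the stand-in object would be replaced by a namespace import of the package root, and both arrays returned by the helper would be asserted empty.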

Documentation + release ops

  • C32 CHANGELOG.md: 0.32.0 section authored, every primitive listed with one-line description + audit-driven motivation.
  • C33 docs/concepts.md: update if (and only if) the conceptual mental-model changed. T01-T05 + T08 fit existing capability areas; T06-T07 add a new "Durable eval orchestration" callout in the campaign section.
  • C34 .claude/skills/agent-eval/SKILL.md: add directives for the new primitives so consumer migrations land with the right shape from the first prompt. One directive per primitive, citing the consumer file that motivated it.
  • C35 No docs/wire-protocol.md change required — none of the new primitives sit on the wire surface.

Consumer migration preparation

  • C36 /tmp/audit/spec-tax-agent.md migration section authored, cross-referencing T01/T02/T03/T05/T08.
  • C37 /tmp/audit/spec-legal-agent.md migration section authored, cross-referencing T01/T02/T04/T05/T06/T07.
  • C38 /tmp/audit/spec-creative-agent.md migration section authored, cross-referencing T01/T02/T03/T04/T05/T08.
  • C39 /tmp/audit/spec-gtm-agent.md migration section authored, cross-referencing T01/T03/T04/T05/T08.
  • C40 /tmp/audit/spec-agent-builder.md migration section authored, cross-referencing T03/T08 (agent-builder doesn't need T01/T02/T04 — already substrate-native; T05/T06 are open enhancements).

Release

  • C41 Release branch release/0.32.0 cut from main; PR opened with all eight primitives + tests + exports + changelog.
  • C42 All tests green in CI (pnpm test && pnpm typecheck && pnpm lint); consumer-contract test confirms the new symbols are exported.
  • C43 @tangle-network/agent-eval@0.32.0 published to npm; consumer specs unblocked.

10. Downstream coordination

This is the first spec to ship from the cross-repo audit because the four consumer specs and the agent-builder spec all reference substrate primitives that don't exist yet.

Release ordering

        ┌─────────────────────────────────────────────────────────┐
        │  THIS SPEC — agent-eval 0.32.0                          │
        │  ships T01-T08, all 8 primitives                        │
        │  unblocks every consumer migration                       │
        └─────────────────────────────────────────────────────────┘
                                  │
              ┌───────────────────┼───────────────────┐
              │                   │                   │
              ▼                   ▼                   ▼
        spec-tax-agent     spec-legal-agent    spec-creative-agent
        spec-gtm-agent     spec-agent-builder
        (5 specs; can ship in parallel after 0.32.0 lands)

Cross-link verification

Each consumer spec must cite the substrate primitives it depends on, by primitive id (T01-T08). The substrate spec (this doc) does not need to know the consumer details; it just promises that the primitives ship at 0.32.0 and pins their signatures.

The audit author's deliverable:

  • After this spec is filed as a GitHub issue + 0.32.0 lands, each consumer spec gets a "References substrate primitives: T0X, T0Y, T0Z (delivered in @tangle-network/agent-eval@0.32.0)" pin at the top.
  • The agent-builder spec (/tmp/audit/spec-agent-builder.md) cites T03 + T08 + (optionally) T05/T06.
  • No retro-changes to this spec after 0.32.0 publishes — any post-publish iteration is a 0.33.0 spec.

Communication

PR description must include:

  • Link to /tmp/audit/SYNTHESIS.md (motivation: five-vertical drift).
  • Link to each consumer audit (tax-agent-integration.md, legal-agent-integration.md, creative-agent-integration.md, gtm-agent-integration.md, agent-builder-integration.md).
  • Pin to this spec doc.
  • Reviewer hint: every new export must be in consumer-contract.test.ts:ROOT_RUNTIME_SYMBOLS (or ROOT_ERROR_CLASSES for the three new error classes).

Post-publish:

  • Each consumer repo's CLAUDE.md / SKILL.md is updated to reference the new primitives from the migration spec, in the same PR that does the migration.
  • A "substrate 0.32.0 adoption" tracking issue in tangle-network/tangle-ops (Drew's ops board) lists the four consumer specs and closes each one as it lands.

Companion docs

Filed automatically from the 2026-05-22 cross-repo audit.
