The headline shift: a feature PR's eval can now answer the question a single run cannot — did this change regress persona P on profile F, even while the aggregate improved?
AgentProfile+agentProfileHash— the harness's unit of variation. Model lives inside the profile (skill/tool order doesn't matter; theidlabel is excluded from identity), so "same model, different skills" is two profiles. (#78)- Append-only JSONL scorecard keyed
(scenarioId, profileHash)—recordRuns/recordRunsToScorecard/loadScorecard. Idempotent appends oneventIdso concurrent campaign runs cannot clobber. (#78) diffScorecard— per-cell verdict (improved/regressed/flat/new) using Cohen's d + Welch's t-test; the keystone CI guard isdiff.cells.filter(c => c.verdict === 'regressed').formatScorecardDiffrenders the PR-facing report. (#78)- Agent profile cells —
src/agent-profile-cell.tsextends the profile contract intoRunRecordrows andrunEvalCampaignso every campaign row is keyed by(profile, scenario, seed)end-to-end. (#79) - Stats consolidation —
pairedBootstrap, power analysis, and the paired/Welch primitives now all live insrc/statistics.ts. (#73) - LLM retry classifier unified across
llm-clientandjudge-retryviaisTransientLlmError. (#74) pr-review-benchmarksource committed — the module was exported fromindex.tssince the run-record refactor but the source files were never committed; CI onmainhas been red on #78/#79/#81 as a result. (#83)- Examples:
scorecard/,held-out-gate/,user-simulation-driver/. (#81)
No breaking changes — additive across the board.
0.32.0 shipped the completion oracle (verifyCompletion,
extractProducedState) but decideNextUserTurn — the standalone reactive
adversarial turn generator — merged after the 0.32.0 tag and never made it
into a published tarball. Consumers wiring an in-process eval loop against the
driver could import the symbol from source but not from npm.
This release publishes main as-is: decideNextUserTurn,
DecideNextUserTurnOpts, the completion verifier, and produced-state
extraction are all in dist/. No source changes — a republish that closes the
tag/npm drift.
verifyCompletion(gold, state, checkCorrectness)— the task-completion oracle. Two-stage per requirement: structural match against produced state, then an injected correctness check.completionRate/fullyCompletegate quality scoring — a fluent transcript that never produces the deliverable scores zero.extractProducedState(events)— normalizes a run'sRuntimeStreamEvent[]intoProducedState{ artifacts, proposals, toolCalls }.createLlmCorrectnessChecker(tc)— productionCorrectnessChecker.decideNextUserTurn(tc, opts)— standalone reactive adversarial turn generator extracted fromAgentDriver, for in-process eval loops.
The v0.31.0 tag's npm tarball shipped a stale dist/ — JudgeScoresRecord
was missing from dist/index.d.ts and the recordOutcome.judgeScores
propagation never made it into dist/index.js, even though the source on
the tagged commit had both. Consumers that bumped to ^0.31.0 got a
typecheck failure on RunOutcome.judgeScores (since the type wasn't
re-exported) and a silent drop on the wire (since the campaign runner
didn't carry the field through).
Cause: a build artifact picked up by the publish workflow predated the
source merge. The retag forces a clean pnpm build and republish; this
patch carries no source change beyond the version bump.
Verified after this tag: dist/index.d.ts contains JudgeScoresRecord,
dist/index.js propagates outcome.judgeScores end-to-end via
recordOutcome.judgeScores, and a downstream pnpm install @tangle-network/agent-eval@0.31.1 types-clean against the shape
documented in 0.31.0.
Multi-judge consumers (forge-chat in agent-builder, and four sibling
product agents on the same trajectory) compute per-judge per-dimension
scores per cell, then collapse to a single composite for the gate. The
substrate's RunOutcome only had a slot for the composite plus a free
raw: Record<string, number> bag. Consumers were either dropping the
breakdown on the floor or smuggling it through stringly-typed raw
keys like judge_kimi_helpfulness — neither survives a corpus-IRR run
(0.27.2's corpusInterRaterAgreement expects structured per-judge
per-dim records, not parsed strings).
This release ships the typed slot so every product agent speaks the same shape, and the inter-rater primitives consume it without a per-consumer adapter.
JudgeScoresRecord(src/run-record.ts) —perJudge[judgeId][dim]is the canonical store;perDimMeanandcompositeare precomputed projections so reporters and IRR primitives don't repeat the aggregation;failedJudges?: string[]records dead-judge ids explicitly (no inferring partial-failure from missing keys);notes?: stringcarries panel prose.RunOutcome.judgeScores?: JudgeScoresRecord— optional. Single- judge or scalar-only runs leave it unset; ensemble runs populate it.CampaignRunOutcome.judgeScores?: JudgeScoresRecord— runners return it on the per-cell outcome;runEvalCampaignthreads it onto the resultingRunRecord.outcome.judgeScoreswithout coercion.
validateRunRecord validates outcome.judgeScores when present.
Every perJudge[judge][dim] and every perDimMean[dim] and the
composite must be finite numbers — the NaN-as-silent-zero bug class
banned by CLAUDE.md cannot pass the boundary. failedJudges must be
an array of non-empty strings; notes must be a string. Round-trip
tested in tests/run-record.test.ts.
A judge that throws lands in failedJudges by id, not a silent zero
in perJudge. The composite is computed over surviving judges only;
the partial-failure signal is preserved through to the gate.
tests/eval-campaign.test.ts covers the four shapes (full, partial,
missing, with notes) plus an explicit fail-loud case where one judge
throws and the run record carries failedJudges: ['glm-5.1@...'].
tests/consumer-contract.test.ts pins JudgeScoresRecord as a
type-level export at the root entry. The 0.30.0 surface is preserved —
the new field is additive on RunOutcome and the new type is a new
export, so existing consumers stay green.
Builds on 0.28.0's analyst registry. Ships four trace-analyst kinds that emit graded findings through native Ax structured output (no more flat-defaulted bullet lists) and a cross-run findings context the registry can inject into prompts so each kind sees what the prior run already surfaced.
createTraceAnalystKind(spec, opts)(src/analyst/kind-factory.ts) — turns aTraceAnalystKindSpecinto a registry-readyAnalyst<TraceAnalysisStore>. Ax signature is'question:string -> findings:json[]'; the Zod boundary infinding-signature.tsrejects malformed rows instead of lifting them with default severity. SupportsversionSuffixfor optimizer-fitted prompts (MIPRO / GEPA / Bootstrap) and a per-rowpostProcesshook.RawAnalystFindingZod schema +RAW_FINDING_SCHEMA_PROMPTstring embedded into kind actor prompts so the model and the parser share one source of truth.TraceToolGroupName+buildTraceToolsForGroup(src/analyst/tool-groups.ts) — five named tool subsets (all | discovery | discoveryAndRead | discoveryAndSearch | targeted); unknown group names throw.- Four shipping kinds (
src/analyst/kinds/):FAILURE_MODE_KIND_SPEC— clusters dataset failures into distinct modes (maxDepth 3, parallel 4, all tools).KNOWLEDGE_GAP_KIND_SPEC— attributes missing/stale knowledge toagent-knowledge:wiki:*,websearch:outdated:*,tool-doc:*,system-prompt:*,memory:*(maxDepth 2, discoveryAndSearch).KNOWLEDGE_POISONING_KIND_SPEC— dual-verify analyst for confident-but-wrong actions (maxDepth 2, all tools).IMPROVEMENT_KIND_SPEC— converts upstream failure / gap / poisoning findings into concrete locus-named edits with leverage grades (maxDepth 3, all tools).
DEFAULT_TRACE_ANALYST_KINDS— the four specs in canonical run order (failure-mode → gap → poisoning → improvement).priorFindingsonAnalystContext— registry injects findings from a priorAnalystRunResultinto every analyst's context, so an improvement-kind run can see the failure-mode findings the previous pass surfaced. Kinds reference prior findings viaevidence_uri: "finding://<id>".
createTraceAnalystAdapter(src/analyst/adapters.ts) — the legacy bullet-list lifter. Kept for one minor while consumers migrate tocreateTraceAnalystKind.
A generic, model-agnostic orchestration layer over the existing
analyzers (analyzeTraces, MultiLayerVerifier, RunCritic,
SemanticConceptJudge, JudgeFn). One contract, one runner, one
persistence path. Reusable by VB operator bench, leaderboard submission
pipeline, and orchestrator on-completion reports with the same code.
Analyst<TInput>contract +AnalystFindingenvelope with sha-stablefinding_id(src/analyst/types.ts).AnalystRegistry(src/analyst/registry.ts) — register/list/run with input routing byinputKind, per-analyst isolation, equal-split budget by default, per-analyst telemetry.AnalystHooks—onBeforeAnalyze | onAfterAnalyze | onError | onComplete. Generic seam for telemetry, cost ingestion, rotation, error → finding conversion.BudgetPolicy—{ totalUsd, weights, allocate }. Default equal-split; weighted split or customallocate(args)for precision.ChatClientabstraction (src/analyst/chat-client.ts) overrouter | sandbox-sdk | cli-bridge | direct-provider | mockso analyst code is transport-agnostic;wrapLlmClientraces the call againstChatCallOpts.signal.FindingsStore+diffFindings(prev, cur, { isMaterial })(src/analyst/findings-store.ts) — locked JSONL persistence + cross-run diff (appeared / disappeared / persisted / changed) with a pluggable materiality predicate (defaultIsMaterialexported for layering).- Five adapter factories (
src/analyst/adapters.ts) that lift existing primitives into the contract without re-implementing them:createTraceAnalystAdapter,createVerifierAdapter,createRunCriticAdapter,createJudgeAdapter,createSemanticConceptJudgeAdapter.
interRaterReliability(JudgeScore[][]) measures Krippendorff α within
a single item — multiple judges rate the same scenario, how much do
their scores cluster? That answers "is this one judgement contested?"
It does not answer "is this judge panel reliable across the whole
evaluation corpus?" — the question the five product consumers actually
need before trusting a multi-judge composite over 100+ scenarios.
This release ships the corpus-wide companion. It does not touch the existing primitive: the within-item α and the corpus-wide ICC are different formulas with different domains of validity.
corpusInterRaterAgreement(records, opts?)(src/statistics.ts) — takes a flat list of{itemId, judgeName, dimension, score}records. For each dimension, pivots to the [n_items × n_judges] matrix of items every judge rated and delegates tocontinuousAgreement(ICC(2,1) + κ_w + Pearson + Spearman + bootstrap CIs from 0.26.0). An overall pooled mean across dimensions gives one "is the panel reliable on this corpus?" number.corpusInterRaterAgreementFromJudgeScores(itemsScores, opts?)— adapter for consumers that already hold per-itemJudgeScore[]arrays (e.g.ScenarioResult.judgeScores) and want to skip manual flattening.- New exported types:
CorpusScoreRecord,CorpusAgreementOptions,CorpusAgreementPerDimension,CorpusAgreementReport.
Per CLAUDE.md "no silent fallbacks": the primitive throws
ValidationError on empty input, fewer than 2 judges, fewer than 2
items rated by every judge on a given dimension, a judge with zero
items on a dimension (would silently shrink the matrix and corrupt the
overall metric), duplicate (itemId, judge, dimension) records, or any
non-finite score. There is no quiet-NaN path.
tests/consumer-contract.test.ts pins both new exports. The 0.27.0
surface is preserved — no rename, no signature change on the existing
interRaterReliability.
sandbox-harness.ts— the timeout-drivenSIGKILLpreviously sat inside an empty} catch {}. A failed kill would vanish from logs. It now surfaces viaconsole.warnwith full error context while preserving teardown semantics (the timer already fired; the subprocess is being terminated).control-runtime.ts— documented theControlRunResult.runId: string | nullcontract at the type declaration. The 18 sites that coerceemitter?.runIdtonull(one per terminal return path) are typed-contract conversions, not silent fallbacks:nullmeans "the run executed without aTraceEmitterwired and no run record was persisted." Type-level docs end the recurring "is this a bug?" review. (Three sibling?? nullcoercions on the same returns —actionCostUsd,scoreBefore,scoreAfter— are likewise typed-optional span attributes documented at their declaration sites.).gitignore— addeddata/(local dev session storage).tests/consumer-contract.test.ts— pins the runtime symbols that the five product-agent consumers (tax/creative/legal/gtm/agent-builder) import from@tangle-network/agent-eval. The full set of types is validated at compile time via the namespace import; runtime classes and functions are exhaustively asserted. Any removal/rename of a load-bearing export now fails this test before shipping.
Today's tax + gtm evals shipped composites where the judge LLM silently
aborted (verbose new prompts streamed past the 60s default timeout) and
the per-trial score collapsed to 0. The composite formula then weighted
that zero into the mean, producing a "−27pp tax regression" that was
actually a measurement-instrument failure, not a prompt regression.
This release adds three substrate primitives so consumers can stop silent-zeroing their own data:
withJudgeRetry(judgeFn, policy)— wraps any judge call with retry on transient failures (Abort, Timeout, fetch failed, 429/502/503/504), optional fallback-model rotation, and a typed outcome (succeeded,attempts,value,error). Refuses to default to a silent zero.aggregateTrialsByMode(trials, { mode })—'exclude-failed'mode drops trials withjudgeSucceeded === falsefrom the mean so a failed judge doesn't corrupt the composite.'strict-fail'mode refuses the aggregate when any judge failed.'zero-fill'preserves legacy.discoverPersonas(dir, opts)— replaces every consumer's hardcodedTRAINING_PERSONA_FILESconstant. New personas on disk are picked up automatically; consumers can filter via include/exclude patterns.
Additive to TrialResult: judgeSucceeded?, judgeAttempts?, judgeError?
fields. Existing adapters that don't set these continue to work
unchanged via 'zero-fill' mode (default for back-compat).
The original calibrateJudge rounded scores to ints before computing
Cohen's κ. For fine-grained judges that's lossy — 0.78 vs 0.81 both
round to "1" and the integer κ pretends they agreed perfectly when they
actually disagree by 3 percentage points. This release ships principled
continuous-value agreement metrics so calibration findings become
quantitative for [0,1]-valued judges.
-
continuousAgreement(scores, opts?)(src/judge-calibration.ts) — inter-rater agreement on continuous scores. Returns:weightedKappa— Cohen's κ_w with quadratic (or linear) weights on raw scores, no quantisation.icc— ICC(2,1), two-way random effects, absolute agreement, single rater (Shrout & Fleiss 1979). The principled reliability coefficient when judges are a random sample of the judge population.pearson/spearman— averaged over rater pairs when N ≥ 2 raters.ci.icc/ci.weightedKappa— bootstrap percentile 95% CIs (defaultn=1000, seeded for reproducibility). Acceptsscores: number[][]shaped[n_items][n_raters]. Rows with non-finite entries are dropped, not coerced.
-
calibrateJudgeContinuous(golden, candidate, opts?)— drop-in superset ofcalibrateJudge. Preserves every legacy field (n,pearson,kappa,mae,worstItems) and addsweightedKappaContinuous,icc,spearman, andci. Use this when the judge produces fine-grained [0,1] scores; keepcalibrateJudgefor the original integer-quantised report.
ICC(2,1) catches systematic bias that Pearson misses. If judge B scores 2× judge A, Pearson stays ≈ 1 (linear association is perfect) while ICC plummets (absolute agreement is poor). The new tests assert this exact failure mode so the regression can't sneak back in.
calibrateJudgekeeps its original integer-rounded κ semantics for backwards compatibility. Nothing else moves.
This release ships the orchestration layer that turns the existing
eval substrate into a continuously-improving production system. Static
prompts decay; today's regulation flips tomorrow. The pieces to close
the loop were already in the package (runMultiShotOptimization,
failureClusterView, evaluateReleaseConfidence, extractPreferences,
FeedbackTrajectoryStore, TraceStore); this release adds the one
clean primitive that wires them together end-to-end.
-
runProductionLoop({ ... })(src/production-loop.ts,@experimental) — one call = one cycle. Ingests production traces and feedback, clusters failures, runs evolve against the worst cluster, gates withHeldOutGate+evaluateReleaseConfidence(fail-closed), and — when wired with anAutoPrClient— opens a PR with the improved prompt. Idempotent + replayable: samerunIdyields the same plan. Cron / GitHub Actions are the consumer's job; the primitive doesn't own scheduling. -
proposeAutomatedPullRequest(client, input)+ two transports (src/auto-pr.ts,@experimental):httpGithubClient({ token, ... })— direct REST againstapi.github.com, no extra deps. Idempotent on branch name: existing open PRs are returned, not duplicated.ghCliClient({ ... })— shells out toghfor environments where developer auth state is already configured. Both validate inputs (no..paths, no whitespace branches, no duplicate file changes) and surfaceValidationError/ConfigErrorfrom the typed taxonomy.
-
POST /v1/feedback+POST /v1/traces/ingestwire endpoints (src/wire/). Both Zod-validated, both append to the configured store (FeedbackTrajectoryStore/TraceStore). 503 when no store is wired (fail loud, not silent). Traces ingest accepts bothapplication/json({events:[...]}) andapplication/x-ndjsonfor streaming production runtimes. Schemas (TraceEvent,FeedbackTrajectory,TracesIngestRequest/Response,FeedbackIngestResponse) added toopenapi.jsonfor cross-language clients. -
Optional bearer-token auth on the wire server, configured via
createApp({ auth: { bearer: '...' } })or as a verifier function for rotating tokens./healthzand/v1/versionremain unprotected (regression: never lock monitoring out of the runtime). -
examples/production-loop/— synthetic end-to-end demo wiring the loop against in-memory trace + feedback stores and a fake auto-PR client. Shows the failure-cluster trigger, the evolve round, the gate verdict, and the PR-shaped output without requiring credentials or a live model.
- Wire server (
createApp(opts)) now accepts optionalIngestionStores({ traceStore?, feedbackStore? }) andauth. Existing zero-arg callers continue to work — judge / rubrics / version / healthz are unchanged.
- Every new export is
@experimentalinitially. Pin the patch version if you depend on it. All other 0.24.0 stability tags are preserved.
This release is DX + correctness. No production behavior moved; consumer contracts tightened across the board. Library went from 7.5/10 to 10/10 on first-touch usability and contract clarity. The visible deltas:
noUncheckedIndexedAccess: trueintsconfig.json. 251 latentT | undefinedsites surfaced and fixed across ~70 files. Loop-bound indices documented with!, external lookups guarded explicitly, accumulator patterns refactored to capture-then-assign. Every fix audited for semantic correctness (math code:!; untrusted data: guards).- Subpath imports forced. Six
export * from './X'wildcards at root deleted (./rl,./pipelines,./builder-eval,./meta-eval,./prm,./trace-analyst). New subpaths inpackage.json:/pipelines,/meta-eval,/prm,/builder-eval,/governance,/knowledge. Root re-exports retained only for the load-bearing capture-integrity surface (./trace,./knowledge,./governance). - Error taxonomy. New
src/errors.tsexportsAgentEvalErrorbase plusValidationError,NotFoundError,ConfigError,CaptureIntegrityError,JudgeError,VerificationError,ReplayError. Existing custom errors re-parented:ReplayCacheMissError,BudgetBreachError,RunIntegrityError,HoldoutLockedError,RunRecordValidationError,LlmCallError,LlmRouteAssertionError,TraceFileMissingError,TraceNotFoundError,SpanNotFoundError. ~25 user-facingthrow new Error(...)calls migrated to typed errors acrossrl/*,replay,sandbox-harness,statistics,release-confidence,visual-diff,counterfactual,run-critic,observability. Internal invariant guards intentionally left as plainError— those are bugs, not contract failures. LlmRouteAssertionError.code→reason(breaking, greenfield). The subclass's route-specific reason now lives on.reason; the base categorycode = 'capture_integrity'survives via theAgentEvalErrorcontract.
-
README reframed as the substrate for self-improving agents. The package has shipped
EvalCampaign, replay, GEPA / reflective mutation, auto-research, active curriculum, contamination probes, tournaments, compute curves, PRM, off-policy estimators, and sequential anytime-valid stats since 0.22 — the README now actually names them, not just "evaluation infrastructure." -
src/rl/index.tscarries stability markers — every re-export is tagged@stableor@experimentalvia JSDoc. Stable:run-record-adapters,verifiable-reward,preferences,off-policy,tournament,contamination,compute-curves. Experimental:process-reward,adversarial,active-curriculum,reward-hacking,adaptation-eval,exporters,rl-campaign,predictive-validity-researcher,auto-research. Tags are visible in IDE hover and emitted intodist/rl.d.tsso consumers can see the contract at the call site.
- Biome lint + format —
biome.jsoncodifies the project style (no semicolons, single quotes, 2-space indent, 100 col,noNonNullAssertionoff,useNodejsImportProtocolon).pnpm lintandpnpm formatscripts. .github/workflows/ci.yml— runs typecheck + lint + test + build + Python pytest on every PR. Previously only the publish workflow on tag push exercised this surface; PRs were unguarded.ReplayCache.entries()— public iterator for the cached(request, response)pairs. Replaces the bracket-access escape hatch into the privatebyKeymap. Same semantics, exposed in the type contract.- Per-example READMEs —
examples/multi-shot-optimizationandexamples/same-sandbox-harnessnow document what they show, how to run, expected output, and adaptation guidance. The other three examples already had READMEs; the README index now links to all five. clients/python/examples/judge_anti_slop.py— runnable script that doubles as a pytest, anchoring thejudgeAPI contract: composite in[0, 1],RubricNotFoundErrorfor bogus rubric name,ValidationErrorfor no-rubric call.
reflective-mutation.ts— localescapevariable shadowed the globalescapeproperty. Renamed toescaped. No behavior change; flagged by biome.
-
FileSystemTraceStore.updateRun/updateSpan— once the lazy in-memory index had been populated (by any priorgetRun/listRuns/spans/eventsquery), anupdateRunwould mirror the synthetic update row back into the index viaappendRun, throwingrun X already exists. Same root cause forupdateSpan, which would silently insert a phantom duplicate span row. Theappend()helper now skipsinsertIntofor rows carrying the internal_update: truemarker;updateRun/updateSpancontinue to apply the patch directly via the index'supdateRun/updateSpanAPIs.Surfaced by tax-agent's canonical eval running multiple variants per persona against a shared store: the second variant's
endRunconsistently threw, forcing callers to instantiate one store per (persona × variant) cell and stitch results back together post-hoc. After this fix, a singleFileSystemTraceStorecan fan out runs across arbitrarily many cells with interleaved reads, which is the intended usage pattern. Regression test added intests/trace-store.test.ts.
In addition to the RL bridge primitives below, this release ships the canonical worked example of the auto-research loop end-to-end against agent-builder, plus a concrete prime-rl SFT integration. The auto-research thesis — capture → score → preferences → mutate → improved candidate — is now demonstrably real, not aspirational.
examples/auto-research-with-agent-builder/— runnable demo of the closed loop: a synthetic agent-builder driver iterates 4 generations of prompt variants, with each generation's runs feedinganalyzeOptimizationResultfor preferences + reward-hacking + sequential verdict, and the next generation proposed via a deterministic mutator. The demo shows score climbing from 0.739 → 0.973 over 4 iterations on the synthetic environment. Real-driver mode (replace the synthetic runner withrunForgeBuilderSimfromagent-builder) is documented inline.examples/fine-tune-with-prime-rl/— concrete integration with Prime Intellect's prime-rl SFT trainer. ReadsRunRecord[](NDJSON), filters to high-quality runs, projects viatoSftRowsto messages-list JSONL, writes a 15-line prime-rl SFT TOML config, prints the runnable command. ~150 LoC of glue. SFT was chosen as the first integration because it's the cleanest fit between agent-eval's exporters and prime-rl's entrypoints (DPO/PRM go to TRL; offline GRPO requires a custom verifiers env — both called out in the README).docs/three-package-architecture.md— the contracts between agent-eval, agent-knowledge, agent-runtime. Dependency direction (both consume agent-eval; agent-eval imports neither), shared data interchange (RunRecord, Scenario, KnowledgeBundle), and known contract gaps tracked as follow-ups.docs/auto-research-loop-end-to-end.md— the runnable composition pattern with the explicit invariants every iteration must preserve (canonical RunRecord with scenarioId, capture wired by construction, stable comparator, deterministic mutator).
0.22 made eval rigorous and integrated; 0.23 closes the loop back to RL training. The package now ships the canonical primitives a working RL-on-LLM-agents team needs — verifiable rewards, preference extraction, off-policy evaluation, process reward scaffolding, contamination probing, Bradley-Terry / Elo tournaments, adversarial scenario search, and test-time compute scaling — all designed to consume the standardised RunRecord artifact 0.22 produced. The auto-research loop is now coherent end-to-end.
A single subpath for every RL-shaped primitive, importable as a unit. The 9 modules:
-
run-record-adapters.ts— convertTrialResult[](fromrunPromptEvolution/runMultiShotOptimization),VerificationReport(fromMultiLayerVerifier), andVariantAggregateinto canonicalRunRecord[]. Closes the integration gap between the pre-0.22 optimization stack and the post-0.22 campaign artifact. Existing optimization runs becomereplayCache-able andrubricPredictiveValidity-scorable for free. -
verifiable-reward.ts— extract a cleanVerifiableRewardfromVerificationReportorRunRecord. Distinguishes'deterministic'(compile, test, schema, sandbox) from'probabilistic'(judge) reward sources. The seam every credible 2025-2026 frontier RL result on coding agents leans on (DeepSeek-R1 GRPO on test pass-rate, AlphaProof on Lean kernel checking). -
preferences.ts—extractPreferences(runRecords)produces DPO/PPO/KTO-shape(chosen, rejected)triples with three documented strategies (paired-by-scenario-and-seed,paired-by-scenario,top-vs-bottom). Bridge from campaign artifact to RL training. IncludestoTRLFormatandtoAnthropicFormatadapters. -
off-policy.ts— IPS, SNIPS, doubly-robust off-policy estimators (Dudík–Langford–Li 2011 for DR, Owen 2013 for SNIPS SE). Caller supplies behavior + target propensity scores (typically from token log-probs). All three return matched-shapeOffPolicyEstimatewith effective-sample-size and max-importance-weight diagnostics.offPolicyEstimateAllruns all three side-by-side — agreement across estimators is a much stronger signal than any one alone. -
process-reward.ts— step-level credit assignment from trace spans.extractStepRewards(store, runId, scorers)producesStepReward[];prmTrainingPairs(stepRewardsByRun)produces(prefix, chosen_step, rejected_step)triples in the canonical Lightman et al. / DeepSeek-R1 process supervision shape. We ship the data extraction, not the trainer — gradient descent over a transformer is out of scope for a TS package. -
contamination.ts— held-out perturbation contamination probe.runContaminationProbe({ originals, perturbation, scoreFn })runs the policy against original + perturbed scenarios, computes paired Wilcoxon on the deltas, and flags suspected contamination when median drop ≥ 5pp at p < 0.05. Stock perturbations:renameVariables,shuffleOrder,injectIrrelevantClause. Catches the SWE-Bench → SWE-Bench-Verified failure mode upstream. -
tournament.ts—fitBradleyTerry(outcomes)uses Hunter's MM algorithm to recover candidate strengths from pairwise outcomes;applyEloUpdate(ratings, outcome)for online updates with FIDE-style K-factor.buildPairwiseFromCampaignextracts pairwise outcomes from per-scenario campaign runs. Sample-efficient ranking for many-candidate sweeps; the methodology Chatbot Arena and AlpacaEval converged on. -
adversarial.ts—adversarialScenarioSearch({ seeds, mutations, scoreFn })actively searches for inputs the policy fails on. Hill-climb-against-failure-indicator loop (the simplest version of AdA / POET / auto-jailbreak rigs). Caller supplies mutation strategies; the harness deduplicates, budgets, and reports per-generation statistics. -
compute-curves.ts— characterize a candidate as a curve across compute budgets, not a point.runComputeCurveproduces(cost, score)points + log-slope.bestOfN,selfConsistencyare the canonical test-time-scaling primitives (Snell et al. 2024).paretoFrontierremoves dominated (candidate, compute) combinations. Required for honest cost-quality reporting in the o1-era.
The 9 modules above are stable and tested. The following modules are also shipped under @tangle-network/agent-eval/rl as experimental — interfaces are reasonable but may evolve based on real production consumer feedback. Marked clearly in the barrel docstring; flagged here so consumers know the contract may shift.
-
active-curriculum.ts— adaptive scenario allocation.varianceBasedCurriculum(Neyman 1934 optimal allocation: weight ∝ √variance + 1/√n for under-sampled-cell tie-break) andthompsonCurriculum(Beta-Bernoulli posterior + decision-threshold-weighted sampling) reallocate next-round budget toward cells whose outcome is uncertain. -
reward-hacking.ts—detectRewardHacking({ runs, truthOf })watches four signature signals (proxy-vs-truth divergence, distributional shift, reward disagreement between independent rewards, judge drift relative to deterministic reward) and returns a structured'clean' | 'suspect' | 'gaming'verdict with per-signal severity. Krakovna et al. + Skalse et al. 2022 + Kim et al. 2023 lineage. -
adaptation-eval.ts—runAdaptationCurveandcompareAdaptationCurvesfor sample-efficient adaptation evaluation. The metric a foundation-model-based agent should be measured on isn't end-state performance but the curve of score vs k (k=0, 1, 2, 4, 8, 16 demonstrations). Returns area-under-curve summary + per-k bootstrap CIs. -
exporters.ts— trainer-format export functions.toDpoRows(HuggingFace TRL DPO/IPO/KTO format),toGrpoRows(offline GRPO{prompt, completions[], rewards[]}),toSftRows(TRL/prime-rl SFT messages list),toPrmRows(Lightman-style PRM training shape),stepRewardsToJsonl(step-level rewards for value-function regression). Honest scope:toSftRowsis the only one that maps directly onto a prime-rl entrypoint; the others target TRL or custom trainers — seeexamples/fine-tune-with-prime-rl/README.mdfor the explicit fit table. -
rl-campaign.ts—runRLCampaign(opts)wrapsrunEvalCampaignand runs the full RL bridge (verifiable rewards + preferences + sequential interim verdict + reward-hacking + optional predictive validity + optional trainer export) in one call. The single top-level orchestrator the pre-0.23 audit panel called out as missing. -
auto-research.ts—analyzeOptimizationResult({ result, ctx, comparator })takes aPromptEvolutionResultorMultiShotOptimizationResult(the existing GEPA/AxRLM stack outputs) and runs the same RL bridge on top, producing a unified artifact. Closes the architectural fragmentation between the optimization primitives and the RL bridge. -
predictive-validity-researcher.ts—PredictiveValidityResearcheris a concreteResearcherinterface implementation (the interface had been a placeholder +NoopResearcheruntil now). Drives steering changes from outcome-anchored predictive validity: rubrics that don't predict deployment outcomes get down-weighted; load-bearing rubrics get up-weighted. -
run-record.ts—RunRecord.scenarioIdis now an optional canonical field (was previously inferred fromoutcome.raw.scenario_id). Populated automatically byrunEvalCampaignand the optimization adapters; legacyRunRecord[]arrays without it fall back to theoutcome.raw.scenario_idconvention. Closes the fragility called out by the 0.23 audit.
- New build entry:
dist/rl.{js,d.ts}exposed via the@tangle-network/agent-eval/rlpackage subpath. - All RL primitives also re-exported from the root barrel for ergonomic single-import use.
- Default
BradleyTerrysmoothing raised from 0 to 0.1 — Hunter's MM degenerates when a candidate has zero wins; 0.1 keeps the iteration well-conditioned without meaningfully biasing real win counts.
The previous release shipped EvalCampaign + replay + sequential + outcome calibration as parallel infrastructure to the existing optimization primitives. That left a real gap: runMultiShotOptimization and runPromptEvolution produced their own trial shapes that didn't compose with the new artifacts. 0.23 closes that gap with the adapter layer, and ships the eight downstream primitives that turn the unified artifact into RL training data, OPE estimates, contamination probes, tournament rankings, adversarial scenarios, and compute curves.
After 0.23, the auto-research loop is coherent end-to-end:
mutate (existing primitives)
→ trial outcomes (TrialResult)
→ adapter (run-record-adapters)
→ RunRecord[] (canonical artifact)
→ preferences / verifiable rewards / OPE / step rewards
→ policy update (consumer's choice of TRL / GRPO / PPO / DPO)
→ next sweep
- Dudík, M., Langford, J., Li, L. (2011). Doubly Robust Policy Evaluation and Learning. ICML.
- Owen, A. B. (2013). Monte Carlo Theory, Methods and Examples. Ch. 9 — Importance Sampling.
- Hunter, D. R. (2004). MM algorithms for generalized Bradley-Terry models. Annals of Statistics, 32(1), 384–406.
- Bradley, R. A., Terry, M. E. (1952). Rank analysis of incomplete block designs. Biometrika, 39(3/4).
- Lightman, H. et al. (2023). Let's Verify Step by Step. arXiv:2305.20050.
- Snell, C. et al. (2024). Scaling LLM Test-Time Compute Optimally. arXiv:2408.03314.
- Plus the foundational citations from 0.21 / 0.22.
All 0.23 primitives are additive. Existing consumers don't need to change. Recommended adoption sequence:
- Add
trialsToRunRecords(trials, ctx)after every existing optimization sweep — every old run becomes replay-able and predictive-validity-scorable for free. - Wire
extractVerifiableRewardinto your scoring pipeline; route deterministic and probabilistic rewards into separate training batches. - Use
extractPreferencesto produce DPO/PPO triples for any RL training the consumer runs. - Run
rubricPredictiveValidityquarterly +runContaminationProbeper release to keep the rubric weights honest. - Replace fixed-comparator HeldOutGate with
fitBradleyTerryonce you have ≥ 5 candidates running on shared scenarios. - Replace single-budget evaluation with
runComputeCurvefor any candidate where compute scaling is a question.
- The DR estimator's Q-function is caller-supplied. We don't ship a learned Q-function trainer — that's a regression problem with too many domain-specific choices to ship a default.
- PRM training itself (gradient descent over a transformer) is out of scope; we ship the data extraction shape.
- The contamination probe's per-scenario q-values use a heuristic pseudo-p (the load-bearing test is the global Wilcoxon).
prmTrainingPairsmatches trajectories by step name + kind; production use should replace this with a token-level prefix hash.- Adversarial scenario search is a simple hill-climb; novel scenario synthesis (compositional, language-model-driven) is future work.
0.21 shipped the four capture-integrity primitives as opt-in. Every consumer still had to wire them by hand, and the bug class blueprint-agent reported (forgotten wiring → silent partial-capture) reappears the moment a new consumer adopts agent-eval cold. 0.22 makes the right thing the default path — and adds three primitives that compound on top of standardized capture: replay-from-raw-events, anytime-valid sequential evaluation, and rubric predictive validity. The four primitives together turn agent-eval from a TS framework into research-grade evaluation infrastructure.
Opinionated matrix runner that wires the four directives by construction. Inputs: variants, scenarios, seeds, an LlmClientOptions, factories for TraceStore and RawProviderSink, and a runner(ctx) callback. Outputs: per-cell RunRecord[], RunIntegrityReport[], optional researchReport, and a campaign fingerprint.
- Preflight:
assertLlmRouteis called once before any work, with{ requireExplicitBaseUrl: true, requireAuth: true }defaults. Misconfigured routes never burn a run. - Per run: the campaign constructs the
TraceStore,RawProviderSink, andTraceEmitter(withonRunCompletehooks attached), then hands the runner anLlmClientOptionsalready pre-wired withrawSink+traceContext. The runner cannot accidentally call an LLM without capture. - Run-completion:
assertRunCapturedruns after everyendRunwith{ llmSpansMin: 1, requireRawCoverageOfLlmSpans: true, requireOutcome: true }defaults. Failures are routed viaonIntegrityFailure: 'throw' | 'mark_failed' | 'log'(default'mark_failed'). - End of campaign: if
report.comparatoris set, computesresearchReportover the collectedRunRecords and embeds the campaign fingerprint +preregistrationHash. - Concurrency: local async worker pool, default 1, configurable via
concurrency. - Determinism: the default
runIdgenerator is a stable hash of(campaignId, variantId, scenarioId, seed), so re-running the same campaign produces the same ids; overriderunIdfor non-deterministic generation.
Exported from the root barrel and the @tangle-network/agent-eval/optimization subpath: runEvalCampaign, CampaignRunner, CampaignRunContext, CampaignRunOutcome, CampaignVariant, CampaignScenario, EvalCampaignOptions, EvalCampaignResult, FailedRun, CampaignIntegrityPolicy, CampaignFactoryParams.
Every campaign run is now a re-runnable artifact. ReplayCache.fromSink(sink) turns a populated RawProviderSink into a deterministic (canonicalised request → cached response) map; createReplayFetch(cache) returns a fetch-shaped function that satisfies /chat/completions calls out of the cache and passes other URLs through.
const cache = await ReplayCache.fromSink(yesterdayRawSink)
const replayFetch = createReplayFetch(cache, { onMiss: 'fail-closed' })
await callLlm(req, { ...llmOpts, fetch: replayFetch }) // zero LLM costUse cases:
- Post-hoc judging — apply a new judge or scorer to last week's runs without burning a single token.
- Determinism audits — replay a campaign and verify the responses match byte-for-byte.
- Free judge calibration — run two judges on identical responses and measure agreement.
onMiss is 'throw' | 'fallback' | 'fail-closed'. The cache hashes a canonical projection (model + messages + temperature + max_tokens|max_completion_tokens + response_format) so insertion-order quirks don't cause spurious misses.
Exported from root and @tangle-network/agent-eval/traces: ReplayCache, createReplayFetch, iterateRawCalls, ReplayCacheEntry, ReplayCacheStats, ReplayFetchOptions, ReplayCacheMissError.
pairedEvalueSequence(deltas, opts) and evaluateInterimReleaseConfidence({ deltaSeries }) ship the predictable plug-in betting martingale of Waudby-Smith & Ramdas (2024) for paired bounded outcomes, plus the empirical Bernstein confidence sequence of Howard et al. (2021) for the running mean. Both are anytime-valid — type-I error is bounded by α at every stopping time, no peeking penalty.
const verdict = evaluateInterimReleaseConfidence({
deltaSeries: [{ candidateId: 'cand', deltas }],
alpha: 0.05,
rope: { low: -0.02, high: 0.02 },
})
// → { recommendation: { decision: 'promote_now' | 'continue' | 'reject_now' | 'equivalent', candidateId } }This closes the methodological hole flagged in the 0.21 methodology doc as out-of-scope. Consumers running rolling campaigns can now ship the moment evidence is decisive, stop-early on dead-on-arrival variants, and accumulate evidence across partial runs without spending the FDR budget. Tested under-the-null at α=0.05 on 100 synthetic series; false-rejection rate stays below the bound.
Exported from root and @tangle-network/agent-eval/reporting: pairedEvalueSequence, evaluateInterimReleaseConfidence, PairedEvalueOptions, PairedEvalueSequence, PairedEvalueStep, InterimReleaseConfidence, InterimReleaseConfidenceInput, SequentialDecision.
rubricPredictiveValidity({ runs, outcomes, outcomeMetrics }) joins canonical campaign RunRecords to a DeploymentOutcomeStore and reports per-rubric Pearson + Spearman + bootstrap CI against each outcome metric. Verdict bucketing: 'load_bearing' | 'informative' | 'decorative' based on |spearman|. Without this loop every rubric is faith-based; with it, you know which rubrics earn their promotion power and which are decoration.
const validity = await rubricPredictiveValidity({
runs: lastQuarterRuns,
outcomes: shipFlagOutcomeStore,
outcomeMetrics: ['revenue_lift', 'retention_30d', 'csat'],
})
for (const r of validity.ranked) {
console.log(`${r.rubric} → ${r.bestOutcome}: ρ=${r.spearman.toFixed(2)} (${r.verdict})`)
}Builds on the existing correlationStudy primitive but works directly off RunRecord (the canonical campaign artifact) rather than Run from a TraceStore, so it composes cleanly with runEvalCampaign's output. Returns a per-rubric ranking + every (rubric, outcome) pair tested + a list of rubrics that produced no usable data.
Exported from root and @tangle-network/agent-eval/reporting: rubricPredictiveValidity, RubricOutcomePair, RubricRanking, RubricPredictiveValidityInput, RubricPredictiveValidityReport. The existing correlationStudy, OutcomeStore, InMemoryOutcomeStore, FileSystemOutcomeStore continue to work unchanged.
Explicit opt-out from capture is no longer flagged by assertRunCaptured as no_raw_sink. Opt-out remains a deliberate choice; the campaign still requires the matching integrity overrides.
Every consumer that adopted agent-eval before 0.22 wrote their own matrix runner, and every one of them re-introduced the same forgettable wiring (raw sink, route guard, integrity assertion, analyst hook). 0.21 documented the pattern; 0.22 owns it. The four new primitives compound:
runEvalCampaignstandardises the artifact (RunRecord+ raw events + fingerprint).- Replay turns every past run into free training/validation data for new judges.
- Sequential evaluation makes "ship-when-evidence-says-so" mathematically defensible.
- Predictive validity converts evals from belief-based to outcome-anchored.
runMultiShotOptimization remains the right primitive for trajectory-shaped GEPA optimization sweeps; runPromptEvolution for prompt + code evolution loops with sandbox pools; runEvalCampaign for the "compare N variants on M scenarios with K seeds and tell me which to ship" case that makes up the bulk of consumer evals.
- Howard, S. R., Ramdas, A., McAuliffe, J., Sekhon, J. (2021). Time-uniform, nonparametric, nonasymptotic confidence sequences. Annals of Statistics, 49(2), 1055–1080.
- Waudby-Smith, I., Ramdas, A. (2024). Estimating means of bounded random variables by betting. JRSS B, 86(1), 1–27.
Existing consumers do not need to change. All four primitives are additive. Recommended path: on the next eval-runner refactor, replace hand-rolled matrix loops with runEvalCampaign. Use evaluateInterimReleaseConfidence for any campaign you run on a recurring cadence. Wire rubricPredictiveValidity once you have ≥ 30 deployment outcomes joinable by runId. Replay is a free win — once campaigns are running, every eval R&D loop drops to CPU-bound.
This release closes the layer-1 gap a downstream consumer surfaced: better post-run statistics don't help if the underlying data wasn't captured. 0.21 adds first-class raw provider-event capture, a fail-loud route guard, a run-completion integrity check, and run-complete hooks (with a trace-analyst auto-execution helper) so a direct matrix run produces complete forensics without out-of-band glue.
RawProviderSink(capture). First-class persistence for HTTP-level provider request / response / error payloads alongside the structuredLlmSpan.InMemoryRawProviderSink,FileSystemRawProviderSink(NDJSON, rolls at 32 MiB), andNoopRawProviderSinkship in core. Default redactor stripsAuthorization/X-Api-Key/Cookieheaders and credential-shaped body fields (apiKey,bearer,password,secret,token); redacted paths are recorded onevent.redactedFieldsso a reviewer can see what was stripped without exposing values. Wired intocallLlmviaLlmClientOptions.rawSink— every retry attempt produces arequestand either aresponseorerrorevent with the attempt index attached.assertLlmRoute(route guard). Pure function that throwsLlmRouteAssertionErrorwhen the configured client doesn't match the caller's route requirements:requireExplicitBaseUrl,allowedBaseUrls,blockedBaseUrls,requireAuth,expectedProvider. Designed for the matrix-runner preflight — fail loud at the boundary instead of silently falling back to the public/free-tier router.assertRunCaptured(integrity check). Read-only check on(store, runId, expectations)that returns a structuredRunIntegrityReportwith issue codes (missing_llm_spans,missing_raw_events,orphan_llm_span,no_raw_sink,missing_outcome, …). Pair with the newrequireRawCoverageOfLlmSpansto assert everyLlmSpanhas a matching rawrequestevent. Use directly or viathrowIfRunIncompletefor strict mode.onRunCompletehooks onTraceEmitter. NewTraceEmitterOptions.onRunCompletearray fires afterendRun/abortRunwith full run context (run id, outcome, status, store, emitter). Errors are swallowed and recorded aslogevents by default; opt into propagation viahookErrors: 'throw'.addRunCompleteHookattaches hooks after construction.traceAnalystOnRunCompletefactory. Drop-in run-complete hook that runsanalyzeTracesafter each run and persists the result. Resolves the "trace analyst never ran on this matrix sweep" complaint by making auto-execution declarative.researchReport— executive research-report layer for coding-vertical benchmark runs (originally landed in #34, elevated in #35). ComposessummaryTable,paretoChart,gainHistogram, held-out gate decisions, and optionalfailureClusterViewoutput into one structured artifact: promote / hold / equivalent / reject / needs-more-data guidance with rationale, risks, next actions, markdown, HTML, and JSON chart specs.- Decisions are made on paired evidence — never on marginal means alone.
- ROPE (Region of Practical Equivalence) supported via the
ropeoption. - Bayesian-bootstrap-style
Pr(Δ>0)andPr(Δ∈ROPE)summaries (Rubin 1981). - Per-candidate minimum detectable paired effect via
pairedMde. - SHA-256
runFingerprintand optionalpreregistrationHashlinking a signedHypothesisManifest. - Embedded methodology +
docs/research-report-methodology.mdcompanion.
pairedMdeinpower-analysis: closed-form minimum detectable paired effect (inverse to the paired-t / sign-rank power formula).
researchReportis async (uses Web Crypto viahashJsonfor the run fingerprint).- Default
researchReport.minPairsis 20 (soft floor); hard floor of 6 is enforced regardless viaRESEARCH_REPORT_HARD_PAIR_FLOOR.
No wire-protocol changes. The new capture / integrity / hook primitives are TypeScript-only; cross-language consumers continue to use the existing RPC surface.
The PyPI distribution renamed from tangle-agent-eval to agent-eval-rpc, and the import path from tangle_agent_eval to agent_eval_rpc. The new name accurately describes the package — it is a thin RPC client over the Node runtime, not a Python re-implementation of the eval logic — and the npm scope (@tangle-network/agent-eval) already provides the namespacing the tangle- prefix was substituting for. No prior PyPI version ever shipped under the old name (Trusted Publisher misconfiguration; see issue #40), so this rename is a clean first publish rather than a migration.
Locked at agent-eval-rpc==0.21.0 to match the npm package.
hashRubricnow recursively sorts nested rubric fields before hashing, so dimension, failure-mode, and win changes alterrubricVersion.- Wire judge handling now validates LLM output before returning it: finite dimension scores, rationale, and known failure/win ids are enforced.
- Control-runtime budgets reject invalid numeric config, and invalid action
costs are omitted from step telemetry instead of leaking
NaN/Infinity. - Knowledge readiness now treats invalid
validUntiltimestamps as stale. - Trace-analyst regex search supports leading
(?i)and stops scanning once bounded match output is reached. - SWE-Bench Lite example wording now reflects the implemented external-grader adapter, with quoted command parsing and timeout coverage.
- Published package contents now include
CHANGELOG.md. - Public docs now use GitHub URLs for repository-only examples and Python client source.
- Publish CI now checks npm, Python package, runtime fallback version, and tag version agree before publishing.
- Initial
runAgentControlLoopobserve/validate failures now report the actual observe/validate error even when trace start/end emission also fails. - Knowledge readiness recommended actions now honor non-blocking gap
acquisition modes such as
ask_user,search_web,query_connector, andinspect_repo. - Npm builds now generate
dist/openapi.json, and the package exports@tangle-network/agent-eval/openapi.json. - Npm and Python client versions are locked at
0.20.9.
CallbackResearcher, a concrete callback-backed implementation of the stableResearcherinterface for scripts, tests, and small integrations.- Public
@tangle-network/agent-eval/benchmarkssubpath for the supported routing benchmark surface. - Root MIT
LICENSE.
- Raw TypeScript examples are no longer included in the npm package; they remain repository examples to read, copy, and adapt.
KnowledgeRequirement.validUntilandlastVerifiedAtfor explicit freshness contracts.scoreKnowledgeReadiness({ now })support for deterministic freshness gates.
- Expired knowledge requirements now score as missing even when confidence and evidence are otherwise high.
- First-class knowledge-readiness contracts:
KnowledgeRequirement,KnowledgeBundle,KnowledgeReadinessReport,UserQuestion, andDataAcquisitionPlan. scoreKnowledgeReadiness,blockingKnowledgeEval,userQuestionsForKnowledgeGaps, andacquisitionPlansForKnowledgeGaps.- Knowledge/data failure classes including
knowledge_readiness_blocked,missing_credentials,bad_retrieval,insufficient_evidence, andcontradictory_evidence. docs/knowledge-readiness.md, plus documented knowledge-related ASI responsible surfaces for multi-shot optimization.
evaluateReleaseConfidence, a conservative release scorecard over corpus coverage, search/holdout run evidence, ASI diagnostics, overfit checks, and cost/latency budgets.assertReleaseConfidence, a throwing variant for CI/release scripts.releaseTraceEvidenceFromMultiShotTrials, a helper that projectsMultiShotTrialResultrows into release trace evidence so single-shot and variable multi-shot apps use the same release gate.
- Removed the legacy pairwise prompt optimizer surface:
PromptOptimizer,OptimizationLoop, and their associated root-exported types are gone. The blessed optimization path is nowrunMultiShotOptimizationfor task trajectories and the steering-specific optimizers for explicit steering tables. - Removed the old
PromptVariantroot export. Public callers should useMultiShotVariantfor multi-shot trajectory optimization orEvolvableVariantfor the lower-level prompt/code evolution core.
- Documentation now points optimization users at
runMultiShotOptimizationinstead of the removed pairwise prompt optimizer.
runMultiShotOptimization, the canonical GEPA-style adapter for variable-length agent trajectories. It wrapsrunPromptEvolutionwhile preserving full multi-shot traces, actionable side information, stable paired seeds, score/cost objectives, and optional held-out promotion gating.trialTraceFromMultiShotTrial, a bridge from multi-shot trial results into reflective mutation prompts.ActionableSideInfo,MultiShotVariant,MultiShotTrace,MultiShotRun,MultiShotScore,MultiShotTrialResult,MultiShotMutateAdapter, and related public types.docs/multi-shot-optimization.mdandexamples/multi-shot-optimization/index.ts.
- The multi-shot result shape explicitly separates
searchBestVariantfrompromotedVariant. If a holdout gate rejects the search winner, the promoted variant is the baseline. runMultiShotOptimizationvalidates release-critical configuration up front: unique variant/scenario ids, positive integer run counts, population size, disjoint search/holdout ids, and a gate baseline key matching the first seed variant.
runAgentControlLoop, a genericobserve -> validate -> decide -> actruntime for agentic tasks with step, wall-clock, and recorded-cost budgets; no-progress and repeated-action stop policies; structured runtime failures; objective/subjective eval helpers; andTraceStoreemission.runProposeReviewAsControlLoop, a bridge preset that expresses propose/verify/review as a specialization of the generic control runtime.- feedback trajectory helpers for turning control-loop runs and user/judge labels into reusable dataset scenarios, optimizer rows, and preference memory.
docs/control-runtime.md, with integration patterns for tax, legal, agent-builder, and film-agent products.
- control runtime trace sink and
onStepcallback failures are now recorded as structured runtime errors without aborting an otherwise valid run. runProposeReviewAsControlLoopaccepts a caller-provided verifier failure mapper for domain-specific failure classes.
This release tightens the public benchmark surface and lands internal usage guidance that the v0.15 dispatch couldn't write.
src/benchmarks/gsm8k/→examples/benchmarks/gsm8k/src/benchmarks/swebench-lite/→examples/benchmarks/swebench-lite/
These are reference implementations of BenchmarkAdapter, not core surface. Consumers read them, copy them, adapt them. The novel routing benchmark stays in src/benchmarks/ because it's our own and broadly useful.
src/benchmarks/index.ts now exports the shared types + the routing benchmark only. The previous gsm8k and swebenchLite namespace exports are gone — import directly from examples/benchmarks/<name>/index.ts (or copy the wrapper into your own project).
examples/benchmarks/README.mddocuments how to use, copy, and extend the example wrappers.- Internal agent-eval usage guidance gains production-rigor and pitfalls sections covering the v0.16 primitives.
If you imported gsm8k or swebenchLite from @tangle-network/agent-eval/benchmarks:
// before
import { gsm8k, swebenchLite } from '@tangle-network/agent-eval/benchmarks'
// after — copy the file from examples/benchmarks/<name>/index.ts into your project,
// or import via relative path from the cloned repo.The routing benchmark and the shared BenchmarkAdapter types are unchanged.
The v0.15 primitives were framed as "paper-grade" but most are production-rigor utilities any team needs. This release renames the three reporting helpers and drops the "paper" framing from the public API. Behavior unchanged.
paperTable→summaryTableparetoFigure→paretoChartgainDistributionFigure→gainHistogramPaperTable/PaperTableOptions/PaperTableRowtypes →SummaryTable/SummaryTableOptions/SummaryTableRow- File:
src/paper-report.ts→src/summary-report.ts
Drop-in: search-and-replace the three function names and the file path. Type names follow the same pattern. No behavior change.
// before
import { paperTable, paretoFigure, gainDistributionFigure } from '@tangle-network/agent-eval'
// after
import { summaryTable, paretoChart, gainHistogram } from '@tangle-network/agent-eval'Substrate for the "Two Loops, Three Roles" paper on multi-level prompt optimization with held-out promotion gates.
HeldOutGate(src/promotion-gate.ts) — first-class held-out paired-delta promotion gate. Three checks: minimum productive runs, positive lower bound on bootstrap CI of paired holdout median delta, bounded overfit-gap relative to baseline. Decisions carry a machine-readablerejectionCode(few_runs|negative_delta|overfit_gap) plus anevidenceblock with every number the gate read. Generalizes the inline pattern that lived inredteam/scripts/agent-eval-autoresearch.ts:138–171.RunRecord(src/run-record.ts) — paper-grade JSON-friendly run schema with mandatory fields:runId,experimentId,candidateId,seed, snapshot-versionedmodel,promptHash,configHash,commitSha,wallMs,costUsd,tokenUsage,outcome,splitTag. Runtime validator (validateRunRecord,isRunRecord,parseRunRecordSafe,roundTripRunRecord) throws on missing fields and on bare model aliases without snapshot suffix.Researcher(src/researcher.ts) — stable hook for an autonomous-research agent:inspectFailures→proposeChange→applyChange→evaluateChange.NoopResearcheris the fail-loud placeholder. Implementations live downstream.- Reference benchmarks (
src/benchmarks/) — three adapters that share theBenchmarkAdapter<TItem, TPayload>shape:gsm8k: HF-mirror loader (JSONL viaAGENT_EVAL_GSM8K_PATH), exact-match grading viaparseGsm8kAnswer.swebench-lite: 30-instance subset stub. Loader readsAGENT_EVAL_SWEBENCH_PATH; grader shells out toAGENT_EVAL_SWEBENCH_GRADER_CMD. Both fail loud when unset.routing: synthetic 16-task router benchmark, ships in the package, dependency-free. Format documented insrc/benchmarks/routing/README.md.deterministicSplit(itemId, seed?): stable 60/20/20 split via FNV-1a hash. Default seedagent-eval-v1.
summaryTable,paretoChart,gainHistogram(sr./summary-report.ts) — Table 1 + Pareto + gain-distribution specs. Returns data structures (markdown table, point lists, histogram bins); caller picks the plotting library.runCanaries(src/canary.ts) — three liveness canaries: silent judge fallback (consecutive constant-confidence streak), judge calibration drift (KS test on confidence distribution), eval-set distribution shift (chi-square on category bucket counts).pairedBootstrap,pairedWilcoxon,bhAdjust(src/paired-stats.ts) — paper-style aliases + the missing paired bootstrap CI primitive. Deterministic with optional seed.
- No breaking changes. Every existing module is untouched; new types are additive.
- All new public symbols carry JSDoc.
- 87 new tests across 7 new test files. 571 total tests pass.
- See the package docs for usage directives and pitfalls.
intent-match + flow-layer + deploy-gate + concept complexity weighting.
LayerResult.diagnostics + buildReviewerPrompt +
createDefaultReviewer + mergeLayerResults options.
CommandRunner contract + multiToolchainLayer + Finding.detail.
probeLlm + keyword-coverage-judge. Honestly-absent primitives
backfilled — llm-client, multi-layer verifier, semantic concept judge,
extractor utilities.
Extracted muffled-gate scanner; CostTracker.recordVerdict. Footgun
fix: cwd belongs in HarnessConfig, not the driver constructor.
Tier 1 (meta-eval correlation, PRM, bisector), Tier 2 (counterfactual, cross-trace diff, pre-registration), Tier 3 (self-play, causal attribution, active learning, RM export), governance templates.