agent-eval is for deciding whether an agent run should pass, keep working, be
replayed, be optimized, or be promoted.
It exists because agent output is not evidence. A model can say a task is done while the build fails, the browser flow is broken, the integration was never connected, or the answer lacks required sources. The package gives products a shared way to record runs, check outcomes, classify failures, compare variants, and make release decisions.
| Thing | What it is | One-line example |
|---|---|---|
| Judge | A function that scores one piece of output. | "Did this scaffold implement async fetching?" |
| Rubric | The recipe a judge uses — what to score on, with what weights. | "Score on buyer_quality (0.5), voice (0.3), signal (0.2)." |
| Verifier | A pipeline of judges run in order, with dependencies. | "install → typecheck → build → semantic" |
| Feedback trajectory | A multi-shot record of attempts, approvals, rejections, edits, metrics, and policy outcomes. | "draft → user rejects → revised draft → approved → measured" |
Everything else exists to make those objects useful in real product loops: traces, datasets, control runtime, optimizers, statistics, and reports.
When the thing being evaluated is an agent that should keep working, use
runAgentControlLoop. It turns validators into a
runtime loop: observe typed state, validate it, decide the next action, act,
and repeat until the task passes, blocks, times out, spends too much, or stops
making progress.
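The loop described above can be sketched in a few lines. This is an illustration of the observe / validate / decide / act cycle and its stop conditions, not the actual runAgentControlLoop signature. Every name here (`controlLoopSketch`, `Verdict`, `LoopOptions`, `StopReason`) is hypothetical, the callbacks are synchronous for brevity, and a step cap stands in for a wall-clock timeout.

```typescript
// Hypothetical types for illustration only; not the agent-eval API.
type StopReason = "passed" | "blocked" | "timeout" | "budget" | "stalled";

interface Verdict {
  pass: boolean;     // the task is done
  blocked?: boolean; // the agent cannot proceed without outside help
  score: number;     // progress signal, higher is better
}

interface LoopOptions<S> {
  observe: () => S;                            // read typed state
  validate: (state: S) => Verdict;             // run validators on that state
  act: (state: S, verdict: Verdict) => number; // take one action, return its cost
  maxSteps: number;        // assumption: a step cap stands in for a timeout
  maxCost: number;         // spend ceiling
  maxStalledSteps: number; // consecutive steps allowed without a better score
}

interface LoopResult {
  reason: StopReason;
  steps: number;
  cost: number;
}

function controlLoopSketch<S>(opts: LoopOptions<S>): LoopResult {
  let cost = 0;
  let best = -Infinity;
  let stalled = 0;
  for (let step = 1; step <= opts.maxSteps; step++) {
    const state = opts.observe();
    const verdict = opts.validate(state);
    if (verdict.pass) return { reason: "passed", steps: step, cost };
    if (verdict.blocked) return { reason: "blocked", steps: step, cost };
    // Track progress: reset the stall counter only when the score improves.
    if (verdict.score > best) {
      best = verdict.score;
      stalled = 0;
    } else {
      stalled += 1;
    }
    if (stalled >= opts.maxStalledSteps) return { reason: "stalled", steps: step, cost };
    cost += opts.act(state, verdict);
    if (cost > opts.maxCost) return { reason: "budget", steps: step, cost };
  }
  return { reason: "timeout", steps: opts.maxSteps, cost };
}
```

The point of the sketch is that every exit is a named reason, so a caller can tell "passed" apart from "gave up" without reading the agent's own output.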
When normal agent usage should become reusable training/eval signal, use
FeedbackTrajectory. It captures approvals,
rejections, edits, option choices, metrics, and policy blocks as portable data
that can seed memory, replay scenarios, and optimization.
| Term | Plain English |
|---|---|
| Artifact | The thing being judged. Often a workdir of files, sometimes a string of text. |
| Snapshot | A frozen view of an artifact (every file path → content). What the judge actually reads. |
| Harness | A description of how to run the artifact: setup command, test command, working dir, timeout. |
| Sandbox driver | The thing that actually executes commands inside the harness. Local subprocess, or remote container. |
| Layer | One stage of a verifier pipeline (install, typecheck, build, semantic, …). |
| Finding | A specific issue a judge found — file, line, severity, message. |
| Trace store | The append-only log of every span/event during a run. Replay = read this back. |
| Composite score | A 0..1 number combining all dimensions. The single number you gate on. |
| Rubric version | A stable hash of the rubric. Scores from different rubric versions are not comparable. |
| Muffled gate | A check that should fail loud but silently passes (e.g. `command |
For agentic systems, the highest-quality labels often come from normal review workflow, not a separate labeling UI:
agent proposes -> user approves/rejects/edits/selects -> agent revises -> outcome is measured
FeedbackTrajectory is the portable record of that loop. Browser agents can
store task outcomes, coding agents can store patch review plus test results,
and research agents can store reviewer corrections. The domain changes; the
shape stays the same.
Those trajectories can be converted into preference memory, DatasetScenario
rows, optimizer rows, and held-out examples for overfit checks.
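To make the shape concrete, here is a minimal sketch of a trajectory as plain data and one possible conversion into preference pairs. The event union, `TrajectorySketch`, and `toPreferencePairs` are hypothetical names invented for this example; the real FeedbackTrajectory schema and converters may look different.

```typescript
// Hypothetical event shapes for illustration; not the FeedbackTrajectory schema.
type FeedbackEventSketch =
  | { kind: "proposal"; id: string; content: string }
  | { kind: "approval"; proposalId: string }
  | { kind: "rejection"; proposalId: string; reason?: string }
  | { kind: "edit"; proposalId: string; editedContent: string }
  | { kind: "metric"; name: string; value: number };

interface TrajectorySketch {
  task: string;
  events: FeedbackEventSketch[];
}

interface PreferencePairSketch {
  task: string;
  chosen: string;
  rejected: string;
}

// Assumption: approved content and user edits count as "chosen";
// rejected proposals and the pre-edit originals count as "rejected".
function toPreferencePairs(t: TrajectorySketch): PreferencePairSketch[] {
  const proposals: Record<string, string> = {};
  const chosen: string[] = [];
  const rejected: string[] = [];
  for (const e of t.events) {
    if (e.kind === "proposal") {
      proposals[e.id] = e.content;
    } else if (e.kind === "approval") {
      const content = proposals[e.proposalId];
      if (content !== undefined) chosen.push(content);
    } else if (e.kind === "rejection") {
      const content = proposals[e.proposalId];
      if (content !== undefined) rejected.push(content);
    } else if (e.kind === "edit") {
      const original = proposals[e.proposalId];
      if (original !== undefined) rejected.push(original);
      chosen.push(e.editedContent);
    }
    // Metric events carry no preference signal in this sketch.
  }
  const pairs: PreferencePairSketch[] = [];
  for (const c of chosen) {
    for (const r of rejected) {
      pairs.push({ task: t.task, chosen: c, rejected: r });
    }
  }
  return pairs;
}
```

The same event list works for a browser task, a patch review, or a research correction; only the `content` strings change.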
When the artifact is generated code, agent-eval scores it at three independent layers. Each layer fails differently, and you want to know which one broke:
L0 builder Did the agent's session itself work?
(Did it produce an artifact at all?)
│
▼
L1 app-build Does the artifact build / typecheck / test?
(Static signal, ground-truth gate.)
│
▼
L2 app-runtime Does the artifact actually run end-to-end?
(Dynamic signal — only worth checking if L1 passed.)
BuilderSession orchestrates this. It opens at startChat, runs the build at ship, and runs the runtime check at runAppScenario. Each layer emits a trace span, and scoreProject aggregates the layers into the composite score.
Why three? Because each catches a different failure mode:
- L0 failures: the agent crashed mid-generation and left you with a half-written file.
- L1 failures: the files exist but typecheck fails. LLM judges can't reliably catch this.
- L2 failures: the code compiles but does the wrong thing at runtime.
If you only check one layer, you ship the bugs that the other two layers would have caught.
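A rough sketch of the gating order, assuming each layer reduces to a boolean check. `runThreeLayersSketch`, `firstBrokenLayer`, and the `LayerChecks` shape are hypothetical and are not how BuilderSession is implemented; the sketch only shows that a downstream layer is skipped, not failed, when an upstream layer breaks, so the report names the layer that actually broke.

```typescript
// Hypothetical names for illustration; not the BuilderSession API.
type LayerId = "L0 builder" | "L1 app-build" | "L2 app-runtime";
type LayerStatus = "pass" | "fail" | "skipped";

interface LayerOutcome {
  layer: LayerId;
  status: LayerStatus;
}

interface LayerChecks {
  builder: () => boolean;    // did the session produce an artifact at all?
  appBuild: () => boolean;   // does the artifact build / typecheck / test?
  appRuntime: () => boolean; // does the artifact run end-to-end?
}

function runThreeLayersSketch(checks: LayerChecks): LayerOutcome[] {
  const order: Array<{ layer: LayerId; check: () => boolean }> = [
    { layer: "L0 builder", check: checks.builder },
    { layer: "L1 app-build", check: checks.appBuild },
    { layer: "L2 app-runtime", check: checks.appRuntime },
  ];
  const outcomes: LayerOutcome[] = [];
  let upstreamOk = true;
  for (const { layer, check } of order) {
    if (!upstreamOk) {
      // Not worth checking: an earlier layer already broke.
      outcomes.push({ layer, status: "skipped" });
      continue;
    }
    const ok = check();
    outcomes.push({ layer, status: ok ? "pass" : "fail" });
    upstreamOk = ok;
  }
  return outcomes;
}

// Returns the layer that broke, or null when every layer passed.
function firstBrokenLayer(outcomes: LayerOutcome[]): LayerId | null {
  for (const o of outcomes) {
    if (o.status === "fail") return o.layer;
  }
  return null;
}
```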
A rubric describes:
- Dimensions: the axes you score on (e.g. buyer_quality, voice, signal).
- Weights: how to combine dimensions into a composite (0.5 * buyer_quality + 0.3 * voice + 0.2 * signal).
- Failure modes: named patterns the judge looks for ("ai-cadence", "vague-claim").
- Wins: named positive patterns ("specific-component", "earned-detail").
- System prompt: what to tell the judging LLM about the persona and the task.
Built-in rubrics ship in src/wire/rubrics.ts (e.g. anti-slop for technical-buyer voice). You can also pass a rubric inline — the same shape, just defined at the call site.
A rubric is plain data. The hash of that data is the rubricVersion. Two scores are only comparable if they used the same rubricVersion — change the rubric and you start a new comparison series.
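As an illustration of "a rubric is plain data", the sketch below defines a rubric object, computes the weighted composite, and derives a version string from a key-order-independent serialization. The `RubricSketch` shape and the helper names are hypothetical, and the FNV-1a hash is a stand-in chosen so the example has no dependencies; the hash that agent-eval actually uses for rubricVersion is not specified here.

```typescript
// Hypothetical rubric shape for illustration; not the agent-eval type.
interface RubricSketch {
  name: string;
  dimensions: string[];
  weights: Record<string, number>;
  failureModes: string[];
  wins: string[];
  systemPrompt: string;
}

// Weighted sum of per-dimension scores (each expected in 0..1).
function compositeScore(rubric: RubricSketch, scores: Record<string, number>): number {
  let total = 0;
  for (const dim of rubric.dimensions) {
    const w = rubric.weights[dim];
    const s = scores[dim];
    if (w === undefined || s === undefined) {
      throw new Error("missing weight or score for " + dim);
    }
    total += w * s;
  }
  return total;
}

// Serialize with sorted object keys so property order does not change the hash.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map((v) => stableStringify(v)).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + stableStringify(obj[k])).join(",") + "}";
  }
  return JSON.stringify(value);
}

// 32-bit FNV-1a over UTF-16 code units. Assumption: a stand-in hash for the sketch.
function fnv1aHex(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts to stay in 32 bits.
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

function rubricVersionSketch(rubric: RubricSketch): string {
  return fnv1aHex(stableStringify(rubric));
}
```

Any edit to a weight, a dimension, or the system prompt changes the serialized data and therefore the version, which is what starts a new comparison series.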
When you have a multi-step pipeline (install → typecheck → build → lint → semantic), use MultiLayerVerifier:
const verifier = new MultiLayerVerifier([
installLayer, // runs `pnpm install`
typecheckLayer, // runs `tsc --noEmit`, depends on install
buildLayer, // runs `pnpm build`, depends on typecheck
semanticLayer, // LLM judge, weight 3, depends on build
])
const report = await verifier.run({ env: { runner, workdir, ... } })
report.allPass // boolean — every layer passed
report.blendedScore // 0..1 — weighted aggregate
report.layers // per-layer status, findings, duration

Two rules that will save you bugs:
- Run both gates. Build gates catch code that doesn't compile; structural assertions catch missing files. Run both unconditionally, because they catch orthogonal failures.
- Pair LLM judges with build outcomes. An LLM judge will rate non-compiling code as "looks right" (0.8). Always short-circuit on buildOutcome.passed === false before any LLM judging.
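The second rule falls out of dependency ordering. The sketch below is a toy verifier, not MultiLayerVerifier: `LayerSketch`, `runVerifierSketch`, and the report shape are hypothetical, and the blending rule (weighted mean, with skipped layers counted as 0) is an assumption made for the example. It shows a layer being skipped when any dependency did not pass, so the LLM judge never gets to score a broken build.

```typescript
// Hypothetical shapes for illustration; not the MultiLayerVerifier API.
interface LayerSketch {
  name: string;
  weight: number;
  dependsOn: string[];
  run: () => { pass: boolean; score: number };
}

interface LayerReportSketch {
  name: string;
  status: "pass" | "fail" | "skipped";
  score: number;
}

interface VerifierReportSketch {
  allPass: boolean;
  blendedScore: number;
  layers: LayerReportSketch[];
}

function runVerifierSketch(layers: LayerSketch[]): VerifierReportSketch {
  const passed: Record<string, boolean> = {};
  const reports: LayerReportSketch[] = [];
  let weighted = 0;
  let totalWeight = 0;
  for (const layer of layers) {
    totalWeight += layer.weight;
    // A layer runs only if every dependency ran and passed.
    const depsOk = layer.dependsOn.every((dep) => passed[dep] === true);
    if (!depsOk) {
      passed[layer.name] = false;
      reports.push({ name: layer.name, status: "skipped", score: 0 });
      continue;
    }
    const result = layer.run();
    passed[layer.name] = result.pass;
    reports.push({ name: layer.name, status: result.pass ? "pass" : "fail", score: result.score });
    weighted += layer.weight * result.score;
  }
  return {
    allPass: reports.every((r) => r.status === "pass"),
    // Assumption: skipped layers contribute 0 to the weighted mean.
    blendedScore: totalWeight > 0 ? weighted / totalWeight : 0,
    layers: reports,
  };
}
```

With a failing build layer, a semantic layer that would have returned 0.8 is reported as skipped and contributes nothing to the blended score.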
Two questions to answer before trusting any LLM judge:
- Does it agree with humans? calibrateJudge(golden, candidate) reports Pearson, MAE, integer-rounded κ, and worst-N miscalibrations vs a human golden set.
- Does it agree with itself / other judges? continuousAgreement(scores) and calibrateJudgeContinuous(golden, candidate) report κ_w + ICC(2,1) + Pearson + Spearman with bootstrap 95% CIs on the raw [0,1] scores.
Why two κ flavours: the original calibrateJudge rounds scores to ints before computing κ. For fine-grained judges that loses information — 0.78 vs 0.81 both round to "1" and look perfectly agreed. Use calibrateJudgeContinuous (or continuousAgreement for N≥2 raters) when scores are continuous. ICC(2,1) catches systematic bias that Pearson misses: if judge B scores 2× judge A, Pearson stays ≈ 1 while ICC drops — that's the signal.
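The Pearson-versus-ICC claim is easy to check by hand. The sketch below implements both statistics from their textbook definitions; `pearson` and `icc21` are standalone helpers written for this example, not the functions the package exports. ICC(2,1) here is the two-way random-effects, absolute-agreement, single-rater form.

```typescript
function mean(xs: number[]): number {
  let sum = 0;
  for (const x of xs) sum += x;
  return sum / xs.length;
}

// Pearson correlation between two equally long score lists.
function pearson(a: number[], b: number[]): number {
  const ma = mean(a);
  const mb = mean(b);
  let sab = 0;
  let saa = 0;
  let sbb = 0;
  for (let i = 0; i < a.length; i++) {
    const da = a[i] - ma;
    const db = b[i] - mb;
    sab += da * db;
    saa += da * da;
    sbb += db * db;
  }
  return sab / Math.sqrt(saa * sbb);
}

// ICC(2,1): ratings[i][j] is the score rater j gave to item i.
function icc21(ratings: number[][]): number {
  const n = ratings.length;    // items
  const k = ratings[0].length; // raters
  let grand = 0;
  for (const row of ratings) for (const x of row) grand += x;
  grand /= n * k;

  let ssRows = 0;
  let ssTotal = 0;
  for (const row of ratings) {
    const rowMean = mean(row);
    ssRows += k * (rowMean - grand) * (rowMean - grand);
    for (const x of row) ssTotal += (x - grand) * (x - grand);
  }
  let ssCols = 0;
  for (let j = 0; j < k; j++) {
    const colMean = mean(ratings.map((row) => row[j]));
    ssCols += n * (colMean - grand) * (colMean - grand);
  }
  const ssErr = ssTotal - ssRows - ssCols;

  const msr = ssRows / (n - 1);             // between-item mean square
  const msc = ssCols / (k - 1);             // between-rater mean square
  const mse = ssErr / ((n - 1) * (k - 1));  // residual mean square
  return (msr - mse) / (msr + (k - 1) * mse + (k * (msc - mse)) / n);
}

// Judge B scores exactly 2x judge A on five items.
const judgeA = [0.1, 0.2, 0.3, 0.4, 0.5];
const judgeB = judgeA.map((x) => 2 * x);
const r = pearson(judgeA, judgeB);                            // 1 up to rounding
const icc = icc21(judgeA.map((x, i) => [x, judgeB[i]]));      // about 0.48
```

Pearson only sees that the two judges rank and space items the same way; ICC also penalizes the fact that they disagree on the absolute numbers.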
Bias probes (positionalBias, verbosityBias, selfPreference) cover the orthogonal failure modes: position-dependent scoring, length-correlated scoring, and judge-prefers-its-own-family.
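As one example of what such a probe measures, here is a minimal position-swap check. It is a sketch under assumptions: `PairwiseJudge` and `positionalFlipRate` are hypothetical names, and the real positionalBias probe may take different inputs and report different statistics. The idea is to ask the judge the same comparison in both orders and count how often the winner changes.

```typescript
// Hypothetical pairwise judge: says which of the two presented outputs is better.
type PairwiseJudge = (first: string, second: string) => "first" | "second";

// Fraction of pairs whose winner changes when presentation order is swapped.
// 0 means order never mattered; 1 means the verdict always followed position.
// Assumes the two outputs in each pair are distinct strings.
function positionalFlipRate(judge: PairwiseJudge, pairs: Array<[string, string]>): number {
  if (pairs.length === 0) return 0;
  let flips = 0;
  for (const [a, b] of pairs) {
    const forwardWinner = judge(a, b) === "first" ? a : b;
    const reversedWinner = judge(b, a) === "first" ? b : a;
    if (forwardWinner !== reversedWinner) flips += 1;
  }
  return flips / pairs.length;
}
```

A judge that always picks whatever it sees first scores 1 on this probe; a judge that decides on content alone scores 0.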
Every operation emits structured spans into a TraceStore. A run is a tree:
builder-session [span]
├── chat-turn [span]
├── ship [span]
│ ├── harness.install [span]
│ ├── harness.typecheck [span]
│ └── harness.build [span]
└── app-runtime [span]
└── scenario.run [span]
Spans are append-only and have stable ids — replay is reading the same store back. OTLP export ships them out for distributed tracing.
You usually should not build this tree by hand. Product runtimes,
runAgentControlLoop, harnesses, and verifiers should emit it while they run.
Use traces when debugging a flaky run, building replay data, or explaining a
release decision.
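For intuition about what "append-only with stable ids" buys you, here is a toy in-memory store. It is illustrative only: `InMemoryTraceStoreSketch`, `SpanSketch`, and `renderTree` are hypothetical names, and the real TraceStore carries far more per span (timing, attributes, events). Replay in this sketch is literally reading the appended spans back and rebuilding the tree from parent ids.

```typescript
// Hypothetical span shape for illustration; not the TraceStore schema.
interface SpanSketch {
  id: string;
  parentId: string | null;
  name: string;
}

class InMemoryTraceStoreSketch {
  private spans: SpanSketch[] = [];
  private nextId = 1;

  // Append-only: spans are added, never mutated or removed.
  append(name: string, parentId: string | null = null): SpanSketch {
    const span: SpanSketch = { id: "span-" + this.nextId, parentId, name };
    this.nextId += 1;
    this.spans.push(span);
    return span;
  }

  // Replay is reading the same data back, in append order.
  replay(): SpanSketch[] {
    return this.spans.map((s) => ({ id: s.id, parentId: s.parentId, name: s.name }));
  }
}

// Rebuild the indented tree from parent ids alone.
function renderTree(spans: SpanSketch[], parentId: string | null = null, depth = 0): string[] {
  const indent = new Array(depth + 1).join("  ");
  let lines: string[] = [];
  for (const span of spans) {
    if (span.parentId !== parentId) continue;
    lines.push(indent + span.name);
    lines = lines.concat(renderTree(spans, span.id, depth + 1));
  }
  return lines;
}
```

Because ids are stable and nothing is rewritten, two replays of the same store always render the same tree.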
- Need the layman feature map? → feature-guide.md — what each primitive does, when to use it, integration patterns, and guardrails.
- Just want to score a string against a rubric? → wire-protocol.md — HTTP/RPC interface, pluggable from any language.
- Need a reusable driver/worker/evaluator loop? → control-runtime.md — generic runtime plus coding, browser, computer-use, and research integration patterns.
- Want review feedback to become eval/optimization data? → feedback-trajectories.md — turn feedback into datasets, optimizer rows, and preference memory.
- Building a code-generator eval? → Start with BuilderSession, SandboxHarness, and MultiLayerVerifier.
- Multi-layer verifier? → Use control-runtime.md and MultiLayerVerifier for ordered gates with dependencies.
- Adding a new judge or rubric? → src/wire/rubrics.ts for the cross-language path; src/anti-slop.ts and src/judges.ts for the in-process path.