tml-2735: golden-case library + run-one-brief harness by wmadden · Pull Request #641 · prisma/prisma-next

wmadden · 2026-05-30T16:22:47Z

Decision

A golden-case library and a minimal live harness are how we accrete the instrumented-run corpus that the Drive LLM judge (TML-2736) calibrates against, and they provide the run-spawning mechanism on which the experiment engine (TML-2737) builds the k=N A/B loop. This slice ships both and validates the post-hoc trace parser (clears TML-2728). It is the foundation slice of project Drive — Judge + live-experiment harness (Parallel group F), stacked on planning PR #638.

What's here

1. Golden-case library — `projects/drive-judge-harness/assets/golden/`

5 canonical Drive briefs spanning the Drive-shape space, each with case.json + brief.md + acceptance.md (expected verdict + requirements + correctness oracle) + a pre-written drive-qa-plan manual-qa.md (so the Tier-1 QA signal is deterministic at run time):

| Slug | Shape | Probes |
| --- | --- | --- |
| `direct-change-diagnostic-wording` | direct change | smallest legitimate unit; no spec/plan ceremony |
| `slice-cli-list-flag` | single slice | one coherent PR; shared-source design quality |
| `project-retry-policy` | multi-slice project | sequencing, stacking, explicit opt-in |
| `i12-halt-storage-assumption` | I12 halt / re-plan | brief built on a false premise — a correct run halts + re-plans, doesn't confabulate a green-but-wrong result |
| `spike-first-flaky-test` | spike-first | unknown cause — spike before fixing, no evidence-free mask |

2. `run-one-brief` harness — `skills-contrib/drive-judge-harness/`

Spawns one orchestrator run on a golden brief with a pinned model, accumulates per-run token usage from the SDK's turn-ended usage (inputTokens/outputTokens/cacheReadTokens/cacheWriteTokens), and writes a run manifest.
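
The accumulation step described above can be sketched as follows. This is a minimal illustration, not the harness's actual `usage.ts`: the four counter names come from the description, while the exact type shapes, the `safe` helper, and the choice to define `totalTokens` as the sum of all four counters are assumptions.

```typescript
// Hypothetical shapes: only the four counter names are taken from the PR text.
interface TurnUsage {
  inputTokens?: number;
  outputTokens?: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
}

interface TokenTotals {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  totalTokens: number;
}

function emptyTotals(): TokenTotals {
  return { inputTokens: 0, outputTokens: 0, cacheReadTokens: 0, cacheWriteTokens: 0, totalTokens: 0 };
}

// Assumed helper: missing, NaN, and infinite counts are coerced to zero.
function safe(n: number | undefined): number {
  return typeof n === "number" && Number.isFinite(n) ? n : 0;
}

// Pure: returns a new totals object instead of mutating the input.
function accumulateUsage(totals: TokenTotals, usage: TurnUsage): TokenTotals {
  const inputTokens = totals.inputTokens + safe(usage.inputTokens);
  const outputTokens = totals.outputTokens + safe(usage.outputTokens);
  const cacheReadTokens = totals.cacheReadTokens + safe(usage.cacheReadTokens);
  const cacheWriteTokens = totals.cacheWriteTokens + safe(usage.cacheWriteTokens);
  return {
    inputTokens,
    outputTokens,
    cacheReadTokens,
    cacheWriteTokens,
    // Assumption: the total is the sum of all four counters.
    totalTokens: inputTokens + outputTokens + cacheReadTokens + cacheWriteTokens,
  };
}

// Two illustrative turn-ended usage payloads; the second has a non-finite count.
const turns: TurnUsage[] = [
  { inputTokens: 100, outputTokens: 20 },
  { inputTokens: 50, outputTokens: Number.NaN, cacheReadTokens: 30 },
];
const totals = turns.reduce(accumulateUsage, emptyTotals());
```

Keeping the accumulator pure is what makes it unit-testable without spawning a run: a test can feed it hand-written usage payloads and compare totals.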

No-API-key-in-CI guarantee. Live execution requires both --live and CURSOR_API_KEY; the default is a dry-run that makes no network call. @cursor/sdk is reached only through sdk-adapter.ts's dynamic import on the live path, so typecheck / test / lint / CI all stay green with no key set and @cursor/sdk not installed. Tests inject a mock createAgent (zero live calls). The token accumulator and manifest writer are pure and unit-tested.
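
A minimal sketch of that gate, under stated assumptions: `resolveMode`, `startRun`, and the `CreateAgent` signature are hypothetical names for illustration, and falling back to dry-run when `--live` is passed without a key is a guess at the behaviour (the real CLI may report an error instead). The loader is injected so that the SDK-backed module is only ever touched on the live path.

```typescript
type RunMode = "dry-run" | "live";

interface GateInput {
  argv: readonly string[];
  env: Readonly<Record<string, string | undefined>>;
}

// Live mode needs BOTH the flag and a non-empty key; anything else is a dry-run.
function resolveMode({ argv, env }: GateInput): RunMode {
  const wantsLive = argv.includes("--live");
  const key = env["CURSOR_API_KEY"];
  const hasKey = typeof key === "string" && key.length > 0;
  return wantsLive && hasKey ? "live" : "dry-run";
}

// Hypothetical agent factory signature, standing in for the adapter's export.
type CreateAgent = (prompt: string) => Promise<string>;

// The loader is only awaited on the live path, mirroring a lazy dynamic import.
async function startRun(
  input: GateInput,
  prompt: string,
  loadAgent: () => Promise<CreateAgent>,
): Promise<string> {
  if (resolveMode(input) === "dry-run") {
    return `dry-run: ${prompt}`;
  }
  const createAgent = await loadAgent();
  return createAgent(prompt);
}

// A mock loader that counts how often it is invoked.
let loaderCalls = 0;
const mockLoader = async (): Promise<CreateAgent> => {
  loaderCalls += 1;
  return async (p: string) => `ran: ${p}`;
};
```

In a test, passing `mockLoader` and an empty environment should leave `loaderCalls` at zero, which is the property that keeps CI free of live calls.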

3. Post-hoc parser validation (clears TML-2728)

validate-parser.ts runs drive-diagnose-run/posthoc.ts over a ≥3-transcript corpus; projects/drive-judge-harness/slices/golden-case-harness/parser-validation.md records per-event confidence. Result: 12 events reconstructed — 0 high · 6 medium (dispatch existence) · 6 low (spec/plan authoring). The parser is structurally capped at medium (transcripts lack ground-truth envelope fields); it is robust to sparse and rich transcripts; no behaviour bug surfaced — posthoc.ts unchanged.
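
The per-confidence tally reported above can be pictured with a sketch like the one below. The `ReconstructedEvent` shape, the `dispatch` event name, and the sample data are illustrative assumptions rather than the output of `posthoc.ts`; only the three confidence levels and the `spec-authored` event name are taken from this PR.

```typescript
type Confidence = "high" | "medium" | "low";

// Hypothetical shape for an event reconstructed from a transcript.
interface ReconstructedEvent {
  type: string;
  confidence: Confidence;
}

// Count events per confidence level, always reporting all three buckets.
function tallyByConfidence(events: readonly ReconstructedEvent[]): Record<Confidence, number> {
  const tally: Record<Confidence, number> = { high: 0, medium: 0, low: 0 };
  for (const event of events) {
    tally[event.confidence] += 1;
  }
  return tally;
}

// Render a one-line summary in the same style as the result quoted above.
function renderSummary(events: readonly ReconstructedEvent[]): string {
  const t = tallyByConfidence(events);
  return `${events.length} events reconstructed: ${t.high} high · ${t.medium} medium · ${t.low} low`;
}

// Made-up sample, deliberately smaller than the real corpus.
const sampleEvents: ReconstructedEvent[] = [
  { type: "dispatch", confidence: "medium" },
  { type: "dispatch", confidence: "medium" },
  { type: "spec-authored", confidence: "low" },
];
const summaryLine = renderSummary(sampleEvents);
```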

Dependency note (TML-2720)

The canonical `tokens` trace field is owned by the parallel slice TML-2720 (which owns schema.ts/metrics.ts/report.ts, all untouched here). Until that schema field lands, the harness writes token totals to a run manifest beside the trace rather than emitting an unvalidatable trace line through the fail-closed emitter. When TML-2720 lands, the manifest's `tokens` field migrates into the validated trace via emit.ts.
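
As a rough picture of that transitional manifest, here is a sketch of a writer that pretty-prints JSON with a trailing newline and creates missing parent directories. The `RunManifest` fields, the `RunStatus` values, the model name, and the file layout are assumptions for illustration; the real `manifest.ts` may differ.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical status values and manifest fields.
type RunStatus = "dry-run" | "completed" | "error";

interface RunManifest {
  slug: string;
  model: string;
  status: RunStatus;
  tokens?: { inputTokens: number; outputTokens: number; totalTokens: number };
  notes?: string;
}

// Create parent directories as needed, then write pretty JSON plus a newline.
function writeManifest(path: string, manifest: RunManifest): void {
  mkdirSync(dirname(path), { recursive: true });
  writeFileSync(path, `${JSON.stringify(manifest, null, 2)}\n`, "utf8");
}

// Usage: write into a fresh temp directory and read the file back.
const manifestDir = mkdtempSync(join(tmpdir(), "manifest-"));
const manifestPath = join(manifestDir, "runs", "example", "manifest.json");
writeManifest(manifestPath, {
  slug: "spike-first-flaky-test",
  model: "example-model", // placeholder, not a real pinned model id
  status: "dry-run",
  tokens: { inputTokens: 0, outputTokens: 0, totalTokens: 0 },
});
const writtenText = readFileSync(manifestPath, "utf8");
const roundTripped = JSON.parse(writtenText) as RunManifest;
```

Keeping `tokens` optional means a dry-run manifest and a live manifest can share one type, which should make the later move into the validated trace a field-level change.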

Deferred / out of scope

k=N A/B engine, cross-run aggregation, dashboard, CI regression gate → TML-2737.
LLM judge / calibration → TML-2736.
Two-tier scorecard + tokens/correctness schema additions → TML-2720.

Operator-gated boundary (blocker reported, not worked around)

pnpm add @cursor/sdk fails the repo's trustPolicy: no-downgrade guard on a transitive undici@5.29.0 (an earlier version had provenance attestation this one lacks). Admitting a supply-chain-flagged package + adding a trustPolicyExclude entry is an operator decision with repo-wide blast radius, and live execution needs a CURSOR_API_KEY regardless. So the SDK is fully isolated, the dep is not added to the lockfile, and live execution is operator-gated — the harness ships fully functional in dry-run/mock form. Resolution documented in the harness SKILL.md and the slice spec's open question.

Gates (green)

pnpm typecheck (138/138) · pnpm lint:deps · pnpm lint:casts (delta 0) · pnpm lint:skills · pnpm test:scripts (468 tests incl. 34 new) · dry-run verified with no CURSOR_API_KEY.

Summary by CodeRabbit

New Features
- Added a run harness + CLI to execute or dry-run canonical Drive briefs, write per-run manifests, and aggregate token usage.
Tests
- Expanded test coverage for loading briefs, manifests, run behavior (live/dry-run), token accumulation, and transcript validation.
Documentation
- Added harness documentation and a KNOWN-ISSUES note describing an upstream TypeScript declaration problem and suggested workarounds.
Chores
- Updated test scripts to include additional suites and added a dev dependency to support runtime integration.

Scaffold the "Drive — Judge + live-experiment harness" project workspace: two-tier correctness-first scorecard, an LLM judge calibrated against an accreting instrumented-run corpus, and an SDK-spawned k=N A/B harness. Four slices (TML-2720 scorecard+vocabulary, TML-2735 golden-case harness, TML-2736 judge, TML-2737 experiment engine); two foundation slices run in parallel, judge + engine stack on top. Trace.jsonl carries the first natively-instrumented project-started/spec-authored/plan-authored events, emitted via the deterministic emitter merged in PR #633. Signed-off-by: wmadden-electric <286902546+wmadden-electric@users.noreply.github.com>

… spike frameworks Operator steer: keep the implementation minimal. Default to a bespoke LLM-judge + held-out agreement tally; adopt a third-party eval framework (Inspect/Braintrust/promptfoo) only if a time-boxed slice-3 spike shows it reduces net complexity. Run-production harness stays bespoke regardless. Adds spec non-goal + Open Question 6, design-notes alternative, plan slice-3 spike note, and the spike to TML-2736. Trace carries spec-amended/plan-amended. Signed-off-by: wmadden-electric <286902546+wmadden-electric@users.noreply.github.com>

Settle the six project-level open questions into decisions:

- one project (judge + harness kept together; the feed->consume loop is the project)
- judge model cross-family (hard); default GPT 5.5 vs the Claude orchestrator
- per-run token signal from the SDK TurnEndedUpdate.usage
- composed correctness gate (validation gates + QA run + judge intent); CI/merge is real-PR-only since sandboxed runs cannot use CI without an isolated fork
- QA plans pre-written in each golden case acceptance set
- baseline = previous skill version
- bespoke-minimal scorer; slice-3 spike gates any framework on a net-complexity win

Open Questions section now empty; decisions logged in spec + design-notes. Trace carries spec-amended.

Signed-off-by: wmadden-electric <286902546+wmadden-electric@users.noreply.github.com>

…lidation Add a minimal Cursor-SDK harness that spawns one Drive orchestrator run on a golden brief with a pinned model, accumulates per-run token usage from the SDK turn-ended usage, and writes a run manifest. Live execution is gated behind --live + CURSOR_API_KEY; the default dry-run path makes no live call and the SDK is reached only via a dynamically-imported adapter, so typecheck/test/lint and CI stay green with no key and no @cursor/sdk installed. Token totals land in a run manifest beside the trace (transitional home) since the canonical tokens trace field is owned by TML-2720; no schema change here. Also add validate-parser over a >=3-transcript corpus, clearing TML-2728. Wire the new node --test suites into test:scripts and add drive:run-brief / drive:validate-parser shortcuts. Signed-off-by: wmadden-electric <286902546+wmadden-electric@users.noreply.github.com>

Curate 5 canonical Drive briefs with co-located acceptance sets and pre-written drive-qa-plan scripts under projects/drive-judge-harness/assets/golden/, spanning the Drive-shape space: a direct change, a single in-project slice, a small multi-slice project, an I12-halt/re-plan case (brief built on a false storage-capability premise), and a spike-first case (flaky test, unknown cause). Each case ships case.json + brief.md + acceptance.md + manual-qa.md so the Tier-1 correctness signal (validation gates + QA run + judge intent) is deterministic at run time. Add the slice spec/plan, the slice-scoped Drive trace, and parser-validation.md recording per-event confidence over the transcript corpus (clears TML-2728). Signed-off-by: wmadden-electric <286902546+wmadden-electric@users.noreply.github.com>

coderabbitai · 2026-05-30T16:22:55Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 411aa85e-08f4-4808-b0a1-4fa90416ee68

📥 Commits

Reviewing files that changed from the base of the PR and between 393b38b and 717d13e.

📒 Files selected for processing (2)

skills-contrib/drive-judge-harness/run-one-brief.ts
skills-contrib/drive-judge-harness/test/run-one-brief.test.ts

🚧 Files skipped from review as they are similar to previous changes (2)

skills-contrib/drive-judge-harness/test/run-one-brief.test.ts
skills-contrib/drive-judge-harness/run-one-brief.ts

📝 Walkthrough

Adds a drive-judge-harness package providing dry-run/live execution of a canonical Drive brief, token accumulation, run manifests, an SDK adapter to safely interact with @cursor/sdk at runtime, post-hoc transcript validation, tests, fixtures, and documentation; also updates root package.json and pnpm workspace trust config.

Changes

Drive Judge Harness

| Layer / File(s) | Summary |
| --- | --- |
| Token usage types and accumulation: `skills-contrib/drive-judge-harness/usage.ts`, `skills-contrib/drive-judge-harness/test/usage.test.ts` | Defines `TurnUsage` and `TokenTotals`, `emptyTotals()`, and `accumulateUsage()` which coerce missing/non-finite values to zero and compute `totalTokens`. |
| Golden case loading and validation: `skills-contrib/drive-judge-harness/load-brief.ts`, `skills-contrib/drive-judge-harness/test/load-brief.test.ts` | Parses `case.json` metadata with strict field checks, reads `brief.md`, returns `GoldenCase`, and throws descriptive errors for missing/invalid files; tests cover real golden cases and error paths. |
| Run manifest schema and persistence: `skills-contrib/drive-judge-harness/manifest.ts`, `skills-contrib/drive-judge-harness/test/manifest.test.ts` | Defines `RunManifest`/`RunStatus`, supports optional `tokens`, and implements `writeManifest()` that pretty-prints JSON with a trailing newline and creates parent dirs; tests verify round-trip, newline, nested dirs, and tokens. |
| Cursor SDK adapter and type guards: `skills-contrib/drive-judge-harness/sdk-adapter.ts` | Runtime-only adapter with structural guards to extract usage/text, normalize streamed messages and outcomes, wrap SDK run with `stream()`/`wait()`, and export `createCursorAgent()` for live execution while avoiding broken SDK declarations at typecheck time. |
| Main brief orchestration and CLI: `skills-contrib/drive-judge-harness/run-one-brief.ts`, `skills-contrib/drive-judge-harness/test/run-one-brief.test.ts` | Implements `runOneBrief` and a small CLI: dry-run vs live gating (`--live` + CURSOR_API_KEY), lazy SDK import, event streaming to accumulate tokens, startup-failure handling, manifest writing with final status/tokens, and exit codes; tests cover dry-run gate, live-path with mocked runs, failures, and prompt assembly. |
| Post-hoc transcript validation: `skills-contrib/drive-judge-harness/validate-parser.ts`, `skills-contrib/drive-judge-harness/test/validate-parser.test.ts` | Adds `validateFixtures()` to run the post-hoc parser over transcript JSONL fixtures, aggregate reconstructed events by type/confidence, render a Markdown report, and expose a CLI; includes tests validating counts and rendered output. |
| Documentation, fixtures, and workspace config: `skills-contrib/drive-judge-harness/SKILL.md`, `skills-contrib/drive-judge-harness/KNOWN-ISSUES.md`, `skills-contrib/drive-judge-harness/test/fixtures/transcripts/*`, `package.json`, `pnpm-workspace.yaml` | Adds package docs and KNOWN-ISSUES noting upstream SDK d.ts problems, updates/adds transcript fixtures, extends root test script list, adds `@cursor/sdk` to devDependencies, and whitelists `undici@5.29.0` in the pnpm trust policy. |
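
To make the "strict field checks" row concrete, here is a hedged sketch of validating `case.json` metadata with descriptive errors. The `slug` and `shape` field names and the `parseCaseMetadata` helper are hypothetical; the actual fields checked by `load-brief.ts` are not shown in this thread. Unknown extra fields are ignored in this sketch.

```typescript
// Hypothetical metadata shape; the real case.json may carry different fields.
interface CaseMetadata {
  slug: string;
  shape: string;
}

// Parse and validate metadata, naming the source file in every error.
function parseCaseMetadata(json: string, source: string): CaseMetadata {
  let raw: unknown;
  try {
    raw = JSON.parse(json);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(`${source}: invalid JSON (${reason})`);
  }
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error(`${source}: expected a JSON object`);
  }
  const record = raw as Record<string, unknown>;
  const read = (field: string): string => {
    const value = record[field];
    if (typeof value !== "string" || value.trim() === "") {
      throw new Error(`${source}: field "${field}" must be a non-empty string`);
    }
    return value;
  };
  // Extra fields on the record are tolerated and simply not returned.
  return { slug: read("slug"), shape: read("shape") };
}

// Usage with an illustrative payload.
const exampleMeta = parseCaseMetadata(
  '{"slug":"spike-first-flaky-test","shape":"spike-first","extra":true}',
  "case.json",
);
```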

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

prisma/prisma-next#633: The root test:scripts expansion includes the deterministic trace emitter test referenced by this change.
prisma/prisma-next#508: Also modified root package.json test script to include additional node --test suites.

Suggested reviewers

aqrln

Poem

🐰 A harness built to judge and test,
Dry runs first, then live's the quest,
Tokens counted, manifests saved,
Transcripts parsed, reports engraved,
Briefs judged clean — the rabbit’s impressed!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly summarizes the main changes: addition of a golden-case library and run-one-brief harness for the Drive-Judge project. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

github-actions · 2026-05-30T16:24:55Z

size-limit report 📦

| Path | Size |
| --- | --- |
| postgres / no-emit | 135.35 KB (0%) |
| postgres / emit | 125.16 KB (0%) |
| mongo / no-emit | 73.9 KB (0%) |
| mongo / emit | 68.89 KB (0%) |

pkg-pr-new · 2026-05-30T16:25:37Z

@prisma-next/extension-author-tools

npm i https://pkg.pr.new/@prisma-next/extension-author-tools@641

@prisma-next/mongo-runtime

npm i https://pkg.pr.new/@prisma-next/mongo-runtime@641

@prisma-next/family-mongo

npm i https://pkg.pr.new/@prisma-next/family-mongo@641

@prisma-next/sql-runtime

npm i https://pkg.pr.new/@prisma-next/sql-runtime@641

@prisma-next/family-sql

npm i https://pkg.pr.new/@prisma-next/family-sql@641

@prisma-next/extension-arktype-json

npm i https://pkg.pr.new/@prisma-next/extension-arktype-json@641

@prisma-next/extension-cipherstash

npm i https://pkg.pr.new/@prisma-next/extension-cipherstash@641

@prisma-next/middleware-cache

npm i https://pkg.pr.new/@prisma-next/middleware-cache@641

@prisma-next/mongo

npm i https://pkg.pr.new/@prisma-next/mongo@641

@prisma-next/extension-paradedb

npm i https://pkg.pr.new/@prisma-next/extension-paradedb@641

@prisma-next/extension-pgvector

npm i https://pkg.pr.new/@prisma-next/extension-pgvector@641

@prisma-next/extension-postgis

npm i https://pkg.pr.new/@prisma-next/extension-postgis@641

@prisma-next/postgres

npm i https://pkg.pr.new/@prisma-next/postgres@641

@prisma-next/sql-orm-client

npm i https://pkg.pr.new/@prisma-next/sql-orm-client@641

@prisma-next/sqlite

npm i https://pkg.pr.new/@prisma-next/sqlite@641

@prisma-next/target-mongo

npm i https://pkg.pr.new/@prisma-next/target-mongo@641

@prisma-next/adapter-mongo

npm i https://pkg.pr.new/@prisma-next/adapter-mongo@641

@prisma-next/driver-mongo

npm i https://pkg.pr.new/@prisma-next/driver-mongo@641

@prisma-next/contract

npm i https://pkg.pr.new/@prisma-next/contract@641

@prisma-next/utils

npm i https://pkg.pr.new/@prisma-next/utils@641

@prisma-next/config

npm i https://pkg.pr.new/@prisma-next/config@641

@prisma-next/errors

npm i https://pkg.pr.new/@prisma-next/errors@641

@prisma-next/framework-components

npm i https://pkg.pr.new/@prisma-next/framework-components@641

@prisma-next/operations

npm i https://pkg.pr.new/@prisma-next/operations@641

@prisma-next/ts-render

npm i https://pkg.pr.new/@prisma-next/ts-render@641

@prisma-next/contract-authoring

npm i https://pkg.pr.new/@prisma-next/contract-authoring@641

@prisma-next/ids

npm i https://pkg.pr.new/@prisma-next/ids@641

@prisma-next/psl-parser

npm i https://pkg.pr.new/@prisma-next/psl-parser@641

@prisma-next/psl-printer

npm i https://pkg.pr.new/@prisma-next/psl-printer@641

@prisma-next/cli

npm i https://pkg.pr.new/@prisma-next/cli@641

@prisma-next/cli-telemetry

npm i https://pkg.pr.new/@prisma-next/cli-telemetry@641

@prisma-next/emitter

npm i https://pkg.pr.new/@prisma-next/emitter@641

@prisma-next/migration-tools

npm i https://pkg.pr.new/@prisma-next/migration-tools@641

prisma-next

npm i https://pkg.pr.new/prisma-next@641

@prisma-next/vite-plugin-contract-emit

npm i https://pkg.pr.new/@prisma-next/vite-plugin-contract-emit@641

@prisma-next/mongo-codec

npm i https://pkg.pr.new/@prisma-next/mongo-codec@641

@prisma-next/mongo-contract

npm i https://pkg.pr.new/@prisma-next/mongo-contract@641

@prisma-next/mongo-value

npm i https://pkg.pr.new/@prisma-next/mongo-value@641

@prisma-next/mongo-contract-psl

npm i https://pkg.pr.new/@prisma-next/mongo-contract-psl@641

@prisma-next/mongo-contract-ts

npm i https://pkg.pr.new/@prisma-next/mongo-contract-ts@641

@prisma-next/mongo-emitter

npm i https://pkg.pr.new/@prisma-next/mongo-emitter@641

@prisma-next/mongo-schema-ir

npm i https://pkg.pr.new/@prisma-next/mongo-schema-ir@641

@prisma-next/mongo-query-ast

npm i https://pkg.pr.new/@prisma-next/mongo-query-ast@641

@prisma-next/mongo-orm

npm i https://pkg.pr.new/@prisma-next/mongo-orm@641

@prisma-next/mongo-query-builder

npm i https://pkg.pr.new/@prisma-next/mongo-query-builder@641

@prisma-next/mongo-lowering

npm i https://pkg.pr.new/@prisma-next/mongo-lowering@641

@prisma-next/mongo-wire

npm i https://pkg.pr.new/@prisma-next/mongo-wire@641

@prisma-next/sql-contract

npm i https://pkg.pr.new/@prisma-next/sql-contract@641

@prisma-next/sql-errors

npm i https://pkg.pr.new/@prisma-next/sql-errors@641

@prisma-next/sql-operations

npm i https://pkg.pr.new/@prisma-next/sql-operations@641

@prisma-next/sql-schema-ir

npm i https://pkg.pr.new/@prisma-next/sql-schema-ir@641

@prisma-next/sql-contract-psl

npm i https://pkg.pr.new/@prisma-next/sql-contract-psl@641

@prisma-next/sql-contract-ts

npm i https://pkg.pr.new/@prisma-next/sql-contract-ts@641

@prisma-next/sql-contract-emitter

npm i https://pkg.pr.new/@prisma-next/sql-contract-emitter@641

@prisma-next/sql-lane-query-builder

npm i https://pkg.pr.new/@prisma-next/sql-lane-query-builder@641

@prisma-next/sql-relational-core

npm i https://pkg.pr.new/@prisma-next/sql-relational-core@641

@prisma-next/sql-builder

npm i https://pkg.pr.new/@prisma-next/sql-builder@641

@prisma-next/target-postgres

npm i https://pkg.pr.new/@prisma-next/target-postgres@641

@prisma-next/target-sqlite

npm i https://pkg.pr.new/@prisma-next/target-sqlite@641

@prisma-next/adapter-postgres

npm i https://pkg.pr.new/@prisma-next/adapter-postgres@641

@prisma-next/adapter-sqlite

npm i https://pkg.pr.new/@prisma-next/adapter-sqlite@641

@prisma-next/driver-postgres

npm i https://pkg.pr.new/@prisma-next/driver-postgres@641

@prisma-next/driver-sqlite

npm i https://pkg.pr.new/@prisma-next/driver-sqlite@641

commit: 717d13e

…k runtime Install @cursor/sdk as a dev dependency and import `Agent` as a runtime value in sdk-adapter.ts, replacing the previous hand-rolled structural mirror of the SDK. @cursor/sdk@1.0.15 ships .d.ts that re-export from unpublished @anysphere/* packages, so the SDK's own types (incl. TurnEndedUpdate, the token-usage carrier) are unresolvable. The adapter therefore uses the real runtime API and reads only the fields it consumes (usage, outcome) through runtime guards over `unknown` — no fabricated type surface. KNOWN-ISSUES.md documents the upstream bug and the swap-in path once resolvable types ship. Installing the SDK trips the repo's `trustPolicy: no-downgrade` on a transitive undici@5.29.0 (dropped provenance attestation on a newer publish); admitted via the documented trustPolicyExclude hatch (dev-only). The live path stays gated on --live + CURSOR_API_KEY; sdk-adapter is loaded lazily by run-one-brief, so tests/dry-run never require the key or the SDK. Signed-off-by: wmadden-electric <286902546+wmadden-electric@users.noreply.github.com>

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

skills-contrib/drive-judge-harness/SKILL.md (1)
20-20: ⚡ Quick win

Consider removing or generalizing milestone ticket references.

Line 20 (and lines 36, 94) reference specific TML-* ticket numbers (TML-2736, TML-2737, TML-2728, TML-2720). These milestone identifiers may become stale as the project evolves. Consider either removing them or replacing with more durable references (e.g., links to stable architecture docs or feature descriptions that age well).

Based on learnings: Do not reference transient project artifacts from durable system documentation; this pattern applies to milestone references that evolve over time.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills-contrib/drive-judge-harness/SKILL.md` at line 20, The SKILL.md text
includes transient milestone ticket references (TML-2736, TML-2737, TML-2728,
TML-2720); update the document to remove or generalize these identifiers by
either deleting the TML-* tokens or replacing them with durable references
(e.g., links to stable architecture docs, RFCs, or descriptive phrases like
"judge calibration milestone" or "experiment engine feature") so the content
ages well; search for the TML-* tokens in SKILL.md (lines that mention "judge",
"experiment engine", or specific milestone names) and apply the replacement.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@skills-contrib/drive-judge-harness/run-one-brief.ts`:
- Around line 142-148: The loop over run.stream() and the subsequent await
run.wait() can throw and currently prevent writing the manifest; wrap the entire
"for await (const event of run.stream()) { ... }" plus "const outcome = await
run.wait()" sequence in a try/catch (or try/catch/finally) inside the same
function that calls createAgent (referencing run.stream(), run.wait(),
usageUpdates and TurnUsage), and in the catch record an outcome/status of
'error' and capture error.message into notes; in the finally ensure the manifest
is always written using the accumulated usageUpdates (summarize tokens from
usageUpdates for the manifest tokens field) so that even on runtime failures the
harness writes a manifest with status:'error' and the collected token usage.

In `@skills-contrib/drive-judge-harness/SKILL.md`:
- Around line 17-21: The package README (SKILL.md) currently hardcodes a
transient project path for the "golden-case library"; remove the explicit
projects/... path reference and either (a) replace it with a brief, generic
description of the golden-case library and its purpose/structure (e.g., "the
golden-case library contains briefs, acceptance sets, and QA plans used by the
harness"), or (b) point readers to a stable architecture or docs/ link if one
exists, or (c) if these cases are intended to be permanent, move the assets to a
durable package/location and update SKILL.md to reference that durable location;
update the sentence that mentions the "golden-case library" to reflect one of
these options.

In `@skills-contrib/drive-judge-harness/test/validate-parser.test.ts`:
- Line 5: The import statement bringing renderMarkdown and validateFixtures from
'../validate-parser.ts' includes a .ts extension which violates the TypeScript
import style; update the import to reference the module without the extension
(import { renderMarkdown, validateFixtures } from '../validate-parser') so the
symbols renderMarkdown and validateFixtures are imported from the extension-less
path.

In `@skills-contrib/drive-judge-harness/validate-parser.ts`:
- Line 3: The import statement currently includes a `.ts` file extension which
violates project TypeScript import conventions; remove the extension from the
import so it reads from '../drive-diagnose-run/posthoc' and keep the named
imports (Confidence, parseTranscript) intact to avoid changing symbol references
used elsewhere in validate-parser.ts.

---

Nitpick comments:
In `@skills-contrib/drive-judge-harness/SKILL.md`:
- Line 20: The SKILL.md text includes transient milestone ticket references
(TML-2736, TML-2737, TML-2728, TML-2720); update the document to remove or
generalize these identifiers by either deleting the TML-* tokens or replacing
them with durable references (e.g., links to stable architecture docs, RFCs, or
descriptive phrases like "judge calibration milestone" or "experiment engine
feature") so the content ages well; search for the TML-* tokens in SKILL.md
(lines that mention "judge", "experiment engine", or specific milestone names)
and apply the replacement.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 0c497b72-0cfe-4daf-8f29-12a8b8f1acbf

📥 Commits

Reviewing files that changed from the base of the PR and between a91c750 and f7b2c12.

⛔ Files ignored due to path filters (30)

pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
projects/drive-judge-harness/assets/golden/README.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/direct-change-diagnostic-wording/acceptance.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/direct-change-diagnostic-wording/brief.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/direct-change-diagnostic-wording/case.json is excluded by !projects/**
projects/drive-judge-harness/assets/golden/direct-change-diagnostic-wording/manual-qa.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/i12-halt-storage-assumption/acceptance.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/i12-halt-storage-assumption/brief.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/i12-halt-storage-assumption/case.json is excluded by !projects/**
projects/drive-judge-harness/assets/golden/i12-halt-storage-assumption/manual-qa.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/project-retry-policy/acceptance.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/project-retry-policy/brief.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/project-retry-policy/case.json is excluded by !projects/**
projects/drive-judge-harness/assets/golden/project-retry-policy/manual-qa.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/slice-cli-list-flag/acceptance.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/slice-cli-list-flag/brief.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/slice-cli-list-flag/case.json is excluded by !projects/**
projects/drive-judge-harness/assets/golden/slice-cli-list-flag/manual-qa.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/spike-first-flaky-test/acceptance.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/spike-first-flaky-test/brief.md is excluded by !projects/**
projects/drive-judge-harness/assets/golden/spike-first-flaky-test/case.json is excluded by !projects/**
projects/drive-judge-harness/assets/golden/spike-first-flaky-test/manual-qa.md is excluded by !projects/**
projects/drive-judge-harness/design-notes.md is excluded by !projects/**
projects/drive-judge-harness/plan.md is excluded by !projects/**
projects/drive-judge-harness/slices/golden-case-harness/parser-validation.md is excluded by !projects/**
projects/drive-judge-harness/slices/golden-case-harness/plan.md is excluded by !projects/**
projects/drive-judge-harness/slices/golden-case-harness/spec.md is excluded by !projects/**
projects/drive-judge-harness/slices/golden-case-harness/trace.jsonl is excluded by !projects/**
projects/drive-judge-harness/spec.md is excluded by !projects/**
projects/drive-judge-harness/trace.jsonl is excluded by !projects/**

📒 Files selected for processing (18)

package.json
pnpm-workspace.yaml
skills-contrib/drive-judge-harness/KNOWN-ISSUES.md
skills-contrib/drive-judge-harness/SKILL.md
skills-contrib/drive-judge-harness/load-brief.ts
skills-contrib/drive-judge-harness/manifest.ts
skills-contrib/drive-judge-harness/run-one-brief.ts
skills-contrib/drive-judge-harness/sdk-adapter.ts
skills-contrib/drive-judge-harness/test/fixtures/transcripts/direct-change-diagnostic-wording.transcript.jsonl
skills-contrib/drive-judge-harness/test/fixtures/transcripts/project-retry-policy.transcript.jsonl
skills-contrib/drive-judge-harness/test/fixtures/transcripts/slice-cli-list-flag.transcript.jsonl
skills-contrib/drive-judge-harness/test/load-brief.test.ts
skills-contrib/drive-judge-harness/test/manifest.test.ts
skills-contrib/drive-judge-harness/test/run-one-brief.test.ts
skills-contrib/drive-judge-harness/test/usage.test.ts
skills-contrib/drive-judge-harness/test/validate-parser.test.ts
skills-contrib/drive-judge-harness/usage.ts
skills-contrib/drive-judge-harness/validate-parser.ts

…harness Signed-off-by: wmadden-electric <286902546+wmadden-electric@users.noreply.github.com> # Conflicts: # package.json

…istory

Replace the three synthetic normal-shape golden cases with cases drawn from real merged PRs, so the corpus measures Drive runs against work the team actually shipped rather than synthesised tasks:

- direct-change-example-emit-outputpath (TML-2722 / #618)
- slice-dedupe-generated-imports (TML-2714 / #614)
- project-reap-subsumed-ir-surfaces (TML-2727 / #630, #631, #629) — a three-slice parallel fan-out that exercises planner parallelisation and scope discipline.

Each real case carries the task as posed (Linear ticket, solution-scrubbed so the run still does the design/planning), a base_sha to run against, and a reference.md describing the known-good output by commit SHA (the output itself is fetchable via git diff <base_sha> <merge_sha>). case.json gains source + base_sha; the loader ignores the extra fields until the experiment-engine slice wires base_sha into a checkout.

The two pathological cases (i12-halt, spike-first) stay synthetic: no clean merged PR exhibits a halted or spiked run. Update harness tests, SKILL.md examples, and the corpus README for the renamed slugs. validate-parser fixtures are left as-is — they are synthetic parser fixtures with tuned event counts, not corpus members.

Signed-off-by: wmadden-electric <286902546+wmadden-electric@users.noreply.github.com>

…tream throws Only createAgent was wrapped; a throw from run.stream() iteration or run.wait() escaped runOneBrief and, since main() runs as void main(), surfaced as an unhandled rejection with no manifest written — losing the accumulated token signal. Wrap the stream+wait in try/catch and always write an error manifest carrying the usage gathered so far plus the error message in notes. Signed-off-by: wmadden-electric <286902546+wmadden-electric@users.noreply.github.com>
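
The shape of that fix can be sketched as below. This is an illustration under assumptions, not the code in `run-one-brief.ts`: the `AgentRun`, `RunEvent`, and `Manifest` interfaces and the `collectRun` name are hypothetical, and only `stream()`, `wait()`, the `error` status, and the "message goes into notes" behaviour are taken from the commit message and review comment.

```typescript
// Hypothetical minimal interfaces standing in for the SDK run and the manifest.
interface TurnUsage { inputTokens?: number; outputTokens?: number }
interface RunEvent { usage?: TurnUsage }
interface AgentRun {
  stream(): AsyncIterable<RunEvent>;
  wait(): Promise<{ status: string }>;
}
interface Manifest { status: string; inputTokens: number; outputTokens: number; notes?: string }

// Stream and wait inside one try/catch, then always write a manifest that
// carries whatever usage was gathered before a failure.
async function collectRun(run: AgentRun, write: (m: Manifest) => void): Promise<Manifest> {
  const usageUpdates: TurnUsage[] = [];
  let status = "error";
  let notes: string | undefined;
  try {
    for await (const event of run.stream()) {
      if (event.usage) usageUpdates.push(event.usage);
    }
    status = (await run.wait()).status;
  } catch (err) {
    notes = err instanceof Error ? err.message : String(err);
  }
  const manifest: Manifest = {
    status,
    inputTokens: usageUpdates.reduce((sum, u) => sum + (u.inputTokens ?? 0), 0),
    outputTokens: usageUpdates.reduce((sum, u) => sum + (u.outputTokens ?? 0), 0),
    ...(notes === undefined ? {} : { notes }),
  };
  write(manifest);
  return manifest;
}

// A mock run that reports one turn of usage and then fails mid-stream.
const failingRun: AgentRun = {
  async *stream() {
    yield { usage: { inputTokens: 10, outputTokens: 2 } };
    throw new Error("stream dropped");
  },
  async wait() {
    return { status: "finished" };
  },
};
```

With `failingRun`, the manifest handed to `write` should have status `error`, the 10/2 token counts from the first turn, and `stream dropped` in `notes`, so the token signal survives the failure.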

…harness

…#654)

## Linked issue

Refs [TML-2736](https://linear.app/prisma-company/issue/TML-2736). Third slice of the **Drive — Judge + live-experiment harness** project; builds on the two-tier scorecard ([#640](#640)) and the golden-case harness ([#641](#641)).

## At a glance

The judge grades one Drive run and emits the `intent` correctness signal the scorecard already reads. Two invariants do the load-bearing work. A malformed model response is never a silent pass:

```ts
const validated = RubricResponse(parsed);
if (validated instanceof type.errors) {
  return { intent: null, reasons: [`malformed model output: ${validated.summary}`] };
}
```

…and the emission preserves any gate-recorded `mechanical`/`qa` rather than clobbering it, because the scorecard is last-write-wins on the whole triple:

```ts
export function mergedCorrectnessPayload(events, projectRunId, intent) {
  const prior = latestCorrectness(events, projectRunId);
  return { mechanical: prior?.mechanical ?? null, qa: prior?.qa ?? null, intent };
}
```

Before this slice, the scorecard's `intent` component was always `null` → every run was `not-computable`. This slice makes it producible.

## Summary

This PR ships a **bespoke-minimal LLM judge** under `skills-contrib/drive-judge-harness/judge/`. It carries two substantive pieces:

1. **The judge itself** — grades a completed Drive run (the produced diff + the run's trace) against a golden case's `acceptance.md`, through a cross-family judge model, and emits the `intent` correctness component. Three prompt sets: a requirements+intent rubric, a failure-mode classifier, and an operator-turn classifier.
2. **The recorded decision to build it bespoke** — a time-boxed spike compared Inspect / Braintrust / promptfoo and confirmed bespoke-minimal. The rationale lands in the project `spec.md` and `design-notes.md`; promptfoo is the recorded escape hatch.

The judge model is **injected** everywhere, so the whole subtree typechecks, tests, and lints with **no `CURSOR_API_KEY`** and **`@cursor/sdk` absent** — tests pass a mock. The live adapter is reached only behind the same `--live` + key gate as the harness.

## How it fits together

1. **The model boundary** (`judge/judge-model.ts`) — a one-method `JudgeModel` interface (`grade(prompt) => Promise<string>`). Everything downstream takes it as a dependency; tests inject a mock and never make a real call.
2. **The live adapter** (`judge/judge-model-sdk.ts`) — pins a cross-family judge id (default `gpt-5.5`) and **rejects a same-family judge id at construction** (a Claude judge grading a Claude orchestrator throws before any SDK code runs). The `@cursor/sdk` import is lazy, so module load stays green without the package.
3. **The three prompt sets** (`rubric-correctness.ts`, `classify-failure.ts`, `classify-operator.ts`) — each renders a prompt, calls the model, and parses an arktype-validated verdict. The operator-turn classifier uses the measurement model's five canonical buckets (`docs/drive/measurement-model.md`): legitimate-design, legitimate-authorisation, illegitimate-asked, illegitimate-correction, illegitimate-rescue.
4. **The merge-preserving emission** (`judge/emit-correctness.ts`) — folds the rubric's `intent` into the run's latest recorded `{mechanical, qa}` and emits one `correctness-recorded` event through the deterministic emitter. The slice-1 scorecard composes it; no scorecard or schema edits.
5. **The calibration harness** (`judge/calibration.ts` + `judge/calibration/labels.md`) — a judge-vs-human agreement tally with a ≥0.80 gate. The machinery lands; the calibration *run* is parked (see Reviewer notes).

## Reviewer notes

- **The calibration run is deliberately parked, not forgotten.** Calibration needs ~10–20 instrumented runs, and corpus generation is real-dollar spend the operator is holding. So this slice ships the gate machinery and an honest "uncalibrated" status; the project-DoD calibration item stays unchecked. `SKILL.md` and `calibration/labels.md` both record the deferral and the operator-spend gate.
- **The merge rule is the subtle part.** `computeScorecard` is last-write-wins on the whole `{mechanical, qa, intent}` triple — it does not merge components. A naive judge emitting `{mechanical:null, qa:null, intent:pass}` would erase a gate's recorded pass. `emit-correctness.ts` reads-merges-emits so that can't happen; the end-to-end test asserts a prior `mechanical:pass` survives.
- **One unplanned helper.** `judge/parse-json.ts` lifts a JSON object out of a model response (bare / fenced / embedded) — factored out so the malformed-→null path lives in one place rather than three copies.
- **The planning commit rides along.** The first commit scaffolds the slice (spec, plan, trace) and records the spike; the second is the implementation. They're one reviewable unit.

## Testing performed

- `node --test` over the six new judge suites — **43 cases, all green**, run with `CURSOR_API_KEY` unset.
- `pnpm typecheck` — clean.
- `pnpm lint:deps` — no dependency violations.
- `pnpm lint:casts` — `delta=0` (no new bare casts).
- `pnpm test:scripts` — 545 cases green (nothing else regressed).

## Skill update

`skills-contrib/drive-judge-harness/SKILL.md` documents the judge, the cross-family requirement, the `correctness-recorded` merge rule, the fail-to-null invariant, and the parked calibration.

## Checklist

- [x] DCO sign-off on every commit
- [x] Tests written first and passing
- [x] Title follows the `TML-NNNN:` convention
- [x] No new bare casts (`lint:casts` delta 0)

## Alternatives considered

- **Adopt an off-the-shelf eval framework (Inspect / Braintrust / promptfoo).** Confirmed-rejected by the spike. They grade `(input → model output)`; our unit is a whole Drive run scored from trace + diff + golden acceptance set. A framework can host the tiny grading call but not the integration with our trace/scorecard/golden assets — that glue is bespoke either way. promptfoo (TS, MIT, local) is recorded as the escape hatch if the bespoke scorer grows hairy.
- **Emit the `intent` component on its own.** Rejected — it would clobber the gate-recorded `mechanical`/`qa` under last-write-wins. Hence the merge-preserving helper.
- **An `other` operator-turn bucket + non-null fallback.** Rejected — the measurement model defines exactly five buckets; a malformed response yields `bucket: null` (same fail-to-null discipline as the rubric) rather than an off-doc catch-all.
- **Run the calibration now.** Rejected — corpus generation is held on cost. The judge ships uncalibrated-but-honest; the gate is computable the moment the corpus exists.

## Summary by CodeRabbit

# Release Notes

* **New Features**
  * Introduced LLM-based judge system for Drive orchestrator evaluation with failure mode classification, operator turn assessment, and correctness rubric grading
  * Implemented cross-family model constraint enforcement between judge and orchestrator
  * Added calibration framework for judge accuracy validation with agreement-rate metrics
* **Documentation**
  * Expanded judge harness documentation with detailed module descriptions and key invariants
  * Added calibration corpus specification and workflow guidance
* **Tests**
  * Added comprehensive test coverage for judge components, classifiers, and calibration logic

---------

Signed-off-by: Will Madden <madden@prisma.io>
Co-authored-by: Will Madden <madden@prisma.io>
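The "bare / fenced / embedded" extraction attributed to `judge/parse-json.ts`, combined with the fail-to-null invariant, could be sketched like this. The function names `extractJsonObject` and `gradeFromResponse`, the `"fail"` verdict value, and the extraction order are assumptions for illustration, and the sketch uses a plain check where the PR describes arktype validation.

```typescript
// Built at runtime so this example can itself live inside a fenced block.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Try the response as bare JSON, then as a fenced block, then as the
// outermost {...} span embedded in prose. Returns null if none parse.
function extractJsonObject(response: string): Record<string, unknown> | null {
  const candidates: string[] = [response.trim()];
  const fenced = FENCED_BLOCK.exec(response);
  if (fenced !== null && fenced[1] !== undefined) candidates.push(fenced[1].trim());
  const first = response.indexOf("{");
  const last = response.lastIndexOf("}");
  if (first !== -1 && last > first) candidates.push(response.slice(first, last + 1));

  for (const candidate of candidates) {
    try {
      const parsed: unknown = JSON.parse(candidate);
      if (isRecord(parsed)) return parsed;
    } catch {
      // Not valid JSON in this form; fall through to the next candidate.
    }
  }
  return null;
}

type Intent = "pass" | "fail"; // "fail" is an assumed counterpart to the PR's `intent:pass`

// Fail-to-null: anything that is not a recognisable verdict yields intent: null,
// never a silent pass.
function gradeFromResponse(response: string): { intent: Intent | null; reasons: string[] } {
  const intent = extractJsonObject(response)?.["intent"];
  if (intent === "pass" || intent === "fail") return { intent, reasons: [] };
  return { intent: null, reasons: ["malformed model output"] };
}
```

Keeping the extraction in one helper means each of the three prompt sets can share the same malformed-to-null path instead of reimplementing it.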

wmadden-electric added 5 commits May 30, 2026 16:14

wmadden requested a review from a team as a code owner May 30, 2026 16:22

coderabbitai Bot reviewed May 30, 2026

Comment thread skills-contrib/drive-judge-harness/run-one-brief.ts Outdated

Comment thread skills-contrib/drive-judge-harness/SKILL.md

Comment thread skills-contrib/drive-judge-harness/test/validate-parser.test.ts

Comment thread skills-contrib/drive-judge-harness/validate-parser.ts

wmadden-electric and others added 5 commits May 30, 2026 19:29

Merge remote-tracking branch 'origin/main' into tml-2735-golden-case-…

5cb59c2

…harness

Signed-off-by: wmadden-electric <286902546+wmadden-electric@users.noreply.github.com>

# Conflicts:
#   package.json

Merge remote-tracking branch 'origin/main' into tml-2735-golden-case-…

717d13e

…harness

Merge branch 'main' into tml-2735-golden-case-harness

1ba83a5

wmadden merged commit ce41c26 into main May 30, 2026
10 checks passed

wmadden deleted the tml-2735-golden-case-harness branch May 30, 2026 19:20

wmadden-electric mentioned this pull request May 31, 2026

TML-2736: bespoke LLM judge — intent correctness signal + classifiers #654

Merged

4 tasks

This was referenced May 31, 2026

TML-2755: Pin the skill bundle under test (run-setup: prepare / collect / run-arm) #656

Merged

TML-2759: run the harness on the Claude Agent SDK by default (Cursor decoupled) + faithful run recording #657

Open

wmadden merged 11 commits into main from tml-2735-golden-case-harness

wmadden commented May 30, 2026 (edited by coderabbitai Bot)

coderabbitai Bot commented May 30, 2026 (edited)

❌ Failed checks (1 warning)

github-actions Bot commented May 30, 2026 (edited)

size-limit report 📦

pkg-pr-new Bot commented May 30, 2026 (edited)

2 participants
