TML-2759: run the harness on the Claude Agent SDK by default (Cursor decoupled) + faithful run recording by wmadden-electric · Pull Request #657 · prisma/prisma-next

wmadden-electric · 2026-05-31T16:02:36Z

Linked issue

Refs TML-2759 (decouple the runtime) and TML-2757 (faithful run recording). Surfaced by the first live run-arm (run-setup slice, TML-2755, #656); unblocks the experiment engine (TML-2737).

At a glance

The first live harness run proved the pipeline but recorded a blank, polluted run on @cursor/sdk: no tokens (its local runtime emits none), agent_id: null, and 5 stray base traces. This PR makes runs faithful and decoupled from Cursor: the harness now runs the Drive orchestrator on the Claude Agent SDK by default, which reports tokens, USD cost, wall-clock, and turns natively.

{ "runtime": "claude", "model": "claude-haiku-4-5", "status": "finished",
  "run_id": "<session-id>",
  "tokens": { "inputTokens": 33, "outputTokens": 904, "cacheReadTokens": 230827, "cacheWriteTokens": 53995, "totalTokens": 285759 },
  "cost_usd": 0.1839242, "num_turns": 9, "wall_clock_ms": 16025,
  "collected_trace_paths": [ /* only the trace this run emitted */ ] }

--runtime cursor still works for spot-checking the Cursor substrate; it records tokens: null (the documented local gap) and relies on wall_clock_ms.

Decision

Two entangled pieces, both about recording a run faithfully:

Decouple the runtime from Cursor; default to the Claude Agent SDK (TML-2759). The Cursor coupling lived behind a seam (CreateAgent / OrchestratorRun / RunOutcome) with one adapter. This adds a second adapter over Anthropic's @anthropic-ai/claude-agent-sdk and makes it the default, because: it reports per-run usage + total_cost_usd + duration_ms + num_turns on its result message (the signal Cursor's local runtime never gave us); it's the native home of the SKILL.md + subagent conventions the drive-* skills use (our skill-injection already materializes .claude/skills/, which is exactly its discovery model); and maxBudgetUsd gives a hard per-run dollar cap. Cursor stays as a selectable secondary.
Fix the three run-fidelity defects the first run exposed (TML-2757). Capture agent_id from the stream status message (not the cloud-shaped outcome); capture wall_clock_ms; scope collect-run to traces emitted during the run. Tokens stay null for the cursor runtime with a documented gap (spike 2026-05-31).

How it fits together

Lift the pure mappers out of the SDK-importing modules. Each runtime gets a no-SDK mapper module — sdk-events.ts (Cursor) and claude-events.ts (Claude) — so extraction logic is unit-testable with neither SDK installed, and each *-adapter.ts stays the sole importer of its SDK behind a lazy import. The live gate (only --live + the runtime's key reaches an SDK) is preserved.
Capture the run signal at the seam. RunOutcome carries tokens / costUsd / numTurns / durationMs / agentId; run-one-brief prefers outcome.tokens (claude) and falls back to per-turn accumulation (cursor). RunManifest gains runtime, cost_usd, num_turns, wall_clock_ms.
Select the runtime. --runtime <claude|cursor> (default claude) picks the adapter via a lazy import; the live gate keys off the runtime's env var (ANTHROPIC_API_KEY / CURSOR_API_KEY); --max-budget-usd threads to the claude adapter.
Scope trace collection deterministically. prepare-run snapshots the .jsonl files present after its baseline commit; collect-run excludes that set. This is robust to the gitignored wip/drive-trace/ location the real trace landed in (a git-diff approach would miss it).
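
The snapshot-exclusion step described above can be sketched as a pure set difference. This is a minimal illustration rather than the harness's code: `snapshotTraces` and `runEmittedTraces` are hypothetical names, the sample paths are made up, and the real implementation discovers files on disk (via findJsonlFiles) instead of taking path lists as arguments.

```typescript
// Hypothetical sketch of snapshot-exclusion trace scoping.
// Paths are plain strings here; the harness discovers them on disk.

/** Keep only the .jsonl paths from a directory listing. */
function snapshotTraces(paths: readonly string[]): string[] {
  return paths.filter((p) => p.endsWith(".jsonl"));
}

/** Return traces found after the run that were not in the baseline snapshot. */
function runEmittedTraces(
  foundAfterRun: readonly string[],
  preexisting: readonly string[],
): string[] {
  const baseline = new Set(preexisting);
  return snapshotTraces(foundAfterRun).filter((p) => !baseline.has(p));
}

// Baseline snapshot, taken right after the baseline commit (illustrative paths).
const preexisting = snapshotTraces([
  "fixtures/base-a.jsonl",
  "fixtures/base-b.jsonl",
  "README.md",
]);

// Listing after the agent run. The new trace sits in a gitignored directory,
// so a git-diff approach would not see it, but the set difference does.
const afterRun = [
  "fixtures/base-a.jsonl",
  "fixtures/base-b.jsonl",
  "wip/drive-trace/run-1.jsonl",
  "README.md",
];

// Only "wip/drive-trace/run-1.jsonl" survives the exclusion.
const emitted = runEmittedTraces(afterRun, preexisting);
```

Because the baseline is captured once and compared by path, the result does not depend on file timestamps or on whether git tracks the trace directory.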

Reviewer notes

Not installing @anthropic-ai/claude-agent-sdk in this PR is deliberate. The default claude path only imports it when --live and ANTHROPIC_API_KEY are both set (otherwise it dry-runs), so the default is safe without it; there's no Anthropic key available to smoke-test, and a live run is real-dollar spend the operator has asked to gate. Install + first live claude run is a one-step operator-gated follow-up (TML-2759). The adapter's mapping is unit-proven against the documented SDK shapes meanwhile.
tokens: null is honest, not a regression. It's now scoped to the cursor runtime; the spike (projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md) traced the whole Cursor SDK surface and found no token signal for local runs.
skills-contrib/ isn't a turbo workspace package, so it's gated by test:scripts + biome (as prior harness slices were), not turbo run typecheck. A pre-existing @prisma-next/target-postgres typecheck failure on main is unrelated to this diff.
Largest pieces to spot-check: claude-adapter.ts (the query() generator → OrchestratorRun adaptation, buffering the result message for wait()) and the run-one-brief.ts seam (token-source preference + per-runtime gating).
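
For reviewers spot-checking the run-one-brief.ts seam, the two behaviours named above (token-source preference and per-runtime gating) can be sketched roughly as follows. The field names come from this description; the function names `resolveTokens`, `keyVarFor`, and `isLive`, the exact type shapes, and the sample values are assumptions for illustration only.

```typescript
// Hypothetical sketch of the run-one-brief seam. Field names follow the PR
// description; function names and exact shapes are assumptions.

type Runtime = "claude" | "cursor";

interface TokenTotals {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  totalTokens: number;
}

interface RunOutcome {
  tokens: TokenTotals | null;
  costUsd: number | null;
  numTurns: number | null;
  durationMs: number | null;
  agentId: string | null;
}

/** Prefer totals reported on the outcome; otherwise use per-turn accumulation. */
function resolveTokens(
  outcome: RunOutcome,
  perTurnTotals: TokenTotals | null,
): TokenTotals | null {
  return outcome.tokens ?? perTurnTotals;
}

/** Name of the env var that gates a live run for each runtime. */
function keyVarFor(runtime: Runtime): string {
  return runtime === "claude" ? "ANTHROPIC_API_KEY" : "CURSOR_API_KEY";
}

/** A run goes live only when --live is set and the runtime's key is present. */
function isLive(
  runtime: Runtime,
  liveFlag: boolean,
  env: Record<string, string | undefined>,
): boolean {
  const key = env[keyVarFor(runtime)];
  return liveFlag && key !== undefined && key !== "";
}

// Illustrative outcomes: the claude one reuses the sample manifest's totals.
const claudeOutcome: RunOutcome = {
  tokens: {
    inputTokens: 33,
    outputTokens: 904,
    cacheReadTokens: 230827,
    cacheWriteTokens: 53995,
    totalTokens: 285759,
  },
  costUsd: 0.1839242,
  numTurns: 9,
  durationMs: 16025,
  agentId: null,
};
const cursorOutcome: RunOutcome = {
  tokens: null,
  costUsd: null,
  numTurns: null,
  durationMs: 16025,
  agentId: "agent-123",
};

const claudeTokens = resolveTokens(claudeOutcome, null); // totals from the outcome
const cursorTokens = resolveTokens(cursorOutcome, null); // null: no usage events seen
const missingKey = isLive("claude", true, {}); // false: no key, so the run dry-runs
```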

Verification

pnpm test:scripts — 616/616 pass (adds sdk-events, claude-events, and extended collect-run / prepare-run / run-one-brief / run-arm suites).
biome check on every touched file — clean. Transient-ID scan on the code diff — empty.

Follow-ups

Install @anthropic-ai/claude-agent-sdk + run the first live claude smoke run (claude-haiku-4-5) to confirm real token/cost capture end-to-end — operator-gated on a key + spend (TML-2759).
A non-SDK token source for the cursor runtime (admin/usage API) — out of scope (spike).

Alternatives considered

Stay on @cursor/sdk and source tokens post-run (spike option (a): analytics / cloud getRun by run_id). Rejected: analytics is emit-only with no token props; V1Run / RunResultMetadata carry none. Nothing to query — which is what motivated decoupling.
A raw-model-API thin agent loop (perfect token accounting, zero vendor lock-in). Rejected: it rebuilds subagents/tool-loop/skill-loading from scratch and measures a substrate no real harness uses (poor ecological validity) at high build cost.
mtime-based trace scoping. Rejected in favor of snapshot-exclusion, which is deterministic and correct for gitignored trace paths.
Removing the Cursor adapter. Rejected: it stays as a secondary because the seam makes it nearly free, and it lets us spot-check the Cursor substrate.

Checklist

All commits are signed off (git commit -s).
Change is scoped to one logical concern (faithful + decoupled harness runs).
Tests are updated.
PR title is in TML-NNNN: <sentence-case title> form.
Skill update: this PR is the skill update — skills-contrib/drive-judge-harness SKILL.md + KNOWN-ISSUES.md document both runtimes, the cursor token gap, and the wall-clock fallback.

… (TML-2757) Move isRecord, asString, extractUsage, extractText, streamEventFromMessage (renamed from toStreamEvent), agentIdFromMessage, and outcomeFromResult into a new sdk-events.ts that imports nothing from @cursor/sdk. sdk-adapter.ts imports the mappers from sdk-events.ts and remains the sole SDK importer. agentIdFromMessage reads the snake_case agent_id present on stream status and assistant messages (real @cursor/sdk@1.0.15 local-runtime shape). outcomeFromResult reads id→runId, status, and durationMs from the wait() outcome shape; degrades gracefully for non-records. test/sdk-events.test.ts feeds the real captured shapes from the spike and asserts all extraction paths — runs with @cursor/sdk not installed. Also commits the spike artifact and the run-fidelity slice spec/plan. Signed-off-by: Will Madden <madden@prisma.io>

…d (TML-2757) RunOutcome gains durationMs: number | null. sdk-adapter.ts captures agent_id from the first stream message that carries a non-null agentIdFromMessage result (the stream is fully drained before wait() is called, so capturedAgentId is available). outcomeFromResult threads durationMs from the wait() outcome; both flow into the returned RunOutcome. manifest.ts gains wall_clock_ms: number | null (populated from outcome.durationMs on live runs, null on dry-run / startup-failed / error paths). run-one-brief.ts: tokens is now null when no turn-ended events are observed (local runtime emits none); a finished live run with null tokens appends the exact note "tokens unavailable: @cursor/sdk local runtime emits no usage events (see spike 2026-05-31)" so the corpus records the honest signal. Tests updated: durationMs: null added to all existing RunOutcome mocks; new test "captures agent_id and wall_clock_ms from the outcome, and notes null tokens" verifies the happy path end-to-end. Signed-off-by: Will Madden <madden@prisma.io>
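
As a rough sketch of the capture described in the two commits above, the SDK-free mappers and the "first non-null wins" rule might look like this. The names `isRecord`, `asString`, and `agentIdFromMessage` come from the commit messages, but their bodies here are assumptions (for instance, treating an empty string as missing), and `captureAgentId` is a hypothetical helper standing in for the adapter's stream loop.

```typescript
// Hypothetical sketch: SDK-free mappers over unknown inputs.
// Bodies are assumptions; only the names come from the commit messages.

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

/** Return a non-empty string, or null for anything else. */
function asString(value: unknown): string | null {
  return typeof value === "string" && value.length > 0 ? value : null;
}

/** Read the snake_case agent_id from a stream message, or null. */
function agentIdFromMessage(message: unknown): string | null {
  return isRecord(message) ? asString(message["agent_id"]) : null;
}

/** Drain all messages and keep the first non-null agent id seen. */
function captureAgentId(messages: readonly unknown[]): string | null {
  let captured: string | null = null;
  for (const message of messages) {
    if (captured === null) {
      captured = agentIdFromMessage(message);
    }
  }
  return captured;
}

// Illustrative stream: non-records and messages without agent_id are skipped.
const capturedAgentId = captureAgentId([
  "not a record",
  { type: "status" },
  { type: "status", agent_id: "agent-abc" },
  { type: "assistant", agent_id: "agent-later" },
]);
```

Taking `unknown` inputs keeps the mapper testable with no SDK installed, which is the property the commit message relies on.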

…TracePaths (TML-2757)
- Extract findJsonlFiles into trace-files.ts (new module, no circular deps)
- PreparedRun gains preexistingTracePaths: string[], a snapshot of .jsonl files present under runDir immediately after the baseline commit
- collectRun filters the candidate set to exclude preexistingTracePaths so only traces emitted during the agent run are considered
- Tests: prepare-run: two new cases (empty baseline, baseline with a .jsonl)
- Tests: collect-run: new describe block "preexistingTracePaths exclusion" with three cases (run-emitted only, all excluded, agent_id match over run-emitted set)
Signed-off-by: Will Madden <madden@prisma.io>

…e sdk-events test (TML-2757) Wall-clock (durationMs) is the primary efficiency metric for local runs; tokens are null because the local @cursor/sdk runtime emits no usage events (spike-confirmed). Wire the new sdk-events test into test:scripts and drop a dead import. Signed-off-by: Will Madden <madden@prisma.io>

coderabbitai · 2026-05-31T16:02:50Z

Warning

Review limit reached

@wmadden-electric, we couldn't start this review because you've reached your PR review rate limit.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 13316a41-4970-455c-9e62-a4aba43cd9c0

📥 Commits

Reviewing files that changed from the base of the PR and between a6c7523 and 574146e.

⛔ Files ignored due to path filters (3)

pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
projects/drive-judge-harness/slices/claude-runtime/plan.md is excluded by !projects/**
projects/drive-judge-harness/slices/claude-runtime/spec.md is excluded by !projects/**

📒 Files selected for processing (15)

package.json
skills-contrib/drive-judge-harness/KNOWN-ISSUES.md
skills-contrib/drive-judge-harness/SKILL.md
skills-contrib/drive-judge-harness/claude-adapter.ts
skills-contrib/drive-judge-harness/claude-events.ts
skills-contrib/drive-judge-harness/manifest.ts
skills-contrib/drive-judge-harness/run-arm.ts
skills-contrib/drive-judge-harness/run-one-brief.ts
skills-contrib/drive-judge-harness/sdk-adapter.ts
skills-contrib/drive-judge-harness/sdk-events.ts
skills-contrib/drive-judge-harness/test/claude-events.test.ts
skills-contrib/drive-judge-harness/test/manifest.test.ts
skills-contrib/drive-judge-harness/test/run-arm.test.ts
skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts
skills-contrib/drive-judge-harness/test/run-one-brief.test.ts

📝 Walkthrough

Walkthrough

This PR refactors the drive-judge harness to handle local Cursor SDK runs that emit no token-usage signals. It extracts message mapping into a testable utility, filters preexisting trace files from collection, captures run duration in manifests as wall_clock_ms, and updates documentation to reflect that tokens is null for local runs.

Changes

Local runtime integration and trace management

| Layer / File(s) | Summary |
| --- | --- |
| SDK event mapping utilities and tests: `skills-contrib/drive-judge-harness/sdk-events.ts`, `skills-contrib/drive-judge-harness/test/sdk-events.test.ts`, `package.json` | New `sdk-events.ts` module exports pure type guards (`isRecord`, `asString`) and message/outcome mappers (`streamEventFromMessage`, `agentIdFromMessage`, `outcomeFromResult`, `extractUsage`, `extractText`) with comprehensive test coverage for valid shapes and degradation paths. Test suite added to `test:scripts`. |
| Trace file discovery utility: `skills-contrib/drive-judge-harness/trace-files.ts` | Exports `findJsonlFiles(dir)` helper for recursive `.jsonl` directory traversal with error handling, extracted for reuse across trace collection and preexisting-trace detection. |
| RunOutcome, PreparedRun, and manifest type updates: `skills-contrib/drive-judge-harness/run-one-brief.ts`, `skills-contrib/drive-judge-harness/prepare-run.ts`, `skills-contrib/drive-judge-harness/manifest.ts` | `RunOutcome` extends with `durationMs: number \| null`, `PreparedRun` adds `preexistingTracePaths: string[]`, and `RunManifest.tokens` docs clarify null cases for dry-run/startup-failure/no-usage scenarios. |
| Baseline trace snapshot in prepareRun: `skills-contrib/drive-judge-harness/prepare-run.ts`, `skills-contrib/drive-judge-harness/test/prepare-run.test.ts` | `prepareRun` captures all preexisting `.jsonl` files under `runDir` immediately after baseline commit via `findJsonlFiles(config.runDir)`, exposed as `preexistingTracePaths`. Tests validate snapshot on empty and committed trace directories. |
| Trace collection filtering: `skills-contrib/drive-judge-harness/collect-run.ts`, `skills-contrib/drive-judge-harness/test/collect-run.test.ts` | `collectRun` filters discovered traces to exclude `preexistingTracePaths`, ensuring only run-emitted traces are returned and matched by agent ID. Tests verify exclusion and interaction with matching logic. |
| SDK adapter refactoring: `skills-contrib/drive-judge-harness/sdk-adapter.ts` | Delegates message/outcome mapping to `sdk-events.ts` utilities; removes inline `toStreamEvent` and `adaptOutcome` helpers. Captures `agentId` from first streamed message; `wait()` returns `RunOutcome` with both `agentId` and `durationMs`. |
| Run duration and null-token handling: `skills-contrib/drive-judge-harness/run-one-brief.ts`, `skills-contrib/drive-judge-harness/test/run-one-brief.test.ts`, `skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts`, `skills-contrib/drive-judge-harness/test/run-arm.test.ts` | `run-one-brief.ts` captures orchestrator outcome duration, writes to manifests as `wall_clock_ms` (including `null` for dry-run/error), treats `tokens` as `null` when no usage events captured, and adds "tokens unavailable" note. Tests and mocks updated across all run paths to include `durationMs` and verify null-token behavior. |
| Documentation and fixture updates: `skills-contrib/drive-judge-harness/KNOWN-ISSUES.md`, `skills-contrib/drive-judge-harness/SKILL.md`, `skills-contrib/drive-judge-harness/test/manifest.test.ts` | SKILL.md and KNOWN-ISSUES.md document that local SDK emits no token events, `tokens: null` is expected, and wall-clock duration is primary efficiency metric. Test fixtures updated to include `wall_clock_ms` field in manifests. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

prisma/prisma-next#641: Modifies package.json test:scripts to include additional skills-contrib/drive-judge-harness test files in the node --test suite.
prisma/prisma-next#656: Extends prepare-run.ts and collect-run.ts pipeline with preexisting trace tracking and filtering, building directly on this PR's trace management foundation.

Suggested reviewers

wmadden
aqrln

🐰 A tale of tokens lost in the local winds,
Wall-clock measures now where usage spins,
Traces filtered clean, preexisting swept away,
Duration captured—the harness evolves today! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 43.75% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The PR title mentions capturing agent_id and wall-clock, scoping trace collection, and documenting token gaps, which align with the documented objectives; however, it also includes a reference to 'Cursor decoupled' and mentions running the harness 'by default' which are not prominently reflected in the actual code changes shown in the diff. | Clarify whether 'run the harness by default' and 'Cursor decoupled' are reflected in the changeset or if the title should focus on the primary changes: capturing agent_id, recording wall-clock duration, scoping trace collection, and documenting token limitations. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


github-actions · 2026-05-31T16:04:58Z

size-limit report 📦

| Path | Size |
| --- | --- |
| postgres / no-emit | 135.88 KB (0%) |
| postgres / emit | 125.59 KB (0%) |
| mongo / no-emit | 75.69 KB (0%) |
| mongo / emit | 70.68 KB (0%) |

pkg-pr-new · 2026-05-31T16:05:23Z


@prisma-next/extension-author-tools

npm i https://pkg.pr.new/@prisma-next/extension-author-tools@657

@prisma-next/mongo-runtime

npm i https://pkg.pr.new/@prisma-next/mongo-runtime@657

@prisma-next/family-mongo

npm i https://pkg.pr.new/@prisma-next/family-mongo@657

@prisma-next/sql-runtime

npm i https://pkg.pr.new/@prisma-next/sql-runtime@657

@prisma-next/family-sql

npm i https://pkg.pr.new/@prisma-next/family-sql@657

@prisma-next/extension-arktype-json

npm i https://pkg.pr.new/@prisma-next/extension-arktype-json@657

@prisma-next/middleware-cache

npm i https://pkg.pr.new/@prisma-next/middleware-cache@657

@prisma-next/mongo

npm i https://pkg.pr.new/@prisma-next/mongo@657

@prisma-next/extension-paradedb

npm i https://pkg.pr.new/@prisma-next/extension-paradedb@657

@prisma-next/extension-pgvector

npm i https://pkg.pr.new/@prisma-next/extension-pgvector@657

@prisma-next/extension-postgis

npm i https://pkg.pr.new/@prisma-next/extension-postgis@657

@prisma-next/postgres

npm i https://pkg.pr.new/@prisma-next/postgres@657

@prisma-next/sql-orm-client

npm i https://pkg.pr.new/@prisma-next/sql-orm-client@657

@prisma-next/sqlite

npm i https://pkg.pr.new/@prisma-next/sqlite@657

@prisma-next/target-mongo

npm i https://pkg.pr.new/@prisma-next/target-mongo@657

@prisma-next/adapter-mongo

npm i https://pkg.pr.new/@prisma-next/adapter-mongo@657

@prisma-next/driver-mongo

npm i https://pkg.pr.new/@prisma-next/driver-mongo@657

@prisma-next/contract

npm i https://pkg.pr.new/@prisma-next/contract@657

@prisma-next/utils

npm i https://pkg.pr.new/@prisma-next/utils@657

@prisma-next/config

npm i https://pkg.pr.new/@prisma-next/config@657

@prisma-next/errors

npm i https://pkg.pr.new/@prisma-next/errors@657

@prisma-next/framework-components

npm i https://pkg.pr.new/@prisma-next/framework-components@657

@prisma-next/operations

npm i https://pkg.pr.new/@prisma-next/operations@657

@prisma-next/ts-render

npm i https://pkg.pr.new/@prisma-next/ts-render@657

@prisma-next/contract-authoring

npm i https://pkg.pr.new/@prisma-next/contract-authoring@657

@prisma-next/ids

npm i https://pkg.pr.new/@prisma-next/ids@657

@prisma-next/psl-parser

npm i https://pkg.pr.new/@prisma-next/psl-parser@657

@prisma-next/psl-printer

npm i https://pkg.pr.new/@prisma-next/psl-printer@657

@prisma-next/cli

npm i https://pkg.pr.new/@prisma-next/cli@657

@prisma-next/cli-telemetry

npm i https://pkg.pr.new/@prisma-next/cli-telemetry@657

@prisma-next/emitter

npm i https://pkg.pr.new/@prisma-next/emitter@657

@prisma-next/migration-tools

npm i https://pkg.pr.new/@prisma-next/migration-tools@657

prisma-next

npm i https://pkg.pr.new/prisma-next@657

@prisma-next/vite-plugin-contract-emit

npm i https://pkg.pr.new/@prisma-next/vite-plugin-contract-emit@657

@prisma-next/mongo-codec

npm i https://pkg.pr.new/@prisma-next/mongo-codec@657

@prisma-next/mongo-contract

npm i https://pkg.pr.new/@prisma-next/mongo-contract@657

@prisma-next/mongo-value

npm i https://pkg.pr.new/@prisma-next/mongo-value@657

@prisma-next/mongo-contract-psl

npm i https://pkg.pr.new/@prisma-next/mongo-contract-psl@657

@prisma-next/mongo-contract-ts

npm i https://pkg.pr.new/@prisma-next/mongo-contract-ts@657

@prisma-next/mongo-emitter

npm i https://pkg.pr.new/@prisma-next/mongo-emitter@657

@prisma-next/mongo-schema-ir

npm i https://pkg.pr.new/@prisma-next/mongo-schema-ir@657

@prisma-next/mongo-query-ast

npm i https://pkg.pr.new/@prisma-next/mongo-query-ast@657

@prisma-next/mongo-orm

npm i https://pkg.pr.new/@prisma-next/mongo-orm@657

@prisma-next/mongo-query-builder

npm i https://pkg.pr.new/@prisma-next/mongo-query-builder@657

@prisma-next/mongo-lowering

npm i https://pkg.pr.new/@prisma-next/mongo-lowering@657

@prisma-next/mongo-wire

npm i https://pkg.pr.new/@prisma-next/mongo-wire@657

@prisma-next/sql-contract

npm i https://pkg.pr.new/@prisma-next/sql-contract@657

@prisma-next/sql-errors

npm i https://pkg.pr.new/@prisma-next/sql-errors@657

@prisma-next/sql-operations

npm i https://pkg.pr.new/@prisma-next/sql-operations@657

@prisma-next/sql-schema-ir

npm i https://pkg.pr.new/@prisma-next/sql-schema-ir@657

@prisma-next/sql-contract-psl

npm i https://pkg.pr.new/@prisma-next/sql-contract-psl@657

@prisma-next/sql-contract-ts

npm i https://pkg.pr.new/@prisma-next/sql-contract-ts@657

@prisma-next/sql-contract-emitter

npm i https://pkg.pr.new/@prisma-next/sql-contract-emitter@657

@prisma-next/sql-lane-query-builder

npm i https://pkg.pr.new/@prisma-next/sql-lane-query-builder@657

@prisma-next/sql-relational-core

npm i https://pkg.pr.new/@prisma-next/sql-relational-core@657

@prisma-next/sql-builder

npm i https://pkg.pr.new/@prisma-next/sql-builder@657

@prisma-next/target-postgres

npm i https://pkg.pr.new/@prisma-next/target-postgres@657

@prisma-next/target-sqlite

npm i https://pkg.pr.new/@prisma-next/target-sqlite@657

@prisma-next/adapter-postgres

npm i https://pkg.pr.new/@prisma-next/adapter-postgres@657

@prisma-next/adapter-sqlite

npm i https://pkg.pr.new/@prisma-next/adapter-sqlite@657

@prisma-next/driver-postgres

npm i https://pkg.pr.new/@prisma-next/driver-postgres@657

@prisma-next/driver-sqlite

npm i https://pkg.pr.new/@prisma-next/driver-sqlite@657

commit: 574146e

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@skills-contrib/drive-judge-harness/KNOWN-ISSUES.md`:
- Line 60: Replace the transient spike path reference in KNOWN-ISSUES.md (the
"projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md"
mention) with a stable link to a durable doc or a short inline summary of the
finding; ensure the note about the probe against `@cursor/sdk@1.0.15` remains but
either link to a long-lived documentation page/section or paste a one- or
two-sentence summary of the spike result so the entry does not depend on a
projects/ spike artifact that may move.

In `@skills-contrib/drive-judge-harness/SKILL.md`:
- Around line 175-176: Replace the transient spike file reference
`projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md` in
the sentence that mentions `tokens` with a stable reference: either a
KNOWN-ISSUES anchor (e.g., `KNOWN-ISSUES.md` or `#sdk-token-usage`) or a
one-sentence embedded summary of the spike's relevant finding; update the prose
where `tokens` is described so it reads with the stable link/summary instead of
the `projects/...` path.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: cf4b5d80-69f4-4ae0-ac50-89746f307427

📥 Commits

Reviewing files that changed from the base of the PR and between f779815 and a6c7523.

⛔ Files ignored due to path filters (3)

projects/drive-judge-harness/slices/run-fidelity/plan.md is excluded by !projects/**
projects/drive-judge-harness/slices/run-fidelity/spec.md is excluded by !projects/**
projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md is excluded by !projects/**

📒 Files selected for processing (17)

package.json
skills-contrib/drive-judge-harness/KNOWN-ISSUES.md
skills-contrib/drive-judge-harness/SKILL.md
skills-contrib/drive-judge-harness/collect-run.ts
skills-contrib/drive-judge-harness/manifest.ts
skills-contrib/drive-judge-harness/prepare-run.ts
skills-contrib/drive-judge-harness/run-one-brief.ts
skills-contrib/drive-judge-harness/sdk-adapter.ts
skills-contrib/drive-judge-harness/sdk-events.ts
skills-contrib/drive-judge-harness/test/collect-run.test.ts
skills-contrib/drive-judge-harness/test/manifest.test.ts
skills-contrib/drive-judge-harness/test/prepare-run.test.ts
skills-contrib/drive-judge-harness/test/run-arm.test.ts
skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts
skills-contrib/drive-judge-harness/test/run-one-brief.test.ts
skills-contrib/drive-judge-harness/test/sdk-events.test.ts
skills-contrib/drive-judge-harness/trace-files.ts

…ecoupling (TML-2759) Signed-off-by: Will Madden <madden@prisma.io>

Adds claude-events.ts (zero SDK imports) with usageFromAssistant, streamEventFromMessage, and outcomeFromResult over unknown inputs. Maps Claude snake_case fields to harness camelCase: cache_creation_input_tokens -> cacheWriteTokens, cache_read_input_tokens -> cacheReadTokens, session_id -> runId. outcomeFromResult builds TokenTotals via accumulateUsage and returns status/runId/tokens/durationMs/costUsd/numTurns or null for non-result messages. Fully unit-tested in test/claude-events.test.ts with real success and error_max_turns fixtures. All 17 tests pass with the SDK absent. Signed-off-by: Will Madden <madden@prisma.io>
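
To make the field mapping in this commit concrete, here is a minimal sketch of a no-SDK mapper for the terminal result message. It is an illustration under assumptions, not the code in claude-events.ts: the input shape (a `type: "result"` record carrying `usage`, `total_cost_usd`, `num_turns`, `duration_ms`, and `session_id`) is inferred from the fields named in this PR, `tokensFromUsage` and `num` are hypothetical helpers, and `totalTokens` is computed as the sum of the four counters because that matches the sample manifest in the description. Status mapping is left out.

```typescript
// Hypothetical sketch of an SDK-free mapper for the Claude result message.
// The input shape is assumed from the fields named in the PR description.

interface TokenTotals {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  totalTokens: number;
}

interface ClaudeOutcome {
  runId: string | null;
  tokens: TokenTotals;
  costUsd: number | null;
  numTurns: number | null;
  durationMs: number | null;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

/** Return a finite number, or null for anything else. */
function num(value: unknown): number | null {
  return typeof value === "number" && isFinite(value) ? value : null;
}

/** Map snake_case usage counters to camelCase totals; missing counters count as 0. */
function tokensFromUsage(usage: unknown): TokenTotals {
  const u: Record<string, unknown> = isRecord(usage) ? usage : {};
  const inputTokens = num(u["input_tokens"]) ?? 0;
  const outputTokens = num(u["output_tokens"]) ?? 0;
  const cacheReadTokens = num(u["cache_read_input_tokens"]) ?? 0;
  const cacheWriteTokens = num(u["cache_creation_input_tokens"]) ?? 0;
  return {
    inputTokens,
    outputTokens,
    cacheReadTokens,
    cacheWriteTokens,
    totalTokens: inputTokens + outputTokens + cacheReadTokens + cacheWriteTokens,
  };
}

/** Build an outcome from a result message; return null for non-result messages. */
function outcomeFromResult(message: unknown): ClaudeOutcome | null {
  if (!isRecord(message) || message["type"] !== "result") return null;
  const sessionId = message["session_id"];
  return {
    runId: typeof sessionId === "string" ? sessionId : null,
    tokens: tokensFromUsage(message["usage"]),
    costUsd: num(message["total_cost_usd"]),
    numTurns: num(message["num_turns"]),
    durationMs: num(message["duration_ms"]),
  };
}

// Fixture built from the sample manifest numbers; the session id is made up.
const sampleOutcome = outcomeFromResult({
  type: "result",
  session_id: "session-123",
  usage: {
    input_tokens: 33,
    output_tokens: 904,
    cache_read_input_tokens: 230827,
    cache_creation_input_tokens: 53995,
  },
  total_cost_usd: 0.1839242,
  num_turns: 9,
  duration_ms: 16025,
});
```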

…me/RunManifest
- claude-adapter.ts: CreateAgent over query(), lazy-imports SDK, requires ANTHROPIC_API_KEY, captures terminal result message for wait()
- RunOutcome gains tokens/costUsd/numTurns; run-one-brief prefers outcome.tokens, falls back to per-turn accumulation (Cursor path)
- RunManifest gains runtime/cost_usd/num_turns; all manifest writes set them
- RunOneBriefConfig/RunArmConfig gain runtime (default "claude") and optional maxBudgetUsd; key gate uses ANTHROPIC_API_KEY or CURSOR_API_KEY based on runtime selection
- sdk-adapter wait() returns tokens:null/costUsd:null/numTurns:null
- CLIs gain --runtime <claude|cursor> and --max-budget-usd <n>
- Tests updated: all configs gain runtime field, outcomes gain new null fields, new tests assert Claude-shaped outcome populates manifest runtime/cost_usd/num_turns/wall_clock_ms correctly
Signed-off-by: Will Madden <madden@prisma.io>

…ude-events test (TML-2759) Claude Agent SDK is the default runtime (native tokens/cost/wall-clock/turns); Cursor is selectable via --runtime cursor and keeps its documented local token gap. Install of @anthropic-ai/claude-agent-sdk + first live claude run is an operator-gated follow-up (needs ANTHROPIC_API_KEY + authorized spend). Signed-off-by: Will Madden <madden@prisma.io>

…, drop transient spike path Durable skill docs/code must not link to projects/ artifacts that get deleted at close-out. The spike finding already lives in KNOWN-ISSUES.md section 2; reference that instead. Signed-off-by: Will Madden <madden@prisma.io>

… default-runtime dependency The claude runtime is the harness default; declare its SDK so the lazy import resolves without an ad-hoc install. Signed-off-by: Will Madden <madden@prisma.io>

wmadden

Blocked until I test run this end to end

wmadden added 4 commits May 31, 2026 17:51

wmadden-electric requested a review from a team as a code owner May 31, 2026 16:02

coderabbitai Bot reviewed May 31, 2026


Comment thread skills-contrib/drive-judge-harness/KNOWN-ISSUES.md Outdated

Comment thread skills-contrib/drive-judge-harness/SKILL.md Outdated

wmadden added 4 commits May 31, 2026 18:12

docs(drive-judge-harness): spec + plan for Claude Agent SDK runtime d…

a00ae29


wmadden-electric changed the title ~~TML-2757: capture agent_id + wall-clock, scope trace collection, document token gap~~ TML-2759: run the harness on the Claude Agent SDK by default (Cursor decoupled) + faithful run recording May 31, 2026

wmadden approved these changes Jun 1, 2026


wmadden added 2 commits June 1, 2026 11:12

build(drive-judge-harness): add @anthropic-ai/claude-agent-sdk as the…

574146e


wmadden requested changes Jun 1, 2026


wmadden-electric wants to merge 10 commits into main from tml-2757-run-fidelity


2 participants
