TML-2759: run the harness on the Claude Agent SDK by default (Cursor decoupled) + faithful run recording #657
wmadden-electric wants to merge 10 commits into
… (TML-2757) Move isRecord, asString, extractUsage, extractText, streamEventFromMessage (renamed from toStreamEvent), agentIdFromMessage, and outcomeFromResult into a new sdk-events.ts that imports nothing from @cursor/sdk. sdk-adapter.ts imports the mappers from sdk-events.ts and remains the sole SDK importer. agentIdFromMessage reads the snake_case agent_id present on stream status and assistant messages (real @cursor/sdk@1.0.15 local-runtime shape). outcomeFromResult reads id→runId, status, and durationMs from the wait() outcome shape; degrades gracefully for non-records. test/sdk-events.test.ts feeds the real captured shapes from the spike and asserts all extraction paths — runs with @cursor/sdk not installed. Also commits the spike artifact and the run-fidelity slice spec/plan. Signed-off-by: Will Madden <madden@prisma.io>
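A minimal sketch of what SDK-free mappers of this kind could look like, included only to illustrate the commit above. The real `sdk-events.ts` is not shown on this page, so the field handling and the `Outcome` type here are assumptions inferred from the commit message (snake_case `agent_id` on stream messages; `id`, `status`, and `durationMs` on the `wait()` outcome).

```typescript
// Sketch only: shapes are inferred from the commit message, not copied from sdk-events.ts.
type Outcome = {
  runId: string | null;
  status: string | null;
  durationMs: number | null;
};

// Narrow an unknown value to a plain object so fields can be read safely.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Accept only non-empty strings; everything else becomes null.
function asString(value: unknown): string | null {
  return typeof value === "string" && value.length > 0 ? value : null;
}

// Reads the snake_case agent_id carried by stream status/assistant messages.
function agentIdFromMessage(message: unknown): string | null {
  return isRecord(message) ? asString(message["agent_id"]) : null;
}

// Maps id -> runId and threads status/durationMs; non-records degrade to nulls.
function outcomeFromResult(result: unknown): Outcome {
  if (!isRecord(result)) {
    return { runId: null, status: null, durationMs: null };
  }
  const duration = result["durationMs"];
  return {
    runId: asString(result["id"]),
    status: asString(result["status"]),
    durationMs:
      typeof duration === "number" && Number.isFinite(duration) ? duration : null,
  };
}
```

Because every input is typed `unknown`, functions like these can be exercised with captured fixtures and no SDK installed, which is the property the commit is after.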
…d (TML-2757) RunOutcome gains durationMs: number | null. sdk-adapter.ts captures agent_id from the first stream message that carries a non-null agentIdFromMessage result (the stream is fully drained before wait() is called, so capturedAgentId is available). outcomeFromResult threads durationMs from the wait() outcome; both flow into the returned RunOutcome. manifest.ts gains wall_clock_ms: number | null (populated from outcome.durationMs on live runs, null on dry-run / startup-failed / error paths). run-one-brief.ts: tokens is now null when no turn-ended events are observed (local runtime emits none); a finished live run with null tokens appends the exact note "tokens unavailable: @cursor/sdk local runtime emits no usage events (see spike 2026-05-31)" so the corpus records the honest signal. Tests updated: durationMs: null added to all existing RunOutcome mocks; new test "captures agent_id and wall_clock_ms from the outcome, and notes null tokens" verifies the happy path end-to-end. Signed-off-by: Will Madden <madden@prisma.io>
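The null-token rule in this commit can be sketched roughly as follows. Only the note string is taken from the commit message; `TurnUsage`, `totalsFromTurns`, and `notesForRun` are hypothetical names used for illustration, not the harness's actual API.

```typescript
// Sketch only: helper names and types are hypothetical; the note text is from the commit message.
const TOKENS_UNAVAILABLE_NOTE =
  "tokens unavailable: @cursor/sdk local runtime emits no usage events (see spike 2026-05-31)";

type TurnUsage = { inputTokens: number; outputTokens: number };
type TokenTotals = TurnUsage & { totalTokens: number };

// Returns null (rather than zeros) when no turn-ended events were observed,
// so "no signal" is distinguishable from "zero tokens".
function totalsFromTurns(turns: readonly TurnUsage[]): TokenTotals | null {
  if (turns.length === 0) return null;
  let inputTokens = 0;
  let outputTokens = 0;
  for (const turn of turns) {
    inputTokens += turn.inputTokens;
    outputTokens += turn.outputTokens;
  }
  return { inputTokens, outputTokens, totalTokens: inputTokens + outputTokens };
}

// A finished live run with null tokens gets the explanatory note; other runs get none.
function notesForRun(
  live: boolean,
  status: string,
  tokens: TokenTotals | null,
): string[] {
  return live && status === "finished" && tokens === null
    ? [TOKENS_UNAVAILABLE_NOTE]
    : [];
}
```

Keeping `null` distinct from a zeroed total is what lets the corpus record the gap honestly instead of implying a free run.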
…TracePaths (TML-2757)
- Extract findJsonlFiles into trace-files.ts (new module, no circular deps)
- PreparedRun gains preexistingTracePaths: string[], a snapshot of the .jsonl files present under runDir immediately after the baseline commit
- collectRun filters the candidate set to exclude preexistingTracePaths so only traces emitted during the agent run are considered
- Tests: prepare-run: two new cases (empty baseline, baseline with a .jsonl)
- Tests: collect-run: new describe block "preexistingTracePaths exclusion" with three cases (run-emitted only, all excluded, agent_id match over run-emitted set)
Signed-off-by: Will Madden <madden@prisma.io>
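The exclusion step described in this commit amounts to a set difference. A minimal sketch, assuming both lists hold paths normalized the same way; `runEmittedTraces` is a hypothetical name, not the function in `collect-run.ts`.

```typescript
// Sketch only: runEmittedTraces is a hypothetical helper, and it assumes both
// arrays contain paths normalized identically (e.g. all absolute).
function runEmittedTraces(
  candidates: readonly string[],
  preexisting: readonly string[],
): string[] {
  // A Set gives constant-time membership checks for the baseline snapshot.
  const baseline = new Set(preexisting);
  return candidates.filter((path) => !baseline.has(path));
}
```

Snapshotting file paths rather than diffing git state means the filter still works when traces land in a gitignored directory.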
…e sdk-events test (TML-2757) Wall-clock (durationMs) is the primary efficiency metric for local runs; tokens are null because the local @cursor/sdk runtime emits no usage events (spike-confirmed). Wire the new sdk-events test into test:scripts and drop a dead import. Signed-off-by: Will Madden <madden@prisma.io>
📝 Walkthrough
This PR refactors the drive-judge harness to handle local Cursor SDK runs that emit no token-usage signals. It extracts message mapping into a testable utility, filters preexisting trace files from collection, captures run duration in manifests as
Changes
Local runtime integration and trace management
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
size-limit report 📦
@prisma-next/extension-author-tools
@prisma-next/mongo-runtime
@prisma-next/family-mongo
@prisma-next/sql-runtime
@prisma-next/family-sql
@prisma-next/extension-arktype-json
@prisma-next/middleware-cache
@prisma-next/mongo
@prisma-next/extension-paradedb
@prisma-next/extension-pgvector
@prisma-next/extension-postgis
@prisma-next/postgres
@prisma-next/sql-orm-client
@prisma-next/sqlite
@prisma-next/target-mongo
@prisma-next/adapter-mongo
@prisma-next/driver-mongo
@prisma-next/contract
@prisma-next/utils
@prisma-next/config
@prisma-next/errors
@prisma-next/framework-components
@prisma-next/operations
@prisma-next/ts-render
@prisma-next/contract-authoring
@prisma-next/ids
@prisma-next/psl-parser
@prisma-next/psl-printer
@prisma-next/cli
@prisma-next/cli-telemetry
@prisma-next/emitter
@prisma-next/migration-tools
prisma-next
@prisma-next/vite-plugin-contract-emit
@prisma-next/mongo-codec
@prisma-next/mongo-contract
@prisma-next/mongo-value
@prisma-next/mongo-contract-psl
@prisma-next/mongo-contract-ts
@prisma-next/mongo-emitter
@prisma-next/mongo-schema-ir
@prisma-next/mongo-query-ast
@prisma-next/mongo-orm
@prisma-next/mongo-query-builder
@prisma-next/mongo-lowering
@prisma-next/mongo-wire
@prisma-next/sql-contract
@prisma-next/sql-errors
@prisma-next/sql-operations
@prisma-next/sql-schema-ir
@prisma-next/sql-contract-psl
@prisma-next/sql-contract-ts
@prisma-next/sql-contract-emitter
@prisma-next/sql-lane-query-builder
@prisma-next/sql-relational-core
@prisma-next/sql-builder
@prisma-next/target-postgres
@prisma-next/target-sqlite
@prisma-next/adapter-postgres
@prisma-next/adapter-sqlite
@prisma-next/driver-postgres
@prisma-next/driver-sqlite
commit: |
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@skills-contrib/drive-judge-harness/KNOWN-ISSUES.md`:
- Line 60: Replace the transient spike path reference in KNOWN-ISSUES.md (the
"projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md"
mention) with a stable link to a durable doc or a short inline summary of the
finding; ensure the note about the probe against `@cursor/sdk@1.0.15` remains but
either link to a long-lived documentation page/section or paste a one- or
two-sentence summary of the spike result so the entry does not depend on a
projects/ spike artifact that may move.
In `@skills-contrib/drive-judge-harness/SKILL.md`:
- Around line 175-176: Replace the transient spike file reference
`projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md` in
the sentence that mentions `tokens` with a stable reference: either a
KNOWN-ISSUES anchor (e.g., `KNOWN-ISSUES.md` or `#sdk-token-usage`) or a
one-sentence embedded summary of the spike's relevant finding; update the prose
where `tokens` is described so it reads with the stable link/summary instead of
the `projects/...` path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: cf4b5d80-69f4-4ae0-ac50-89746f307427
⛔ Files ignored due to path filters (3)
- projects/drive-judge-harness/slices/run-fidelity/plan.md is excluded by !projects/**
- projects/drive-judge-harness/slices/run-fidelity/spec.md is excluded by !projects/**
- projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md is excluded by !projects/**
📒 Files selected for processing (17)
- package.json
- skills-contrib/drive-judge-harness/KNOWN-ISSUES.md
- skills-contrib/drive-judge-harness/SKILL.md
- skills-contrib/drive-judge-harness/collect-run.ts
- skills-contrib/drive-judge-harness/manifest.ts
- skills-contrib/drive-judge-harness/prepare-run.ts
- skills-contrib/drive-judge-harness/run-one-brief.ts
- skills-contrib/drive-judge-harness/sdk-adapter.ts
- skills-contrib/drive-judge-harness/sdk-events.ts
- skills-contrib/drive-judge-harness/test/collect-run.test.ts
- skills-contrib/drive-judge-harness/test/manifest.test.ts
- skills-contrib/drive-judge-harness/test/prepare-run.test.ts
- skills-contrib/drive-judge-harness/test/run-arm.test.ts
- skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts
- skills-contrib/drive-judge-harness/test/run-one-brief.test.ts
- skills-contrib/drive-judge-harness/test/sdk-events.test.ts
- skills-contrib/drive-judge-harness/trace-files.ts
…ecoupling (TML-2759) Signed-off-by: Will Madden <madden@prisma.io>
Adds claude-events.ts (zero SDK imports) with usageFromAssistant, streamEventFromMessage, and outcomeFromResult over unknown inputs. Maps Claude snake_case fields to harness camelCase: cache_creation_input_tokens -> cacheWriteTokens, cache_read_input_tokens -> cacheReadTokens, session_id -> runId. outcomeFromResult builds TokenTotals via accumulateUsage and returns status/runId/tokens/durationMs/costUsd/numTurns or null for non-result messages. Fully unit-tested in test/claude-events.test.ts with real success and error_max_turns fixtures. All 17 tests pass with the SDK absent. Signed-off-by: Will Madden <madden@prisma.io>
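A rough sketch of the snake_case to camelCase mapping this commit describes. The real `claude-events.ts` is not shown here, so `usageFromRaw` and the `Usage` type are hypothetical. Defining `totalTokens` as the sum of all four counters is an assumption, chosen because it agrees with the example manifest in the PR description (33 + 904 + 230827 + 53995 = 285759).

```typescript
// Sketch only: usageFromRaw and Usage are hypothetical names; the field
// renames follow the commit message, and totalTokens as a four-way sum is an
// assumption that matches the example manifest in the PR description.
type Usage = {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  totalTokens: number;
};

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Missing or malformed counters are treated as zero.
function count(value: unknown): number {
  return typeof value === "number" && Number.isFinite(value) ? value : 0;
}

function usageFromRaw(raw: unknown): Usage | null {
  if (!isRecord(raw)) return null;
  const inputTokens = count(raw["input_tokens"]);
  const outputTokens = count(raw["output_tokens"]);
  const cacheReadTokens = count(raw["cache_read_input_tokens"]);
  const cacheWriteTokens = count(raw["cache_creation_input_tokens"]);
  return {
    inputTokens,
    outputTokens,
    cacheReadTokens,
    cacheWriteTokens,
    totalTokens: inputTokens + outputTokens + cacheReadTokens + cacheWriteTokens,
  };
}
```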
…me/RunManifest
- claude-adapter.ts: CreateAgent over query(), lazy-imports SDK, requires ANTHROPIC_API_KEY, captures terminal result message for wait()
- RunOutcome gains tokens/costUsd/numTurns; run-one-brief prefers outcome.tokens, falls back to per-turn accumulation (Cursor path)
- RunManifest gains runtime/cost_usd/num_turns; all manifest writes set them
- RunOneBriefConfig/RunArmConfig gain runtime (default "claude") and optional maxBudgetUsd; key gate uses ANTHROPIC_API_KEY or CURSOR_API_KEY based on runtime selection
- sdk-adapter wait() returns tokens:null/costUsd:null/numTurns:null
- CLIs gain --runtime <claude|cursor> and --max-budget-usd <n>
- Tests updated: all configs gain runtime field, outcomes gain new null fields, new tests assert Claude-shaped outcome populates manifest runtime/cost_usd/num_turns/wall_clock_ms correctly
Signed-off-by: Will Madden <madden@prisma.io>
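The per-runtime key gate and the token-source preference from this commit could be sketched as below. All function names are hypothetical, and the environment is passed in as a plain object so the logic stays testable; the actual wiring in `run-one-brief.ts` may differ.

```typescript
// Sketch only: parseRuntime, isLiveRun, and pickTokens are hypothetical names.
type Runtime = "claude" | "cursor";
type Env = Record<string, string | undefined>;

const KEY_FOR_RUNTIME: Record<Runtime, string> = {
  claude: "ANTHROPIC_API_KEY",
  cursor: "CURSOR_API_KEY",
};

// --runtime defaults to "claude"; anything else unknown is rejected.
function parseRuntime(flag: string | undefined): Runtime {
  if (flag === undefined) return "claude";
  if (flag === "claude" || flag === "cursor") return flag;
  throw new Error(`unknown --runtime value: ${flag}`);
}

// Only --live plus the selected runtime's key reaches an SDK; otherwise dry-run.
function isLiveRun(live: boolean, runtime: Runtime, env: Env): boolean {
  const key = env[KEY_FOR_RUNTIME[runtime]];
  return live && typeof key === "string" && key.length > 0;
}

// Prefer totals reported on the outcome (claude); fall back to per-turn
// accumulation (cursor), which may itself be null.
function pickTokens<T>(outcomeTokens: T | null, accumulated: T | null): T | null {
  return outcomeTokens ?? accumulated;
}
```

For example, `isLiveRun(true, parseRuntime(undefined), { CURSOR_API_KEY: "k" })` is false under this sketch, because the default runtime looks for `ANTHROPIC_API_KEY`.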
…ude-events test (TML-2759) Claude Agent SDK is the default runtime (native tokens/cost/wall-clock/turns); Cursor is selectable via --runtime cursor and keeps its documented local token gap. Install of @anthropic-ai/claude-agent-sdk + first live claude run is an operator-gated follow-up (needs ANTHROPIC_API_KEY + authorized spend). Signed-off-by: Will Madden <madden@prisma.io>
…, drop transient spike path Durable skill docs/code must not link to projects/ artifacts that get deleted at close-out. The spike finding already lives in KNOWN-ISSUES.md section 2; reference that instead. Signed-off-by: Will Madden <madden@prisma.io>
… default-runtime dependency The claude runtime is the harness default; declare its SDK so the lazy import resolves without an ad-hoc install. Signed-off-by: Will Madden <madden@prisma.io>
wmadden
left a comment
Blocked until I test run this end to end
Linked issue
Refs TML-2759 (decouple the runtime) and TML-2757 (faithful run recording). Surfaced by the first live `run-arm` (run-setup slice, TML-2755, #656); unblocks the experiment engine (TML-2737).
At a glance
The first live harness run proved the pipeline but recorded a blank, polluted run on `@cursor/sdk` (no tokens, because its local runtime emits none; `agent_id: null`; and 5 stray base traces). This PR makes runs faithful and decoupled from Cursor: the harness now runs the Drive orchestrator on the Claude Agent SDK by default, which reports tokens, USD cost, wall-clock, and turns natively.
```
{
  "runtime": "claude",
  "model": "claude-haiku-4-5",
  "status": "finished",
  "run_id": "<session-id>",
  "tokens": {
    "inputTokens": 33,
    "outputTokens": 904,
    "cacheReadTokens": 230827,
    "cacheWriteTokens": 53995,
    "totalTokens": 285759
  },
  "cost_usd": 0.1839242,
  "num_turns": 9,
  "wall_clock_ms": 16025,
  "collected_trace_paths": [ /* only the trace this run emitted */ ]
}
```
`--runtime cursor` still works for spot-checking the Cursor substrate; it records `tokens: null` (the documented local gap) and relies on `wall_clock_ms`.
Decision
Two entangled pieces, both about recording a run faithfully:
- … (`CreateAgent`/`OrchestratorRun`/`RunOutcome`) with one adapter. This adds a second adapter over Anthropic's `@anthropic-ai/claude-agent-sdk` and makes it the default, because: it reports per-run `usage` + `total_cost_usd` + `duration_ms` + `num_turns` on its result message (the signal Cursor's local runtime never gave us); it's the native home of the SKILL.md + subagent conventions the drive-* skills use (our skill-injection already materializes `.claude/skills/`, which is exactly its discovery model); and `maxBudgetUsd` gives a hard per-run dollar cap. Cursor stays as a selectable secondary.
- … `agent_id` from the stream `status` message (not the cloud-shaped outcome); capture `wall_clock_ms`; scope `collect-run` to traces emitted during the run. Tokens stay `null` for the cursor runtime with a documented gap (spike `2026-05-31`).
How it fits together
- … `sdk-events.ts` (Cursor) and `claude-events.ts` (Claude), so extraction logic is unit-testable with neither SDK installed, and each `*-adapter.ts` stays the sole importer of its SDK behind a lazy import. The live gate (only `--live` + the runtime's key reaches an SDK) is preserved.
- `RunOutcome` carries `tokens`/`costUsd`/`numTurns`/`durationMs`/`agentId`; `run-one-brief` prefers `outcome.tokens` (claude) and falls back to per-turn accumulation (cursor). `RunManifest` gains `runtime`, `cost_usd`, `num_turns`, `wall_clock_ms`.
- `--runtime <claude|cursor>` (default claude) picks the adapter via a lazy import; the live gate keys off the runtime's env var (`ANTHROPIC_API_KEY` / `CURSOR_API_KEY`); `--max-budget-usd` threads to the claude adapter.
- `prepare-run` snapshots the `.jsonl` present after its baseline commit; `collect-run` excludes that set, which is robust to the gitignored `wip/drive-trace/` location the real trace landed in (a git-diff approach would miss it).
Reviewer notes
- … `@anthropic-ai/claude-agent-sdk` in this PR is deliberate. The default claude path only imports it when `--live` and `ANTHROPIC_API_KEY` are both set (otherwise it dry-runs), so the default is safe without it; there's no Anthropic key available to smoke-test, and a live run is real-dollar spend the operator has asked to gate. Install + first live claude run is a one-step operator-gated follow-up (TML-2759). The adapter's mapping is unit-proven against the documented SDK shapes meanwhile.
- `tokens: null` is honest, not a regression. It's now scoped to the cursor runtime; the spike (`projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md`) traced the whole Cursor SDK surface and found no token signal for local runs.
- `skills-contrib/` isn't a turbo workspace package, so it's gated by `test:scripts` + biome (as prior harness slices were), not `turbo run typecheck`. A pre-existing `@prisma-next/target-postgres` typecheck failure on `main` is unrelated to this diff.
- … `claude-adapter.ts` (the `query()` generator → `OrchestratorRun` adaptation, buffering the result message for `wait()`) and the `run-one-brief.ts` seam (token-source preference + per-runtime gating).
Verification
- `pnpm test:scripts`: 616/616 pass (adds `sdk-events`, `claude-events`, and extended `collect-run` / `prepare-run` / `run-one-brief` / `run-arm` suites).
- `biome check` on every touched file: clean.
- Transient-ID scan on the code diff: empty.
Follow-ups
- … `@anthropic-ai/claude-agent-sdk` + run the first live claude smoke run (`claude-haiku-4-5`) to confirm real token/cost capture end-to-end; operator-gated on a key + spend (TML-2759).
Alternatives considered
- … `@cursor/sdk` and source tokens post-run (spike option (a): `analytics` / cloud `getRun` by `run_id`). Rejected: `analytics` is emit-only with no token props; `V1Run`/`RunResultMetadata` carry none. Nothing to query, which is what motivated decoupling.
Checklist
- … `git commit -s`).
- … `TML-NNNN: <sentence-case title>` form.
- `skills-contrib/drive-judge-harness` SKILL.md + KNOWN-ISSUES.md document both runtimes, the cursor token gap, and the wall-clock fallback.