TML-2759: run the harness on the Claude Agent SDK by default (Cursor decoupled) + faithful run recording#657

Open
wmadden-electric wants to merge 10 commits into main from tml-2757-run-fidelity

@wmadden-electric wmadden-electric commented May 31, 2026

Linked issue

Refs TML-2759 (decouple the runtime) and TML-2757 (faithful run recording). Surfaced by the first live run-arm (run-setup slice, TML-2755, #656); unblocks the experiment engine (TML-2737).

At a glance

The first live harness run proved the pipeline works, but on @cursor/sdk it recorded a blank, polluted run: no tokens (the local runtime emits none), agent_id: null, and 5 stray base traces. This PR makes runs faithful and decoupled from Cursor: the harness now runs the Drive orchestrator on the Claude Agent SDK by default, which reports tokens, USD cost, wall-clock time, and turn count natively.

{ "runtime": "claude", "model": "claude-haiku-4-5", "status": "finished",
  "run_id": "<session-id>",
  "tokens": { "inputTokens": 33, "outputTokens": 904, "cacheReadTokens": 230827, "cacheWriteTokens": 53995, "totalTokens": 285759 },
  "cost_usd": 0.1839242, "num_turns": 9, "wall_clock_ms": 16025,
  "collected_trace_paths": [ /* only the trace this run emitted */ ] }

--runtime cursor still works for spot-checking the Cursor substrate; it records tokens: null (the documented local gap) and relies on wall_clock_ms.
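For reviewers skimming the sample above, its shape can be read as a TypeScript type. This is only an illustrative reading of the example, not the declaration in manifest.ts: `RunManifestSketch` and `sumTokenComponents` are hypothetical names, and the field list is limited to what the sample shows.

```typescript
// Illustrative shape of the manifest fields in the sample above.
// Field names mirror the sample; the real manifest.ts may declare more.
interface TokenTotals {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  totalTokens: number;
}

interface RunManifestSketch {
  runtime: "claude" | "cursor";
  model: string;
  status: string;
  run_id: string | null;
  tokens: TokenTotals | null; // null on the cursor runtime (documented gap)
  cost_usd: number | null;
  num_turns: number | null;
  wall_clock_ms: number | null;
  collected_trace_paths: string[];
}

// Hypothetical helper: add up the four token components.
function sumTokenComponents(t: TokenTotals): number {
  return t.inputTokens + t.outputTokens + t.cacheReadTokens + t.cacheWriteTokens;
}

// The values below are copied from the sample manifest.
const sample: RunManifestSketch = {
  runtime: "claude",
  model: "claude-haiku-4-5",
  status: "finished",
  run_id: "<session-id>",
  tokens: {
    inputTokens: 33,
    outputTokens: 904,
    cacheReadTokens: 230827,
    cacheWriteTokens: 53995,
    totalTokens: 285759,
  },
  cost_usd: 0.1839242,
  num_turns: 9,
  wall_clock_ms: 16025,
  collected_trace_paths: [],
};

// In the sample, totalTokens equals the sum of the four components.
const componentsMatch =
  sample.tokens !== null &&
  sumTokenComponents(sample.tokens) === sample.tokens.totalTokens;
```

In the sample, cache reads dominate the total, which is worth keeping in mind when comparing totalTokens across runs.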

Decision

Two entangled pieces, both about recording a run faithfully:

  1. Decouple the runtime from Cursor; default to the Claude Agent SDK (TML-2759). The Cursor coupling lived behind a seam (CreateAgent / OrchestratorRun / RunOutcome) with one adapter. This adds a second adapter over Anthropic's @anthropic-ai/claude-agent-sdk and makes it the default, because: it reports per-run usage + total_cost_usd + duration_ms + num_turns on its result message (the signal Cursor's local runtime never gave us); it's the native home of the SKILL.md + subagent conventions the drive-* skills use (our skill-injection already materializes .claude/skills/, which is exactly its discovery model); and maxBudgetUsd gives a hard per-run dollar cap. Cursor stays as a selectable secondary.
  2. Fix the three run-fidelity defects the first run exposed (TML-2757). Capture agent_id from the stream status message (not the cloud-shaped outcome); capture wall_clock_ms; scope collect-run to traces emitted during the run. Tokens stay null for the cursor runtime with a documented gap (spike 2026-05-31).
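A rough sketch of the seam idea, assuming simplified shapes: the names CreateAgent, OrchestratorRun, and RunOutcome come from the description, but their signatures here are guesses. The two adapters are stubs with made-up values, `selectAdapter` is a hypothetical helper, and `wait()` is kept synchronous for brevity where the real one is presumably asynchronous.

```typescript
type Runtime = "claude" | "cursor";

// Simplified outcome: `tokens` is a single total here, not the full TokenTotals.
interface RunOutcome {
  status: string;
  agentId: string | null;
  tokens: number | null;
  costUsd: number | null;
  numTurns: number | null;
  durationMs: number | null;
}

interface OrchestratorRun {
  wait(): RunOutcome;
}

type CreateAgent = (prompt: string) => OrchestratorRun;

// Stub adapter standing in for the Claude path: reports usage, cost, and turns.
const claudeStub: CreateAgent = () => ({
  wait: () => ({
    status: "finished",
    agentId: "stub-session",
    tokens: 285759,
    costUsd: 0.1839242,
    numTurns: 9,
    durationMs: 16025,
  }),
});

// Stub adapter standing in for the Cursor path: no token, cost, or turn signal.
const cursorStub: CreateAgent = () => ({
  wait: () => ({
    status: "finished",
    agentId: "stub-agent",
    tokens: null,
    costUsd: null,
    numTurns: null,
    durationMs: 1000, // made-up value
  }),
});

const adapters: Record<Runtime, CreateAgent> = {
  claude: claudeStub,
  cursor: cursorStub,
};

// Hypothetical selector: claude is the default, cursor stays selectable.
function selectAdapter(runtime: Runtime = "claude"): CreateAgent {
  return adapters[runtime];
}
```

Because callers only see `RunOutcome`, keeping the Cursor adapter costs little: it fills the fields it can and leaves the rest null.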

How it fits together

  1. Lift the pure mappers out of the SDK-importing modules. Each runtime gets a no-SDK mapper module — sdk-events.ts (Cursor) and claude-events.ts (Claude) — so extraction logic is unit-testable with neither SDK installed, and each *-adapter.ts stays the sole importer of its SDK behind a lazy import. The live gate (only --live + the runtime's key reaches an SDK) is preserved.
  2. Capture the run signal at the seam. RunOutcome carries tokens / costUsd / numTurns / durationMs / agentId; run-one-brief prefers outcome.tokens (claude) and falls back to per-turn accumulation (cursor). RunManifest gains runtime, cost_usd, num_turns, wall_clock_ms.
  3. Select the runtime. --runtime <claude|cursor> (default claude) picks the adapter via a lazy import; the live gate keys off the runtime's env var (ANTHROPIC_API_KEY / CURSOR_API_KEY); --max-budget-usd threads to the claude adapter.
  4. Scope trace collection deterministically. prepare-run snapshots the .jsonl files present after its baseline commit, and collect-run excludes that set. This is robust to the gitignored wip/drive-trace/ location the real trace landed in (a git-diff approach would miss it).
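The snapshot-exclusion in step 4 amounts to a set difference over file paths. A minimal sketch, assuming paths are compared as plain strings; `runEmittedTraces` and the example file names are hypothetical (only the wip/drive-trace/ directory is from this PR).

```typescript
// Keep only the traces that were not present in the baseline snapshot.
function runEmittedTraces(
  found: readonly string[],
  preexisting: readonly string[],
): string[] {
  const baseline = new Set(preexisting);
  return found.filter((path) => !baseline.has(path));
}

// Hypothetical baseline: traces that already existed before the agent ran.
const preexisting = ["traces/base-1.jsonl", "traces/base-2.jsonl"];

// After the run, the same files are still there plus one new trace.
const foundAfterRun = [...preexisting, "wip/drive-trace/run.jsonl"];

const emitted = runEmittedTraces(foundAfterRun, preexisting);
```

Since the comparison never consults git, it makes no difference whether the new trace sits under a gitignored directory.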

Reviewer notes

  • Not installing @anthropic-ai/claude-agent-sdk in this PR is deliberate. The default claude path only imports it when --live and ANTHROPIC_API_KEY are both set (otherwise it dry-runs), so the default is safe without it; there's no Anthropic key available to smoke-test, and a live run is real-dollar spend the operator has asked to gate. Install + first live claude run is a one-step operator-gated follow-up (TML-2759). The adapter's mapping is unit-proven against the documented SDK shapes meanwhile.
  • tokens: null is honest, not a regression. It's now scoped to the cursor runtime; the spike (projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md) traced the whole Cursor SDK surface and found no token signal for local runs.
  • skills-contrib/ isn't a turbo workspace package, so it's gated by test:scripts + biome (as prior harness slices were), not turbo run typecheck. A pre-existing @prisma-next/target-postgres typecheck failure on main is unrelated to this diff.
  • Largest pieces to spot-check: claude-adapter.ts (the query() generator → OrchestratorRun adaptation, buffering the result message for wait()) and the run-one-brief.ts seam (token-source preference + per-runtime gating).
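The two behaviours called out for run-one-brief.ts (per-runtime gating and token-source preference) could look roughly like the following. This is a sketch under assumptions, not the code in this PR: `isLiveRun`, `resolveTokens`, and the trimmed `TokenTotals` are hypothetical; only the flag and environment variable names are taken from the description.

```typescript
type Runtime = "claude" | "cursor";
type Env = Record<string, string | undefined>;

// Trimmed for the sketch; the real totals also carry cache counters.
interface TokenTotals {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
}

// Each runtime's live gate keys off its own API-key variable.
const KEY_VAR: Record<Runtime, string> = {
  claude: "ANTHROPIC_API_KEY",
  cursor: "CURSOR_API_KEY",
};

// A run goes live only when --live was passed AND the runtime's key is set;
// otherwise the harness dry-runs and never imports the SDK.
function isLiveRun(runtime: Runtime, liveFlag: boolean, env: Env): boolean {
  const key = env[KEY_VAR[runtime]];
  return liveFlag && key !== undefined && key !== "";
}

// Prefer totals reported on the outcome (claude); fall back to totals
// accumulated per turn (cursor); null when neither source has a signal.
function resolveTokens(
  fromOutcome: TokenTotals | null,
  fromTurns: TokenTotals | null,
): TokenTotals | null {
  return fromOutcome ?? fromTurns;
}
```

Passing the environment in as a parameter, rather than reading `process.env` inside the function, is one way to keep the gate unit-testable without touching real keys.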

Verification

  • pnpm test:scripts — 616/616 pass (adds sdk-events, claude-events, and extended collect-run / prepare-run / run-one-brief / run-arm suites).
  • biome check on every touched file — clean. Transient-ID scan on the code diff — empty.

Follow-ups

  • Install @anthropic-ai/claude-agent-sdk + run the first live claude smoke run (claude-haiku-4-5) to confirm real token/cost capture end-to-end — operator-gated on a key + spend (TML-2759).
  • A non-SDK token source for the cursor runtime (admin/usage API) — out of scope (spike).

Alternatives considered

  • Stay on @cursor/sdk and source tokens post-run (spike option (a): analytics / cloud getRun by run_id). Rejected: analytics is emit-only with no token props; V1Run / RunResultMetadata carry none. Nothing to query — which is what motivated decoupling.
  • A raw-model-API thin agent loop (perfect token accounting, zero vendor lock-in). Rejected: it rebuilds subagents/tool-loop/skill-loading from scratch and measures a substrate no real harness uses (poor ecological validity) at high build cost.
  • mtime-based trace scoping. Rejected in favor of snapshot-exclusion, which is deterministic and correct for gitignored trace paths.
  • Removing the Cursor adapter. Kept as a secondary — the seam makes it ~free, and it lets us spot-check the Cursor substrate.

Checklist

  • All commits are signed off (git commit -s).
  • Change is scoped to one logical concern (faithful + decoupled harness runs).
  • Tests are updated.
  • PR title is in TML-NNNN: <sentence-case title> form.
  • Skill update: this PR is the skill update — skills-contrib/drive-judge-harness SKILL.md + KNOWN-ISSUES.md document both runtimes, the cursor token gap, and the wall-clock fallback.

wmadden added 4 commits May 31, 2026 17:51
… (TML-2757)

Move isRecord, asString, extractUsage, extractText, streamEventFromMessage
(renamed from toStreamEvent), agentIdFromMessage, and outcomeFromResult into
a new sdk-events.ts that imports nothing from @cursor/sdk. sdk-adapter.ts
imports the mappers from sdk-events.ts and remains the sole SDK importer.

agentIdFromMessage reads the snake_case agent_id present on stream status
and assistant messages (real @cursor/sdk@1.0.15 local-runtime shape).

outcomeFromResult reads id→runId, status, and durationMs from the wait()
outcome shape; degrades gracefully for non-records.
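A minimal sketch of what SDK-free mappers of this kind might look like. The function names match the commit message, but the bodies are guesses: in particular, the all-null result for non-record input is an assumption about what "degrades gracefully" means, and `CursorOutcomeSketch` is a hypothetical type.

```typescript
// Narrow an unknown value to a plain object we can index safely.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function asString(value: unknown): string | null {
  return typeof value === "string" ? value : null;
}

// Stream status and assistant messages carry a snake_case agent_id.
function agentIdFromMessage(message: unknown): string | null {
  return isRecord(message) ? asString(message["agent_id"]) : null;
}

interface CursorOutcomeSketch {
  runId: string | null;
  status: string | null;
  durationMs: number | null;
}

// Map the wait() outcome shape: id becomes runId; status and durationMs pass through.
// Assumption: anything that is not a record yields all-null fields.
function outcomeFromResult(result: unknown): CursorOutcomeSketch {
  if (!isRecord(result)) {
    return { runId: null, status: null, durationMs: null };
  }
  const duration = result["durationMs"];
  return {
    runId: asString(result["id"]),
    status: asString(result["status"]),
    durationMs: typeof duration === "number" ? duration : null,
  };
}
```

Because every input is typed `unknown`, nothing here needs the SDK's types, which is what lets the tests run with the SDK not installed.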

test/sdk-events.test.ts feeds the real captured shapes from the spike and
asserts all extraction paths — runs with @cursor/sdk not installed.

Also commits the spike artifact and the run-fidelity slice spec/plan.

Signed-off-by: Will Madden <madden@prisma.io>
…d (TML-2757)

RunOutcome gains durationMs: number | null.

sdk-adapter.ts captures agent_id from the first stream message that carries
a non-null agentIdFromMessage result (the stream is fully drained before
wait() is called, so capturedAgentId is available). outcomeFromResult threads
durationMs from the wait() outcome; both flow into the returned RunOutcome.

manifest.ts gains wall_clock_ms: number | null (populated from outcome.durationMs
on live runs, null on dry-run / startup-failed / error paths).

run-one-brief.ts: tokens is now null when no turn-ended events are observed
(local runtime emits none); a finished live run with null tokens appends the
exact note "tokens unavailable: @cursor/sdk local runtime emits no usage
events (see spike 2026-05-31)" so the corpus records the honest signal.
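The null-token behaviour described above can be sketched as two small functions. The note string is the one quoted in this commit message; everything else (`TurnUsage`, `accumulateTurns`, `notesFor`, and the trimmed totals shape) is a hypothetical simplification.

```typescript
interface TurnUsage {
  inputTokens: number;
  outputTokens: number;
}

interface TokenTotals {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
}

const TOKENS_UNAVAILABLE_NOTE =
  "tokens unavailable: @cursor/sdk local runtime emits no usage events (see spike 2026-05-31)";

// No turn-ended events means "unknown", recorded as null rather than zeros.
function accumulateTurns(turns: readonly TurnUsage[]): TokenTotals | null {
  if (turns.length === 0) return null;
  let inputTokens = 0;
  let outputTokens = 0;
  for (const turn of turns) {
    inputTokens += turn.inputTokens;
    outputTokens += turn.outputTokens;
  }
  return { inputTokens, outputTokens, totalTokens: inputTokens + outputTokens };
}

// Only a finished live run with null tokens gets the explanatory note.
function notesFor(status: string, live: boolean, tokens: TokenTotals | null): string[] {
  return live && status === "finished" && tokens === null
    ? [TOKENS_UNAVAILABLE_NOTE]
    : [];
}
```

Recording null instead of an all-zero total keeps "no signal" distinguishable from "a run that really used zero tokens" when the corpus is analysed later.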

Tests updated: durationMs: null added to all existing RunOutcome mocks;
new test "captures agent_id and wall_clock_ms from the outcome, and notes
null tokens" verifies the happy path end-to-end.

Signed-off-by: Will Madden <madden@prisma.io>
…TracePaths (TML-2757)

- Extract findJsonlFiles into trace-files.ts (new module, no circular deps)
- PreparedRun gains preexistingTracePaths: string[] — snapshot of .jsonl files
  present under runDir immediately after the baseline commit
- collectRun filters the candidate set to exclude preexistingTracePaths so only
  traces emitted during the agent run are considered
- Tests: prepare-run: two new cases (empty baseline, baseline with a .jsonl)
- Tests: collect-run: new describe block "preexistingTracePaths exclusion" with
  three cases (run-emitted only, all excluded, agent_id match over run-emitted set)
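As a hedged illustration of the pieces listed above, here is one way a recursive .jsonl finder and the snapshot filter could fit together. Only the name findJsonlFiles is from the commit; the implementation, the `listDir` helper, and the temporary-directory demo are assumptions made for the example.

```typescript
import { mkdirSync, mkdtempSync, readdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Unreadable or missing directories are treated as empty rather than fatal.
function listDir(dir: string) {
  try {
    return readdirSync(dir, { withFileTypes: true });
  } catch {
    return [];
  }
}

// Recursively collect every .jsonl file under dir, in sorted order.
function findJsonlFiles(dir: string): string[] {
  const found: string[] = [];
  for (const entry of listDir(dir)) {
    const full = join(dir, entry.name);
    if (entry.isDirectory()) {
      found.push(...findJsonlFiles(full));
    } else if (entry.isFile() && entry.name.endsWith(".jsonl")) {
      found.push(full);
    }
  }
  return found.sort();
}

// Demo in a throwaway directory: one baseline trace, then one run-emitted trace.
const runDir = mkdtempSync(join(tmpdir(), "trace-demo-"));
writeFileSync(join(runDir, "base.jsonl"), "{}\n");
const preexistingTracePaths = findJsonlFiles(runDir); // snapshot after "baseline"

mkdirSync(join(runDir, "wip", "drive-trace"), { recursive: true });
writeFileSync(join(runDir, "wip", "drive-trace", "run.jsonl"), "{}\n");
writeFileSync(join(runDir, "notes.txt"), "not a trace\n");

const baseline = new Set(preexistingTracePaths);
const emittedDuringRun = findJsonlFiles(runDir).filter((p) => !baseline.has(p));
```

In the demo, `emittedDuringRun` holds only the trace written after the snapshot, so any agent_id matching would operate on that reduced set.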

Signed-off-by: Will Madden <madden@prisma.io>
…e sdk-events test (TML-2757)

Wall-clock (durationMs) is the primary efficiency metric for local runs; tokens are null because the local @cursor/sdk runtime emits no usage events (spike-confirmed). Wire the new sdk-events test into test:scripts and drop a dead import.

Signed-off-by: Will Madden <madden@prisma.io>
@wmadden-electric wmadden-electric requested a review from a team as a code owner May 31, 2026 16:02

coderabbitai Bot commented May 31, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 13316a41-4970-455c-9e62-a4aba43cd9c0

📥 Commits

Reviewing files that changed from the base of the PR and between a6c7523 and 574146e.

⛔ Files ignored due to path filters (3)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
  • projects/drive-judge-harness/slices/claude-runtime/plan.md is excluded by !projects/**
  • projects/drive-judge-harness/slices/claude-runtime/spec.md is excluded by !projects/**
📒 Files selected for processing (15)
  • package.json
  • skills-contrib/drive-judge-harness/KNOWN-ISSUES.md
  • skills-contrib/drive-judge-harness/SKILL.md
  • skills-contrib/drive-judge-harness/claude-adapter.ts
  • skills-contrib/drive-judge-harness/claude-events.ts
  • skills-contrib/drive-judge-harness/manifest.ts
  • skills-contrib/drive-judge-harness/run-arm.ts
  • skills-contrib/drive-judge-harness/run-one-brief.ts
  • skills-contrib/drive-judge-harness/sdk-adapter.ts
  • skills-contrib/drive-judge-harness/sdk-events.ts
  • skills-contrib/drive-judge-harness/test/claude-events.test.ts
  • skills-contrib/drive-judge-harness/test/manifest.test.ts
  • skills-contrib/drive-judge-harness/test/run-arm.test.ts
  • skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts
  • skills-contrib/drive-judge-harness/test/run-one-brief.test.ts
📝 Walkthrough

Walkthrough

This PR refactors the drive-judge harness to handle local Cursor SDK runs that emit no token-usage signals. It extracts message mapping into a testable utility, filters preexisting trace files from collection, captures run duration in manifests as wall_clock_ms, and updates documentation to reflect that tokens is null for local runs.

Changes

Local runtime integration and trace management

Layer / File(s) Summary
SDK event mapping utilities and tests
skills-contrib/drive-judge-harness/sdk-events.ts, skills-contrib/drive-judge-harness/test/sdk-events.test.ts, package.json
New sdk-events.ts module exports pure type guards (isRecord, asString) and message/outcome mappers (streamEventFromMessage, agentIdFromMessage, outcomeFromResult, extractUsage, extractText) with comprehensive test coverage for valid shapes and degradation paths. Test suite added to test:scripts.
Trace file discovery utility
skills-contrib/drive-judge-harness/trace-files.ts
Exports findJsonlFiles(dir) helper for recursive .jsonl directory traversal with error handling, extracted for reuse across trace collection and preexisting-trace detection.
RunOutcome, PreparedRun, and manifest type updates
skills-contrib/drive-judge-harness/run-one-brief.ts, skills-contrib/drive-judge-harness/prepare-run.ts, skills-contrib/drive-judge-harness/manifest.ts
RunOutcome extends with durationMs: number | null, PreparedRun adds preexistingTracePaths: string[], and RunManifest.tokens docs clarify null cases for dry-run/startup-failure/no-usage scenarios.
Baseline trace snapshot in prepareRun
skills-contrib/drive-judge-harness/prepare-run.ts, skills-contrib/drive-judge-harness/test/prepare-run.test.ts
prepareRun captures all preexisting .jsonl files under runDir immediately after baseline commit via findJsonlFiles(config.runDir), exposed as preexistingTracePaths. Tests validate snapshot on empty and committed trace directories.
Trace collection filtering
skills-contrib/drive-judge-harness/collect-run.ts, skills-contrib/drive-judge-harness/test/collect-run.test.ts
collectRun filters discovered traces to exclude preexistingTracePaths, ensuring only run-emitted traces are returned and matched by agent ID. Tests verify exclusion and interaction with matching logic.
SDK adapter refactoring
skills-contrib/drive-judge-harness/sdk-adapter.ts
Delegates message/outcome mapping to sdk-events.ts utilities; removes inline toStreamEvent and adaptOutcome helpers. Captures agentId from first streamed message; wait() returns RunOutcome with both agentId and durationMs.
Run duration and null-token handling
skills-contrib/drive-judge-harness/run-one-brief.ts, skills-contrib/drive-judge-harness/test/run-one-brief.test.ts, skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts, skills-contrib/drive-judge-harness/test/run-arm.test.ts
run-one-brief.ts captures orchestrator outcome duration, writes to manifests as wall_clock_ms (including null for dry-run/error), treats tokens as null when no usage events captured, and adds "tokens unavailable" note. Tests and mocks updated across all run paths to include durationMs and verify null-token behavior.
Documentation and fixture updates
skills-contrib/drive-judge-harness/KNOWN-ISSUES.md, skills-contrib/drive-judge-harness/SKILL.md, skills-contrib/drive-judge-harness/test/manifest.test.ts
SKILL.md and KNOWN-ISSUES.md document that local SDK emits no token events, tokens: null is expected, and wall-clock duration is primary efficiency metric. Test fixtures updated to include wall_clock_ms field in manifests.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • prisma/prisma-next#641: Modifies package.json test:scripts to include additional skills-contrib/drive-judge-harness test files in the node --test suite.
  • prisma/prisma-next#656: Extends prepare-run.ts and collect-run.ts pipeline with preexisting trace tracking and filtering, building directly on this PR's trace management foundation.

Suggested reviewers

  • wmadden
  • aqrln

🐰 A tale of tokens lost in the local winds,
Wall-clock measures now where usage spins,
Traces filtered clean, preexisting swept away,
Duration captured—the harness evolves today!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 43.75% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The PR title mentions capturing agent_id and wall-clock, scoping trace collection, and documenting token gaps, which align with the documented objectives; however, it also includes a reference to 'Cursor decoupled' and mentions running the harness 'by default' which are not prominently reflected in the actual code changes shown in the diff. | Clarify whether 'run the harness by default' and 'Cursor decoupled' are reflected in the changeset or if the title should focus on the primary changes: capturing agent_id, recording wall-clock duration, scoping trace collection, and documenting token limitations. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |



github-actions Bot commented May 31, 2026

size-limit report 📦

Path Size
postgres / no-emit 135.88 KB (0%)
postgres / emit 125.59 KB (0%)
mongo / no-emit 75.69 KB (0%)
mongo / emit 70.68 KB (0%)


pkg-pr-new Bot commented May 31, 2026

@prisma-next/extension-author-tools

npm i https://pkg.pr.new/@prisma-next/extension-author-tools@657

@prisma-next/mongo-runtime

npm i https://pkg.pr.new/@prisma-next/mongo-runtime@657

@prisma-next/family-mongo

npm i https://pkg.pr.new/@prisma-next/family-mongo@657

@prisma-next/sql-runtime

npm i https://pkg.pr.new/@prisma-next/sql-runtime@657

@prisma-next/family-sql

npm i https://pkg.pr.new/@prisma-next/family-sql@657

@prisma-next/extension-arktype-json

npm i https://pkg.pr.new/@prisma-next/extension-arktype-json@657

@prisma-next/middleware-cache

npm i https://pkg.pr.new/@prisma-next/middleware-cache@657

@prisma-next/mongo

npm i https://pkg.pr.new/@prisma-next/mongo@657

@prisma-next/extension-paradedb

npm i https://pkg.pr.new/@prisma-next/extension-paradedb@657

@prisma-next/extension-pgvector

npm i https://pkg.pr.new/@prisma-next/extension-pgvector@657

@prisma-next/extension-postgis

npm i https://pkg.pr.new/@prisma-next/extension-postgis@657

@prisma-next/postgres

npm i https://pkg.pr.new/@prisma-next/postgres@657

@prisma-next/sql-orm-client

npm i https://pkg.pr.new/@prisma-next/sql-orm-client@657

@prisma-next/sqlite

npm i https://pkg.pr.new/@prisma-next/sqlite@657

@prisma-next/target-mongo

npm i https://pkg.pr.new/@prisma-next/target-mongo@657

@prisma-next/adapter-mongo

npm i https://pkg.pr.new/@prisma-next/adapter-mongo@657

@prisma-next/driver-mongo

npm i https://pkg.pr.new/@prisma-next/driver-mongo@657

@prisma-next/contract

npm i https://pkg.pr.new/@prisma-next/contract@657

@prisma-next/utils

npm i https://pkg.pr.new/@prisma-next/utils@657

@prisma-next/config

npm i https://pkg.pr.new/@prisma-next/config@657

@prisma-next/errors

npm i https://pkg.pr.new/@prisma-next/errors@657

@prisma-next/framework-components

npm i https://pkg.pr.new/@prisma-next/framework-components@657

@prisma-next/operations

npm i https://pkg.pr.new/@prisma-next/operations@657

@prisma-next/ts-render

npm i https://pkg.pr.new/@prisma-next/ts-render@657

@prisma-next/contract-authoring

npm i https://pkg.pr.new/@prisma-next/contract-authoring@657

@prisma-next/ids

npm i https://pkg.pr.new/@prisma-next/ids@657

@prisma-next/psl-parser

npm i https://pkg.pr.new/@prisma-next/psl-parser@657

@prisma-next/psl-printer

npm i https://pkg.pr.new/@prisma-next/psl-printer@657

@prisma-next/cli

npm i https://pkg.pr.new/@prisma-next/cli@657

@prisma-next/cli-telemetry

npm i https://pkg.pr.new/@prisma-next/cli-telemetry@657

@prisma-next/emitter

npm i https://pkg.pr.new/@prisma-next/emitter@657

@prisma-next/migration-tools

npm i https://pkg.pr.new/@prisma-next/migration-tools@657

prisma-next

npm i https://pkg.pr.new/prisma-next@657

@prisma-next/vite-plugin-contract-emit

npm i https://pkg.pr.new/@prisma-next/vite-plugin-contract-emit@657

@prisma-next/mongo-codec

npm i https://pkg.pr.new/@prisma-next/mongo-codec@657

@prisma-next/mongo-contract

npm i https://pkg.pr.new/@prisma-next/mongo-contract@657

@prisma-next/mongo-value

npm i https://pkg.pr.new/@prisma-next/mongo-value@657

@prisma-next/mongo-contract-psl

npm i https://pkg.pr.new/@prisma-next/mongo-contract-psl@657

@prisma-next/mongo-contract-ts

npm i https://pkg.pr.new/@prisma-next/mongo-contract-ts@657

@prisma-next/mongo-emitter

npm i https://pkg.pr.new/@prisma-next/mongo-emitter@657

@prisma-next/mongo-schema-ir

npm i https://pkg.pr.new/@prisma-next/mongo-schema-ir@657

@prisma-next/mongo-query-ast

npm i https://pkg.pr.new/@prisma-next/mongo-query-ast@657

@prisma-next/mongo-orm

npm i https://pkg.pr.new/@prisma-next/mongo-orm@657

@prisma-next/mongo-query-builder

npm i https://pkg.pr.new/@prisma-next/mongo-query-builder@657

@prisma-next/mongo-lowering

npm i https://pkg.pr.new/@prisma-next/mongo-lowering@657

@prisma-next/mongo-wire

npm i https://pkg.pr.new/@prisma-next/mongo-wire@657

@prisma-next/sql-contract

npm i https://pkg.pr.new/@prisma-next/sql-contract@657

@prisma-next/sql-errors

npm i https://pkg.pr.new/@prisma-next/sql-errors@657

@prisma-next/sql-operations

npm i https://pkg.pr.new/@prisma-next/sql-operations@657

@prisma-next/sql-schema-ir

npm i https://pkg.pr.new/@prisma-next/sql-schema-ir@657

@prisma-next/sql-contract-psl

npm i https://pkg.pr.new/@prisma-next/sql-contract-psl@657

@prisma-next/sql-contract-ts

npm i https://pkg.pr.new/@prisma-next/sql-contract-ts@657

@prisma-next/sql-contract-emitter

npm i https://pkg.pr.new/@prisma-next/sql-contract-emitter@657

@prisma-next/sql-lane-query-builder

npm i https://pkg.pr.new/@prisma-next/sql-lane-query-builder@657

@prisma-next/sql-relational-core

npm i https://pkg.pr.new/@prisma-next/sql-relational-core@657

@prisma-next/sql-builder

npm i https://pkg.pr.new/@prisma-next/sql-builder@657

@prisma-next/target-postgres

npm i https://pkg.pr.new/@prisma-next/target-postgres@657

@prisma-next/target-sqlite

npm i https://pkg.pr.new/@prisma-next/target-sqlite@657

@prisma-next/adapter-postgres

npm i https://pkg.pr.new/@prisma-next/adapter-postgres@657

@prisma-next/adapter-sqlite

npm i https://pkg.pr.new/@prisma-next/adapter-sqlite@657

@prisma-next/driver-postgres

npm i https://pkg.pr.new/@prisma-next/driver-postgres@657

@prisma-next/driver-sqlite

npm i https://pkg.pr.new/@prisma-next/driver-sqlite@657

commit: 574146e


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@skills-contrib/drive-judge-harness/KNOWN-ISSUES.md`:
- Line 60: Replace the transient spike path reference in KNOWN-ISSUES.md (the
"projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md"
mention) with a stable link to a durable doc or a short inline summary of the
finding; ensure the note about the probe against `@cursor/sdk@1.0.15` remains but
either link to a long-lived documentation page/section or paste a one- or
two-sentence summary of the spike result so the entry does not depend on a
projects/ spike artifact that may move.

In `@skills-contrib/drive-judge-harness/SKILL.md`:
- Around line 175-176: Replace the transient spike file reference
`projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md` in
the sentence that mentions `tokens` with a stable reference: either a
KNOWN-ISSUES anchor (e.g., `KNOWN-ISSUES.md` or `#sdk-token-usage`) or a
one-sentence embedded summary of the spike's relevant finding; update the prose
where `tokens` is described so it reads with the stable link/summary instead of
the `projects/...` path.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: cf4b5d80-69f4-4ae0-ac50-89746f307427

📥 Commits

Reviewing files that changed from the base of the PR and between f779815 and a6c7523.

⛔ Files ignored due to path filters (3)
  • projects/drive-judge-harness/slices/run-fidelity/plan.md is excluded by !projects/**
  • projects/drive-judge-harness/slices/run-fidelity/spec.md is excluded by !projects/**
  • projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md is excluded by !projects/**
📒 Files selected for processing (17)
  • package.json
  • skills-contrib/drive-judge-harness/KNOWN-ISSUES.md
  • skills-contrib/drive-judge-harness/SKILL.md
  • skills-contrib/drive-judge-harness/collect-run.ts
  • skills-contrib/drive-judge-harness/manifest.ts
  • skills-contrib/drive-judge-harness/prepare-run.ts
  • skills-contrib/drive-judge-harness/run-one-brief.ts
  • skills-contrib/drive-judge-harness/sdk-adapter.ts
  • skills-contrib/drive-judge-harness/sdk-events.ts
  • skills-contrib/drive-judge-harness/test/collect-run.test.ts
  • skills-contrib/drive-judge-harness/test/manifest.test.ts
  • skills-contrib/drive-judge-harness/test/prepare-run.test.ts
  • skills-contrib/drive-judge-harness/test/run-arm.test.ts
  • skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts
  • skills-contrib/drive-judge-harness/test/run-one-brief.test.ts
  • skills-contrib/drive-judge-harness/test/sdk-events.test.ts
  • skills-contrib/drive-judge-harness/trace-files.ts

Comment thread skills-contrib/drive-judge-harness/KNOWN-ISSUES.md Outdated
Comment thread skills-contrib/drive-judge-harness/SKILL.md Outdated
wmadden added 4 commits May 31, 2026 18:12
…ecoupling (TML-2759)

Signed-off-by: Will Madden <madden@prisma.io>
Adds claude-events.ts (zero SDK imports) with usageFromAssistant,
streamEventFromMessage, and outcomeFromResult over unknown inputs.
Maps Claude snake_case fields to harness camelCase:
cache_creation_input_tokens -> cacheWriteTokens,
cache_read_input_tokens -> cacheReadTokens, session_id -> runId.
outcomeFromResult builds TokenTotals via accumulateUsage and returns
status/runId/tokens/durationMs/costUsd/numTurns or null for non-result
messages.

Fully unit-tested in test/claude-events.test.ts with real success and
error_max_turns fixtures. All 17 tests pass with the SDK absent.
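To make the snake_case to camelCase mapping concrete, here is a small sketch of a result-message mapper. It is not the code in claude-events.ts: the status mapping ("success" to "finished", anything else to "error") is an assumption, `tokensFromUsage` is a hypothetical stand-in for the accumulateUsage step, and the fixture reuses the numbers from the sample manifest in the PR description.

```typescript
interface TokenTotals {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  totalTokens: number;
}

interface ClaudeOutcomeSketch {
  status: string;
  runId: string | null;
  tokens: TokenTotals;
  durationMs: number | null;
  costUsd: number | null;
  numTurns: number | null;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

const numberOr = <T>(value: unknown, fallback: T): number | T =>
  typeof value === "number" ? value : fallback;

// Map the snake_case usage block to camelCase totals; missing counters count as 0.
function tokensFromUsage(usage: unknown): TokenTotals {
  const u: Record<string, unknown> = isRecord(usage) ? usage : {};
  const inputTokens = numberOr(u["input_tokens"], 0);
  const outputTokens = numberOr(u["output_tokens"], 0);
  const cacheReadTokens = numberOr(u["cache_read_input_tokens"], 0);
  const cacheWriteTokens = numberOr(u["cache_creation_input_tokens"], 0);
  const totalTokens = inputTokens + outputTokens + cacheReadTokens + cacheWriteTokens;
  return { inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens, totalTokens };
}

// Returns null for anything that is not a terminal result message.
function outcomeFromResult(message: unknown): ClaudeOutcomeSketch | null {
  if (!isRecord(message) || message["type"] !== "result") return null;
  const sessionId = message["session_id"];
  return {
    status: message["subtype"] === "success" ? "finished" : "error", // assumed mapping
    runId: typeof sessionId === "string" ? sessionId : null,
    tokens: tokensFromUsage(message["usage"]),
    durationMs: numberOr(message["duration_ms"], null),
    costUsd: numberOr(message["total_cost_usd"], null),
    numTurns: numberOr(message["num_turns"], null),
  };
}

// Fixture built from the numbers in the sample manifest.
const resultMessage = {
  type: "result",
  subtype: "success",
  session_id: "<session-id>",
  duration_ms: 16025,
  num_turns: 9,
  total_cost_usd: 0.1839242,
  usage: {
    input_tokens: 33,
    output_tokens: 904,
    cache_read_input_tokens: 230827,
    cache_creation_input_tokens: 53995,
  },
};

const mapped = outcomeFromResult(resultMessage);
```

With this fixture the mapped totals add up to the 285759 shown in the sample manifest, which suggests the harness total includes both cache counters.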

Signed-off-by: Will Madden <madden@prisma.io>
…me/RunManifest

- claude-adapter.ts: CreateAgent over query(), lazy-imports SDK, requires
  ANTHROPIC_API_KEY, captures terminal result message for wait()
- RunOutcome gains tokens/costUsd/numTurns; run-one-brief prefers
  outcome.tokens, falls back to per-turn accumulation (Cursor path)
- RunManifest gains runtime/cost_usd/num_turns; all manifest writes set them
- RunOneBriefConfig/RunArmConfig gain runtime (default "claude") and
  optional maxBudgetUsd; key gate uses ANTHROPIC_API_KEY or CURSOR_API_KEY
  based on runtime selection
- sdk-adapter wait() returns tokens:null/costUsd:null/numTurns:null
- CLIs gain --runtime <claude|cursor> and --max-budget-usd <n>
- Tests updated: all configs gain runtime field, outcomes gain new null
  fields, new tests assert Claude-shaped outcome populates manifest
  runtime/cost_usd/num_turns/wall_clock_ms correctly
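The new CLI flags could be parsed along these lines. This is a sketch using Node's built-in `parseArgs` from `node:util`, not the harness's actual CLI code; `parseCli`, `CliOptions`, and the validation rules (for example, rejecting non-positive budgets) are assumptions.

```typescript
import { parseArgs } from "node:util";

type Runtime = "claude" | "cursor";

interface CliOptions {
  runtime: Runtime;
  live: boolean;
  maxBudgetUsd: number | null;
}

function isRuntime(value: unknown): value is Runtime {
  return value === "claude" || value === "cursor";
}

// Hypothetical parser for --runtime <claude|cursor>, --live, --max-budget-usd <n>.
function parseCli(args: string[]): CliOptions {
  const { values } = parseArgs({
    args,
    options: {
      runtime: { type: "string" },
      live: { type: "boolean" },
      "max-budget-usd": { type: "string" },
    },
  });

  // Default to the claude runtime when the flag is absent.
  const runtime: unknown = values.runtime ?? "claude";
  if (!isRuntime(runtime)) {
    throw new Error(`unknown --runtime: ${String(runtime)}`);
  }

  const rawBudget: unknown = values["max-budget-usd"];
  const maxBudgetUsd = typeof rawBudget === "string" ? Number(rawBudget) : null;
  // !(x > 0) also rejects NaN from unparseable input.
  if (maxBudgetUsd !== null && !(maxBudgetUsd > 0)) {
    throw new Error(`--max-budget-usd must be a positive number, got: ${rawBudget}`);
  }

  return { runtime, live: values.live === true, maxBudgetUsd };
}
```

For example, `parseCli(["--live", "--max-budget-usd", "0.5"])` yields the claude runtime with a 0.5 dollar cap, while `parseCli(["--runtime", "cursor"])` selects the secondary adapter with no cap.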

Signed-off-by: Will Madden <madden@prisma.io>
…ude-events test (TML-2759)

Claude Agent SDK is the default runtime (native tokens/cost/wall-clock/turns); Cursor is selectable via --runtime cursor and keeps its documented local token gap. Install of @anthropic-ai/claude-agent-sdk + first live claude run is an operator-gated follow-up (needs ANTHROPIC_API_KEY + authorized spend).

Signed-off-by: Will Madden <madden@prisma.io>
@wmadden-electric wmadden-electric changed the title from "TML-2757: capture agent_id + wall-clock, scope trace collection, document token gap" to "TML-2759: run the harness on the Claude Agent SDK by default (Cursor decoupled) + faithful run recording" on May 31, 2026
wmadden added 2 commits June 1, 2026 11:12
…, drop transient spike path

Durable skill docs/code must not link to projects/ artifacts that get deleted at close-out. The spike finding already lives in KNOWN-ISSUES.md section 2; reference that instead.

Signed-off-by: Will Madden <madden@prisma.io>
… default-runtime dependency

The claude runtime is the harness default; declare its SDK so the lazy import resolves without an ad-hoc install.

Signed-off-by: Will Madden <madden@prisma.io>

@wmadden wmadden left a comment

Blocked until I test-run this end to end.


2 participants