TML-2759: run the harness on the Claude Agent SDK by default (Cursor decoupled) + faithful run recording #657
Closed
10 commits
ee05d2b (wmadden) feat(drive-judge-harness): D1 extract pure mappers into sdk-events.ts…
025209f (wmadden) feat(drive-judge-harness): D2 capture agent_id + wall-clock end-to-en…
7961a44 (wmadden) feat(drive-judge-harness): D3 collect-run run-scoping via preexisting…
a6c7523 (wmadden) docs(drive-judge-harness): document the local-runtime token gap + wir…
a00ae29 (wmadden) docs(drive-judge-harness): spec + plan for Claude Agent SDK runtime d…
64dcb73 (wmadden) feat(harness): add claude-events.ts pure mappers for Claude SDK shapes
f6286c1 (wmadden) feat(harness): add Claude adapter as default runtime, extend RunOutco…
ce8a1c3 (wmadden) docs(drive-judge-harness): document claude/cursor runtimes + wire cla…
0864b47 (wmadden) docs(drive-judge-harness): point token-gap references to KNOWN-ISSUES…
574146e (wmadden) build(drive-judge-harness): add @anthropic-ai/claude-agent-sdk as the…
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| # Plan: run-fidelity (TML-2757) | ||
|
|
||
| Test-first throughout. The live SDK is reached only via `sdk-adapter.ts`'s dynamic import; all new logic lives in no-SDK-import modules so it's unit-testable with `@cursor/sdk` absent. Spike `2026-05-31-sdk-token-usage-retrieval.md` is committed in dispatch 1. | ||
|
|
||
| ## Dispatches | ||
|
|
||
| ### D1 — `sdk-events.ts`: pure mappers + real-shape extraction (test-first) | ||
| - **Outcome:** message/outcome mapping lives in a no-SDK module, with `agent_id` and `durationMs` extracted from the **real captured shapes**. | ||
| - Move `extractText` / `toStreamEvent` / `adaptOutcome` (and the now-dead `extractUsage`) out of `sdk-adapter.ts` into new `sdk-events.ts` (imports nothing from the SDK; operates over `unknown`). Add `agentIdFromMessage`, `outcomeFromResult` (→ `{status,runId,durationMs}`), `streamEventFromMessage`. | ||
| - Tests (`test/sdk-events.test.ts`): feed the real `status`/`assistant`/outcome fixtures from the spike; assert `agent_id`, `durationMs`, stream mapping. Runs with the SDK uninstalled. | ||
| - `sdk-adapter.ts` imports the mappers (no behaviour change). | ||
| - Commit the spike artifact here. | ||
| - **Builds on:** merged run-setup. **Hands to:** D2. | ||
|
|
||
| ### D2 — capture agent_id + wall-clock end-to-end (test-first) | ||
| - **Outcome:** a finished run records the real `agent_id` and `wall_clock_ms`. | ||
| - `run-one-brief.ts`: `RunOutcome` gains `durationMs: number | null`; adapter captures `agent_id` from the first stream message carrying one and returns it from `wait()`. | ||
| - `manifest.ts`: add `wall_clock_ms`; add the token-unavailable note when `tokens` is null on a finished live run. `run-arm.ts` threads `wall_clock_ms` into the enriched manifest. | ||
| - Tests: outcome→manifest mapping populates `agent_id` + `wall_clock_ms`; null-token note present. | ||
| - **Builds on:** D1. **Hands to:** D3. | ||
|
|
||
| ### D3 — `collect-run` run-scoping (test-first) | ||
| - **Outcome:** `collectRun` returns only traces emitted during the run. | ||
| - `prepare-run.ts`: snapshot `*.jsonl` under `runDir` after the baseline commit → `PreparedRun.preexistingTracePaths`. | ||
| - `collect-run.ts`: exclude `preexistingTracePaths`; `agent_id` match over the remainder. | ||
| - Tests: baseline-committed trace + run-emitted trace → only the latter returned (cover a gitignored-path trace). | ||
| - **Builds on:** D2. **Hands to:** D4. | ||
|
|
||
| ### D4 — docs + gates + PR | ||
| - **Outcome:** token gap documented; suite green; PR open. | ||
| - SKILL.md / KNOWN-ISSUES: token gap (link spike) + wall-clock-as-primary note. | ||
| - Wire new tests into `test:scripts`; run `pnpm -w typecheck`, `pnpm -w lint`, `pnpm -w test:scripts`; fix fallout. | ||
| - Stage explicitly, sign off, push to `tml-2757-run-fidelity`, open PR (create-pr skill). | ||
| - **Builds on:** D3. | ||
|
|
||
| ## Sequencing | ||
| Serial. D1 unlocks testability; D2 consumes the extractors; D3 is logically independent of D2 but shares a manifest touch with it, so it is sequenced after D2 to avoid conflicts; D4 closes. Target: 4 dispatches.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,76 @@ | ||
| # Slice: run-fidelity | ||
|
|
||
| _Parent project `projects/drive-judge-harness/`. Outcome this slice contributes: the harness records a **faithful** run — correct `agent_id`, a real wall-clock signal, and a trace set scoped to what the run actually emitted — so the corpus the judge calibrates against and the A/B engine ranks on isn't polluted or blank. Fixes the three fidelity defects the first live `run-arm` exposed._ | ||
|
|
||
| ## At a glance | ||
|
|
||
| The first live run (composer-2.5, i12-halt) proved the pipeline but mis-recorded the run: `agent_id: null`, `tokens` all-zero, and `collected_trace_paths` containing 5 pre-existing committed traces from the base checkout plus 1 real one. This slice fixes the recordable defects and honestly documents the one that isn't recordable: | ||
|
|
||
| - **`agent_id`** is read from the stream `status` message (where the local runtime actually puts it), not the `wait()` outcome. | ||
| - **Wall-clock** (`durationMs` from the outcome) is captured as `wall_clock_ms` — the primary Tier-2 efficiency metric, since tokens are unavailable. | ||
| - **`collect-run`** returns only traces *emitted during the run*, not every schema-valid `.jsonl` in the checkout. | ||
| - **Tokens** stay `null` for local runs with an explicit note + documented SDK limitation (spike `2026-05-31-sdk-token-usage-retrieval.md`). | ||
|
|
||
| ## Chosen design | ||
|
|
||
| Ground-truth shapes from the spike probe (`@cursor/sdk@1.0.15`, local runtime): | ||
| - stream `status` → `{ type:"status", agent_id, run_id, status }` | ||
| - stream `assistant` → `{ type:"assistant", agent_id, run_id, message }` | ||
| - outcome (`wait()`) → `{ id, status, result, model, durationMs }` (no `agent_id`, no tokens) | ||
|
|
||
| ### 1. `sdk-events.ts` — extract the pure mappers (no SDK import) | ||
|
|
||
| Today the message/outcome mappers (`extractUsage`, `extractText`, `adaptOutcome`, `toStreamEvent`) live inside `sdk-adapter.ts`, which `import`s `@cursor/sdk` at module top — so they can't be unit-tested without the SDK installed. Move them into a new **`sdk-events.ts`** that imports nothing from the SDK and operates over `unknown`. `sdk-adapter.ts` imports them. This is what lets the fixes be test-first while preserving the live-execution gate (SDK reached only via `sdk-adapter.ts`'s dynamic import). | ||
|
|
||
| `sdk-events.ts` exports pure functions, unit-tested against the **real captured shapes**: | ||
| - `streamEventFromMessage(msg) -> RunStreamEvent` — maps `status`/`assistant` (real shapes) and keeps the `turn-ended` branch for the cloud runtime (still valid if ever used). | ||
| - `agentIdFromMessage(msg) -> string | null` — reads snake_case `agent_id`. | ||
| - `outcomeFromResult(raw) -> { status, runId, durationMs }` — reads `id`→runId, `status`, `durationMs` (number|null). | ||
|
|
||
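A minimal sketch of what two of these pure mappers might look like, assuming only the captured shapes listed under "Chosen design". The helper `asRecord`, the `OutcomeSummary` type name, and the choice to type `status` as `string | null` are illustrative assumptions, not the harness's actual code.

```typescript
// Hypothetical sketch: field names follow the captured shapes in this spec;
// everything else (helper names, return typing) is assumed for illustration.

type OutcomeSummary = {
  status: string | null;
  runId: string | null;
  durationMs: number | null;
};

// Narrow an unknown value to a plain object so its fields can be read safely.
function asRecord(value: unknown): Record<string, unknown> | null {
  return typeof value === "object" && value !== null && !Array.isArray(value)
    ? (value as Record<string, unknown>)
    : null;
}

// Read the snake_case `agent_id` carried by stream `status`/`assistant` messages.
function agentIdFromMessage(msg: unknown): string | null {
  const rec = asRecord(msg);
  const id = rec === null ? undefined : rec["agent_id"];
  return typeof id === "string" && id.length > 0 ? id : null;
}

// Map the `wait()` outcome: `id` becomes `runId`; `durationMs` is a number or null.
function outcomeFromResult(raw: unknown): OutcomeSummary {
  const rec = asRecord(raw) ?? {};
  const status = rec["status"];
  const id = rec["id"];
  const durationMs = rec["durationMs"];
  return {
    status: typeof status === "string" ? status : null,
    runId: typeof id === "string" ? id : null,
    durationMs:
      typeof durationMs === "number" && Number.isFinite(durationMs) ? durationMs : null,
  };
}
```

Because both functions take `unknown` and import nothing, a test can feed them fixture objects directly, which is the property this section relies on.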
| ### 2. `run-one-brief.ts` — capture agent_id + wall-clock | ||
|
|
||
| `RunOutcome` gains `durationMs: number | null`. The adapter captures `agent_id` from the **first stream message that carries one** (run-one-brief drains the stream before `wait()`, so it's available), and `wait()` returns it as `agentId`. `durationMs` flows from `outcomeFromResult`. No behaviour change to the dry-run/gate paths. | ||
|
|
||
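The "first stream message that carries one" rule can be sketched as a small reducer applied while draining the stream. This is an illustration under assumptions: `captureAgentId` and `drainForAgentId` are hypothetical names, and `AsyncIterable<unknown>` stands in for whatever stream type the adapter really exposes.

```typescript
// Hypothetical sketch of first-wins agent_id capture; not the adapter's actual code.

// Keep the first non-null agent id seen; later messages never overwrite it.
function captureAgentId(current: string | null, msg: unknown): string | null {
  if (current !== null) return current;
  if (typeof msg !== "object" || msg === null) return null;
  const id = (msg as Record<string, unknown>)["agent_id"];
  return typeof id === "string" && id.length > 0 ? id : null;
}

// Drain a message stream to completion and return the captured id.
// The stream type here is an assumption standing in for the real one.
async function drainForAgentId(stream: AsyncIterable<unknown>): Promise<string | null> {
  let agentId: string | null = null;
  for await (const msg of stream) {
    agentId = captureAgentId(agentId, msg);
  }
  return agentId;
}
```

Since the stream is drained before `wait()` is called, the captured value is already settled by the time the outcome is assembled.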
| ### 3. `manifest.ts` — wall-clock + honest token note | ||
|
|
||
| Add `wall_clock_ms: number | null` (from `durationMs`). When `tokens` is `null` on a *finished live* run, append a note: `"tokens unavailable: @cursor/sdk local runtime emits no usage events (see spike 2026-05-31)"`. `tokens` field stays (null for local). | ||
|
|
||
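As a rough illustration of the manifest rule above, the following sketch derives `wall_clock_ms` and appends the token note only for a finished live run with `null` tokens. The `RunRecord`, `TokenUsage`, and `ManifestFidelity` types and the `fidelityFields` function are hypothetical; only the field names `wall_clock_ms` and `tokens` and the note text come from this spec.

```typescript
// Hypothetical sketch; the input/output types are assumptions for illustration.

type TokenUsage = { inputTokens: number; outputTokens: number };

type RunRecord = {
  finished: boolean;
  live: boolean; // false for dry-run/gate paths
  durationMs: number | null;
  tokens: TokenUsage | null;
};

type ManifestFidelity = {
  wall_clock_ms: number | null;
  tokens: TokenUsage | null;
  notes: string[];
};

// Note text as quoted in the spec.
const TOKENS_UNAVAILABLE_NOTE =
  "tokens unavailable: @cursor/sdk local runtime emits no usage events (see spike 2026-05-31)";

function fidelityFields(run: RunRecord, existingNotes: readonly string[] = []): ManifestFidelity {
  const notes = [...existingNotes];
  // Only a finished live run with no tokens gets the explanatory note.
  if (run.finished && run.live && run.tokens === null) {
    notes.push(TOKENS_UNAVAILABLE_NOTE);
  }
  return { wall_clock_ms: run.durationMs, tokens: run.tokens, notes };
}
```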
| ### 4. `collect-run.ts` — scope to run-emitted traces | ||
|
|
||
| `PreparedRun` gains `preexistingTracePaths: string[]` — the set of `*.jsonl` present under `runDir` immediately after `prepareRun`'s baseline commit (the base checkout's committed traces). `collectRun` excludes that set, so `tracePaths` contains only traces the run produced. This is deterministic (no mtime/clock reliance) and robust to gitignored trace locations (e.g. `wip/drive-trace/`, where the real trace landed). `agent_id` matching then runs over the run-emitted set only. | ||
|
|
||
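The snapshot-exclusion idea reduces to a set difference over paths. A minimal sketch, assuming both lists hold paths normalized the same way (for example, both absolute or both relative to `runDir`); the function name `runEmittedTraces` is hypothetical.

```typescript
// Hypothetical sketch of snapshot exclusion: no mtime or clock is consulted.
// Assumes both inputs use the same path normalization.
function runEmittedTraces(
  allTracePaths: readonly string[],
  preexistingTracePaths: readonly string[],
): string[] {
  const preexisting = new Set(preexistingTracePaths);
  // Keep only traces that were not present in the post-baseline snapshot.
  return allTracePaths.filter((p) => !preexisting.has(p));
}
```

Because the comparison is on paths alone, a trace under a gitignored directory such as `wip/drive-trace/` is treated the same as a tracked one, which a git-diff based approach would miss.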
| ## Coherence rationale | ||
|
|
||
| One reviewer can hold it in one sitting: every change serves "record the run faithfully," and the changes are entangled. The `agent_id` fix is what makes `collect-run`'s matching work, the mapper extraction is what makes both testable, and the wall-clock capture is the efficiency metric that stands in for the tokens the SDK won't give us. It rolls back as one unit (one new pure module + additive manifest/outcome fields + a `collect-run` scoping change) and touches no production package. | ||
|
|
||
| ## Scope | ||
|
|
||
| **In:** new `sdk-events.ts` (+ tests with real-shape fixtures); `sdk-adapter.ts` (import the mappers, capture stream `agent_id`); `run-one-brief.ts` (`RunOutcome.durationMs`, agent_id wiring); `manifest.ts` (`wall_clock_ms` + token note); `collect-run.ts` + `prepare-run.ts` (`preexistingTracePaths` snapshot + exclusion); `run-arm.ts` (thread `wall_clock_ms` into the enriched manifest); the spike artifact; SKILL.md / KNOWN-ISSUES note on the token gap; new tests wired into `test:scripts`. | ||
|
|
||
| **Out:** a non-SDK token source (Cursor admin/usage API, CLI telemetry) — deferred, out of scope (spike decision). The k=N A/B loop, aggregation, CI gate — TML-2737. The judge — TML-2736. | ||
|
|
||
| ## Pre-investigated edge cases | ||
|
|
||
| | Edge case | Disposition | Notes | | ||
| |---|---|---| | ||
| | Local runtime emits no usage event | Documented, not fixed | Confirmed by spike; `tokens: null` + note is the honest record. | | ||
| | Real trace landed in gitignored `wip/drive-trace/` | Drove the design | Snapshot-exclusion (not git-diff) is why scoping works for gitignored traces. | | ||
| | `agent_id` present on stream but not outcome | Core of the fix | Capture from the stream message, not `wait()`. | | ||
| | Multiple run-emitted traces remain after exclusion | Matching handles it | `agent_id` match, else newest, over the run-emitted set. | | ||
|
|
||
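The last row's "`agent_id` match, else newest" rule could be sketched as below. The spec does not say how "newest" is determined, so the `emittedAt` field, the `TraceCandidate` type, and the `pickTrace` name are all assumptions made for illustration.

```typescript
// Hypothetical sketch; `emittedAt` is an assumed ordering key, not a documented field.
type TraceCandidate = {
  path: string;
  agentId: string | null;
  emittedAt: number; // larger means newer
};

// Prefer a trace whose agent id matches the run; otherwise fall back to the newest.
function pickTrace(
  candidates: readonly TraceCandidate[],
  agentId: string | null,
): TraceCandidate | null {
  const newestFirst = [...candidates].sort((a, b) => b.emittedAt - a.emittedAt);
  const matched =
    agentId === null ? undefined : newestFirst.find((c) => c.agentId === agentId);
  return matched ?? newestFirst[0] ?? null;
}
```

The input is expected to be the run-emitted set only, so pre-existing traces can never win the fallback.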
| ## Slice-specific done conditions | ||
|
|
||
| - [ ] A test feeds the **real captured** `status`/`assistant`/outcome shapes (from the spike) through `sdk-events.ts` and asserts `agent_id` + `durationMs` extraction — with `@cursor/sdk` not installed. | ||
| - [ ] A `collect-run` test with a baseline-committed trace + a run-emitted trace asserts only the latter is returned. | ||
|
|
||
| ## Open Questions | ||
|
|
||
| 1. **Snapshot `preexistingTracePaths` in `prepare-run` vs re-scan in `collect-run`?** Working position: snapshot in `prepare-run` (deterministic, captures the exact pre-run state) and pass it through `PreparedRun`. Re-scanning in `collect-run` would race any late base writes. | ||
|
|
||
| ## References | ||
|
|
||
| - Parent project: `projects/drive-judge-harness/spec.md` | ||
| - Spike: `projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md` | ||
| - Linear: [TML-2757](https://linear.app/prisma-company/issue/TML-2757) (blocks TML-2737) | ||
| - Surfaces: `skills-contrib/drive-judge-harness/{sdk-adapter,run-one-brief,manifest,collect-run,prepare-run,run-arm}.ts` | ||
| - First-run evidence: manifest at `run-arm-i12-…/run-manifest.json` (agent_id null, tokens 0, polluted trace list) |
26 changes: 26 additions & 0 deletions
projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| # Spike: can per-run token usage be retrieved from `@cursor/sdk` for a local run? | ||
|
|
||
| **Date:** 2026-05-31 · **Trigger:** the first live `run-arm` (composer-2.5, i12-halt) returned `tokens: {all zero}`. **Question:** is the token signal — our stated #1 efficiency metric after correctness — obtainable from the SDK for a *local-runtime* run, via the stream, the run outcome, the `analytics` surface, or the cloud-API `getRun`? | ||
|
|
||
| ## Answer | ||
|
|
||
| **No. Token usage is not retrievable via the `@cursor/sdk` public surface for local runs, by any path.** Wall-clock (`durationMs`) is available and becomes the primary efficiency metric; `tokens` is honestly `null` from the runtime. | ||
|
|
||
| ## Evidence (`@cursor/sdk@1.0.15`) | ||
|
|
||
| A throwaway probe spawned a trivial local run and dumped every stream message + the `wait()` outcome: | ||
|
|
||
| - **Stream messages** — only two types across the whole run: `status` `{ type, agent_id, run_id, status }` and `assistant` `{ type, agent_id, run_id, message }`. **No `turnEnded` / `usage` event is emitted by the local runtime.** (The SDK *does* define a `usage: { inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens }` schema, but it rides on a `turnEnded` update that only the **cloud** runtime streams.) | ||
| - **Run outcome** (`wait()`) — `{ id, status, result, model, durationMs }`. Carries wall-clock (`durationMs`), no tokens. `agent_id` is **not** here; it is on the stream messages. | ||
| - **`analytics.d.ts`** — emit-only outbound telemetry (`trackSdkRunCreated/Completed/SendLatency`, `flushSdkAnalytics`). No read-back API. The event props (`SdkRunCreatedProps`, `SdkRunCompletedProps`, `SdkRunSendLatencyProps`) carry `turn_count`, latency, `end_reason` — **no token counts**. | ||
| - **`cloud-api-client` `getRun({agentId,runId}) → V1Run`** — `{ id, agentId, status, createdAt, updatedAt, durationMs?, result?, git? }`. **No tokens.** `RunResultMetadata` and `executor-types.d.ts` have zero token/usage/cost fields. (Also a cloud-agent query; a local run is not necessarily registered there.) | ||
|
|
||
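The kind of check the probe performed can be approximated offline over a dump of captured messages. This is a sketch only: it does not call the SDK, `summarizeProbe` is a hypothetical name, and treating a `usage` key or a `turnEnded` type as the token signal is an assumption based on the schema described above.

```typescript
// Hypothetical sketch: tallies message types in a captured dump and flags any usage signal.
type ProbeSummary = {
  typeCounts: Record<string, number>;
  sawUsage: boolean;
};

function summarizeProbe(messages: readonly unknown[]): ProbeSummary {
  const typeCounts: Record<string, number> = {};
  let sawUsage = false;
  for (const msg of messages) {
    if (typeof msg !== "object" || msg === null) continue;
    const rec = msg as Record<string, unknown>;
    const rawType = rec["type"];
    const type = typeof rawType === "string" ? rawType : "(untyped)";
    typeCounts[type] = (typeCounts[type] ?? 0) + 1;
    // Assumed token signals: a `usage` field or a `turnEnded` update.
    if ("usage" in rec || type === "turnEnded") sawUsage = true;
  }
  return { typeCounts, sawUsage };
}
```

For a dump containing only `status` and `assistant` messages, `sawUsage` stays `false`, which is the shape of evidence the bullets above describe.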
| ## Decision (re-route) | ||
|
|
||
| Proceed on **option (d)**: | ||
|
|
||
| - Capture `durationMs` (wall-clock) from the run outcome → the primary Tier-2 efficiency metric for local runs. | ||
| - `tokens` stays `null` for local runs, with an explicit manifest note + a documented SDK limitation (consumption gotcha). Not a bug in our extraction — there is nothing to extract. | ||
| - A future token source must come from **outside** the SDK (a Cursor admin/usage API, or CLI-internal telemetry). Out of scope for the fix slice. | ||
|
|
||
| Companion clean fixes (same slice): capture `agent_id` from the stream `status` message; scope `collect-run` to traces emitted *during* the run (exclude baseline-committed traces). |