prisma · wmadden-electric · May 31, 2026
@@ -38,7 +38,7 @@
     "lint:docs": "node scripts/validate-package-readmes.mjs",
     "lint:manifests": "node scripts/validate-package-manifests.mjs",
     "lint:workflows": "node scripts/lint-workflow-triggers.mjs",
-    "test:scripts": "node --test scripts/lint-workflow-triggers.test.mjs scripts/validate-skills.test.mjs scripts/determine-version-utils.test.ts scripts/check-upgrade-coverage.test.mjs scripts/set-version-utils.test.ts scripts/check-publish-deps-pn-pins.test.mjs scripts/publish-packages-utils.test.mjs scripts/check-clean-tree.test.mjs scripts/lint-casts.test.mjs scripts/sync-agent-rules.test.mjs skills-contrib/drive-diagnose-run/test/load.test.ts skills-contrib/drive-diagnose-run/test/metrics.test.ts skills-contrib/drive-diagnose-run/test/invariants.test.ts skills-contrib/drive-diagnose-run/test/cascade-brief.test.ts skills-contrib/drive-diagnose-run/test/report.test.ts skills-contrib/drive-diagnose-run/test/posthoc.test.ts skills-contrib/drive-diagnose-run/test/scorecard.test.ts skills-contrib/drive-record-traces/test/emit.test.ts skills-contrib/drive-judge-harness/test/usage.test.ts skills-contrib/drive-judge-harness/test/manifest.test.ts skills-contrib/drive-judge-harness/test/load-brief.test.ts skills-contrib/drive-judge-harness/test/run-one-brief.test.ts skills-contrib/drive-judge-harness/test/validate-parser.test.ts skills-contrib/drive-judge-harness/test/judge-model-sdk.test.ts skills-contrib/drive-judge-harness/test/rubric-correctness.test.ts skills-contrib/drive-judge-harness/test/classify-failure.test.ts skills-contrib/drive-judge-harness/test/classify-operator.test.ts skills-contrib/drive-judge-harness/test/emit-correctness.test.ts skills-contrib/drive-judge-harness/test/calibration.test.ts skills-contrib/drive-judge-harness/test/prepare-run.test.ts skills-contrib/drive-judge-harness/test/collect-run.test.ts skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts skills-contrib/drive-judge-harness/test/run-arm.test.ts",
+    "test:scripts": "node --test scripts/lint-workflow-triggers.test.mjs scripts/validate-skills.test.mjs scripts/determine-version-utils.test.ts scripts/check-upgrade-coverage.test.mjs scripts/set-version-utils.test.ts scripts/check-publish-deps-pn-pins.test.mjs scripts/publish-packages-utils.test.mjs scripts/check-clean-tree.test.mjs scripts/lint-casts.test.mjs scripts/sync-agent-rules.test.mjs skills-contrib/drive-diagnose-run/test/load.test.ts skills-contrib/drive-diagnose-run/test/metrics.test.ts skills-contrib/drive-diagnose-run/test/invariants.test.ts skills-contrib/drive-diagnose-run/test/cascade-brief.test.ts skills-contrib/drive-diagnose-run/test/report.test.ts skills-contrib/drive-diagnose-run/test/posthoc.test.ts skills-contrib/drive-diagnose-run/test/scorecard.test.ts skills-contrib/drive-record-traces/test/emit.test.ts skills-contrib/drive-judge-harness/test/usage.test.ts skills-contrib/drive-judge-harness/test/manifest.test.ts skills-contrib/drive-judge-harness/test/load-brief.test.ts skills-contrib/drive-judge-harness/test/run-one-brief.test.ts skills-contrib/drive-judge-harness/test/sdk-events.test.ts skills-contrib/drive-judge-harness/test/validate-parser.test.ts skills-contrib/drive-judge-harness/test/judge-model-sdk.test.ts skills-contrib/drive-judge-harness/test/rubric-correctness.test.ts skills-contrib/drive-judge-harness/test/classify-failure.test.ts skills-contrib/drive-judge-harness/test/classify-operator.test.ts skills-contrib/drive-judge-harness/test/emit-correctness.test.ts skills-contrib/drive-judge-harness/test/calibration.test.ts skills-contrib/drive-judge-harness/test/prepare-run.test.ts skills-contrib/drive-judge-harness/test/collect-run.test.ts skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts skills-contrib/drive-judge-harness/test/run-arm.test.ts",
     "drive:diagnose": "node skills-contrib/drive-diagnose-run/cli.ts",
     "drive:emit": "node skills-contrib/drive-record-traces/emit.ts",
     "drive:run-brief": "node skills-contrib/drive-judge-harness/run-one-brief.ts",

@@ -0,0 +1,37 @@
+# Plan: run-fidelity (TML-2757)
+
+Test-first throughout. The live SDK is reached only via `sdk-adapter.ts`'s dynamic import; all new logic lives in no-SDK-import modules so it's unit-testable with `@cursor/sdk` absent. Spike `2026-05-31-sdk-token-usage-retrieval.md` is committed in dispatch 1.
+
+## Dispatches
+
+### D1 — `sdk-events.ts`: pure mappers + real-shape extraction (test-first)
+- **Outcome:** message/outcome mapping lives in a no-SDK module, with `agent_id` and `durationMs` extracted from the **real captured shapes**.
+- Move `extractText` / `toStreamEvent` / `adaptOutcome` (and the now-dead `extractUsage`) out of `sdk-adapter.ts` into new `sdk-events.ts` (imports nothing from the SDK; operates over `unknown`). Add `agentIdFromMessage`, `outcomeFromResult` (→ `{status,runId,durationMs}`), `streamEventFromMessage`.
+- Tests (`test/sdk-events.test.ts`): feed the real `status`/`assistant`/outcome fixtures from the spike; assert `agent_id`, `durationMs`, stream mapping. Runs with the SDK uninstalled.
+- `sdk-adapter.ts` imports the mappers (no behaviour change).
+- Commit the spike artifact here.
+- **Builds on:** merged run-setup. **Hands to:** D2.
+
+### D2 — capture agent_id + wall-clock end-to-end (test-first)
+- **Outcome:** a finished run records the real `agent_id` and `wall_clock_ms`.
+- `run-one-brief.ts`: `RunOutcome` gains `durationMs: number | null`; adapter captures `agent_id` from the first stream message carrying one and returns it from `wait()`.
+- `manifest.ts`: add `wall_clock_ms`; add the token-unavailable note when `tokens` is null on a finished live run. `run-arm.ts` threads `wall_clock_ms` into the enriched manifest.
+- Tests: outcome→manifest mapping populates `agent_id` + `wall_clock_ms`; null-token note present.
+- **Builds on:** D1. **Hands to:** D3.
+
+### D3 — `collect-run` run-scoping (test-first)
+- **Outcome:** `collectRun` returns only traces emitted during the run.
+- `prepare-run.ts`: snapshot `*.jsonl` under `runDir` after the baseline commit → `PreparedRun.preexistingTracePaths`.
+- `collect-run.ts`: exclude `preexistingTracePaths`; `agent_id` match over the remainder.
+- Tests: baseline-committed trace + run-emitted trace → only the latter returned (cover a gitignored-path trace).
+- **Builds on:** D2. **Hands to:** D4.
+
+### D4 — docs + gates + PR
+- **Outcome:** token gap documented; suite green; PR open.
+- SKILL.md / KNOWN-ISSUES: token gap (link spike) + wall-clock-as-primary note.
+- Wire new tests into `test:scripts`; run `pnpm -w typecheck`, `pnpm -w lint`, `pnpm -w test:scripts`; fix fallout.
+- Stage explicitly, sign off, push to `tml-2757-run-fidelity`, open PR (create-pr skill).
+- **Builds on:** D3.
+
+## Sequencing
+Serial: D1 unlocks testability, D2 consumes the extractors, D3 is independent of D2 but shares the manifest touch (sequence after to avoid conflict), D4 closes. Target 4 dispatches.
@@ -0,0 +1,76 @@
+# Slice: run-fidelity
+
+_Parent project `projects/drive-judge-harness/`. Outcome this slice contributes: the harness records a **faithful** run — correct `agent_id`, a real wall-clock signal, and a trace set scoped to what the run actually emitted — so the corpus the judge calibrates against and the A/B engine ranks on isn't polluted or blank. Fixes the three fidelity defects the first live `run-arm` exposed._
+
+## At a glance
+
+The first live run (composer-2.5, i12-halt) proved the pipeline but mis-recorded the run: `agent_id: null`, `tokens` all-zero, and `collected_trace_paths` containing 5 pre-existing committed traces from the base checkout plus 1 real one. This slice fixes the recordable defects and honestly documents the one that isn't recordable:
+
+- **`agent_id`** is read from the stream `status` message (where the local runtime actually puts it), not the `wait()` outcome.
+- **Wall-clock** (`durationMs` from the outcome) is captured as `wall_clock_ms` — the primary Tier-2 efficiency metric, since tokens are unavailable.
+- **`collect-run`** returns only traces *emitted during the run*, not every schema-valid `.jsonl` in the checkout.
+- **Tokens** stay `null` for local runs with an explicit note + documented SDK limitation (spike `2026-05-31-sdk-token-usage-retrieval.md`).
+
+## Chosen design
+
+Ground-truth shapes from the spike probe (`@cursor/sdk@1.0.15`, local runtime):
+- stream `status` → `{ type:"status", agent_id, run_id, status }`
+- stream `assistant` → `{ type:"assistant", agent_id, run_id, message }`
+- outcome (`wait()`) → `{ id, status, result, model, durationMs }` (no `agent_id`, no tokens)
+
+### 1. `sdk-events.ts` — extract the pure mappers (no SDK import)
+
+Today the message/outcome mappers (`extractUsage`, `extractText`, `adaptOutcome`, `toStreamEvent`) live inside `sdk-adapter.ts`, which `import`s `@cursor/sdk` at module top — so they can't be unit-tested without the SDK installed. Move them into a new **`sdk-events.ts`** that imports nothing from the SDK and operates over `unknown`. `sdk-adapter.ts` imports them. This is what lets the fixes be test-first while preserving the live-execution gate (SDK reached only via `sdk-adapter.ts`'s dynamic import).
+
+`sdk-events.ts` exports pure functions, unit-tested against the **real captured shapes**:
+- `streamEventFromMessage(msg) -> RunStreamEvent` — maps `status`/`assistant` (real shapes) and keeps the `turn-ended` branch for the cloud runtime (still valid if ever used).
+- `agentIdFromMessage(msg) -> string | null` — reads snake_case `agent_id`.
+- `outcomeFromResult(raw) -> { status, runId, durationMs }` — reads `id`→runId, `status`, `durationMs` (number|null).
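The two extractors listed above could look roughly like the following. This is a minimal sketch rather than the PR's actual code: the `isRecord` helper, the `RunOutcomeFields` type name, and the choice to return `null` for a missing `status` or `id` are assumptions made for illustration. Only the field names (`agent_id`, `id`, `status`, `durationMs`) come from the captured shapes.

```typescript
// Hypothetical sketch of no-SDK mappers operating over `unknown`.
// Field names follow the spike's captured shapes; everything else is assumed.
type RunOutcomeFields = {
  status: string | null;
  runId: string | null;
  durationMs: number | null;
};

// Narrow an unknown value to a plain object without importing any SDK type.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Reads the snake_case `agent_id` carried by stream `status`/`assistant` messages.
function agentIdFromMessage(msg: unknown): string | null {
  if (!isRecord(msg)) return null;
  const id = msg["agent_id"];
  return typeof id === "string" && id.length > 0 ? id : null;
}

// Reads `id` (as runId), `status`, and `durationMs` from the `wait()` outcome.
function outcomeFromResult(raw: unknown): RunOutcomeFields {
  if (!isRecord(raw)) return { status: null, runId: null, durationMs: null };
  const { id, status, durationMs } = raw;
  return {
    status: typeof status === "string" ? status : null,
    runId: typeof id === "string" ? id : null,
    durationMs:
      typeof durationMs === "number" && Number.isFinite(durationMs) ? durationMs : null,
  };
}
```

Because nothing here imports `@cursor/sdk`, a test can pass plain object literals shaped like the spike fixtures and run with the SDK uninstalled.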
+
+### 2. `run-one-brief.ts` — capture agent_id + wall-clock
+
+`RunOutcome` gains `durationMs: number | null`. The adapter captures `agent_id` from the **first stream message that carries one** (run-one-brief drains the stream before `wait()`, so it's available), and `wait()` returns it as `agentId`. `durationMs` flows from `outcomeFromResult`. No behaviour change to the dry-run/gate paths.
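The "first stream message that carries one" rule can be sketched as a small stateful capture that the adapter feeds while the stream is drained. The `createAgentIdCapture` name and its `observe`/`current` methods are hypothetical, not taken from the PR. The sketch assumes only that messages arrive as `unknown` values and may carry a string `agent_id`.

```typescript
// Hypothetical helper: remembers the first agent_id seen on the stream.
// The adapter would call observe() for each message and read current()
// when building the value returned from wait().
function createAgentIdCapture() {
  let agentId: string | null = null;
  return {
    observe(msg: unknown): void {
      if (agentId !== null) return; // first message carrying an id wins
      if (typeof msg !== "object" || msg === null) return;
      // Reflect.get avoids casting the message to an assumed SDK type.
      const candidate: unknown = Reflect.get(msg, "agent_id");
      if (typeof candidate === "string" && candidate.length > 0) {
        agentId = candidate;
      }
    },
    current(): string | null {
      return agentId;
    },
  };
}
```

Keeping the capture separate from the stream loop means the first-wins rule can be unit-tested synchronously, without an async iterator or the SDK.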
+
+### 3. `manifest.ts` — wall-clock + honest token note
+
+Add `wall_clock_ms: number | null` (from `durationMs`). When `tokens` is `null` on a *finished live* run, append a note: `"tokens unavailable: @cursor/sdk local runtime emits no usage events (see spike 2026-05-31)"`. `tokens` field stays (null for local).
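As a rough illustration of the rule above, the manifest fields might be derived as follows. The note text is quoted from this section. The `notes` array, the `finished`/`live` flags, and the `fidelityFields` function name are assumptions, since the manifest's real shape is not shown here.

```typescript
// Token shape mirrors the usage schema named in the spike.
type Tokens = {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
};

const TOKENS_UNAVAILABLE_NOTE =
  "tokens unavailable: @cursor/sdk local runtime emits no usage events (see spike 2026-05-31)";

// Hypothetical input/output shapes; the real manifest has more fields.
type FidelityInput = {
  tokens: Tokens | null;
  durationMs: number | null;
  finished: boolean; // assumed flag: the run reached a terminal state
  live: boolean; // assumed flag: not a dry run
  notes?: readonly string[];
};

type FidelityFields = {
  tokens: Tokens | null;
  wall_clock_ms: number | null;
  notes: string[];
};

function fidelityFields(input: FidelityInput): FidelityFields {
  const notes = [...(input.notes ?? [])];
  // Append the note only for finished live runs with no token data, and only once.
  if (
    input.tokens === null &&
    input.finished &&
    input.live &&
    !notes.includes(TOKENS_UNAVAILABLE_NOTE)
  ) {
    notes.push(TOKENS_UNAVAILABLE_NOTE);
  }
  return { tokens: input.tokens, wall_clock_ms: input.durationMs, notes };
}
```

Dry runs and unfinished runs get no note in this sketch, so a `null` there is not mistaken for the SDK limitation.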
+
+### 4. `collect-run.ts` — scope to run-emitted traces
+
+`PreparedRun` gains `preexistingTracePaths: string[]` — the set of `*.jsonl` present under `runDir` immediately after `prepareRun`'s baseline commit (the base checkout's committed traces). `collectRun` excludes that set, so `tracePaths` contains only traces the run produced. This is deterministic (no mtime/clock reliance) and robust to gitignored trace locations (e.g. `wip/drive-trace/`, where the real trace landed). `agent_id` matching then runs over the run-emitted set only.
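The exclusion step itself reduces to a set difference over paths. The sketch below assumes that both lists are produced by the same scanner, so paths are directly comparable as strings. The `excludePreexisting` name is hypothetical, and the directory scan that produces the lists is left out.

```typescript
// Hypothetical sketch: keep only traces that were not present in the
// snapshot taken right after the baseline commit.
// Assumes both lists use the same path form (e.g. both relative to runDir).
function excludePreexisting(
  allTracePaths: readonly string[],
  preexistingTracePaths: readonly string[],
): string[] {
  const preexisting = new Set(preexistingTracePaths);
  return allTracePaths.filter((path) => !preexisting.has(path));
}
```

Since the comparison uses only the recorded path set, it behaves the same whether a trace lives in a tracked directory or a gitignored one such as `wip/drive-trace/`.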
+
+## Coherence rationale
+
+One reviewer holds it in one sitting: every change serves "record the run faithfully," and they're entangled — the `agent_id` fix is what makes `collect-run`'s matching work, the mapper extraction is what makes both testable, and the wall-clock capture is the efficiency metric that stands in for the tokens the SDK won't give us. Rolls back as one unit (one new pure module + additive manifest/outcome fields + a `collect-run` scoping change). Touches no production package.
+
+## Scope
+
+**In:** new `sdk-events.ts` (+ tests with real-shape fixtures); `sdk-adapter.ts` (import the mappers, capture stream `agent_id`); `run-one-brief.ts` (`RunOutcome.durationMs`, agent_id wiring); `manifest.ts` (`wall_clock_ms` + token note); `collect-run.ts` + `prepare-run.ts` (`preexistingTracePaths` snapshot + exclusion); `run-arm.ts` (thread `wall_clock_ms` into the enriched manifest); the spike artifact; SKILL.md / KNOWN-ISSUES note on the token gap; new tests wired into `test:scripts`.
+
+**Out:** a non-SDK token source (Cursor admin/usage API, CLI telemetry) — deferred, out of scope (spike decision). The k=N A/B loop, aggregation, CI gate — TML-2737. The judge — TML-2736.
+
+## Pre-investigated edge cases
+
+| Edge case | Disposition | Notes |
+|---|---|---|
+| Local runtime emits no usage event | Documented, not fixed | Confirmed by spike; `tokens: null` + note is the honest record. |
+| Real trace landed in gitignored `wip/drive-trace/` | Drove the design | Snapshot-exclusion (not git-diff) is why scoping works for gitignored traces. |
+| `agent_id` present on stream but not outcome | Core of the fix | Capture from the stream message, not `wait()`. |
+| Multiple run-emitted traces remain after exclusion | Matching handles it | `agent_id` match, else newest, over the run-emitted set. |
+
+## Slice-specific done conditions
+
+- [ ] A test feeds the **real captured** `status`/`assistant`/outcome shapes (from the spike) through `sdk-events.ts` and asserts `agent_id` + `durationMs` extraction — with `@cursor/sdk` not installed.
+- [ ] A `collect-run` test with a baseline-committed trace + a run-emitted trace asserts only the latter is returned.
+
+## Open Questions
+
+1. **Snapshot `preexistingTracePaths` in `prepare-run` vs re-scan in `collect-run`?** Working position: snapshot in `prepare-run` (deterministic, captures the exact pre-run state) and pass it through `PreparedRun`. A re-scan in `collect-run` happens after the run, so on its own it cannot tell pre-existing traces from run-emitted ones, and it would race any late base writes.
+
+## References
+
+- Parent project: `projects/drive-judge-harness/spec.md`
+- Spike: `projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md`
+- Linear: [TML-2757](https://linear.app/prisma-company/issue/TML-2757) (blocks TML-2737)
+- Surfaces: `skills-contrib/drive-judge-harness/{sdk-adapter,run-one-brief,manifest,collect-run,prepare-run,run-arm}.ts`
+- First-run evidence: manifest at `run-arm-i12-…/run-manifest.json` (agent_id null, tokens 0, polluted trace list)
@@ -0,0 +1,26 @@
+# Spike: can per-run token usage be retrieved from `@cursor/sdk` for a local run?
+
+**Date:** 2026-05-31 · **Trigger:** the first live `run-arm` (composer-2.5, i12-halt) returned `tokens: {all zero}`. **Question:** is the token signal — our stated #1 efficiency metric after correctness — obtainable from the SDK for a *local-runtime* run, via the stream, the run outcome, the `analytics` surface, or the cloud-API `getRun`?
+
+## Answer
+
+**No. Token usage is not retrievable via the `@cursor/sdk` public surface for local runs, by any path.** Wall-clock (`durationMs`) is available and becomes the primary efficiency metric; `tokens` is recorded as `null` because the local runtime supplies nothing to record.
+
+## Evidence (`@cursor/sdk@1.0.15`)
+
+A throwaway probe spawned a trivial local run and dumped every stream message + the `wait()` outcome:
+
+- **Stream messages** — only two types across the whole run: `status` `{ type, agent_id, run_id, status }` and `assistant` `{ type, agent_id, run_id, message }`. **No `turnEnded` / `usage` event is emitted by the local runtime.** (The SDK *does* define a `usage: { inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens }` schema, but it rides on a `turnEnded` update that only the **cloud** runtime streams.)
+- **Run outcome** (`wait()`) — `{ id, status, result, model, durationMs }`. Carries wall-clock (`durationMs`), no tokens. `agent_id` is **not** here; it is on the stream messages.
+- **`analytics.d.ts`** — emit-only outbound telemetry (`trackSdkRunCreated/Completed/SendLatency`, `flushSdkAnalytics`). No read-back API. The event props (`SdkRunCreatedProps`, `SdkRunCompletedProps`, `SdkRunSendLatencyProps`) carry `turn_count`, latency, `end_reason` — **no token counts**.
+- **`cloud-api-client` `getRun({agentId,runId}) → V1Run`** — `{ id, agentId, status, createdAt, updatedAt, durationMs?, result?, git? }`. **No tokens.** `RunResultMetadata` and `executor-types.d.ts` have zero token/usage/cost fields. (Also a cloud-agent query; a local run is not necessarily registered there.)
+
+## Decision (re-route)
+
+Proceed on **option (d)**:
+
+- Capture `durationMs` (wall-clock) from the run outcome → the primary Tier-2 efficiency metric for local runs.
+- `tokens` stays `null` for local runs, with an explicit manifest note + a documented SDK limitation (consumption gotcha). Not a bug in our extraction — there is nothing to extract.
+- A future token source must come from **outside** the SDK (a Cursor admin/usage API, or CLI-internal telemetry). Out of scope for the fix slice.
+
+Companion clean fixes (same slice): capture `agent_id` from the stream `status` message; scope `collect-run` to traces emitted *during* the run (exclude baseline-committed traces).
@@ -52,3 +52,21 @@ The token-usage signal this harness needs comes from `TurnEndedUpdate.usage`, wh
 - read the per-turn `usage` field through a small, explicitly-bounded structural view in `sdk-adapter.ts` (guarded at runtime; no bare casts) rather than a fabricated full mirror of the SDK's types.
 
 When upstream ships resolvable types, replace that structural view with the real `TurnEndedUpdate` import and delete the workaround.
+
+## 2. The local runtime emits no token-usage signal at all
+
+Distinct from (and more fundamental than) the type-resolution gap above: even at **runtime**, the `@cursor/sdk` *local* runtime never emits a usage signal, so there is nothing to read regardless of types.
+
+Confirmed by a probe (spike `projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md`) against `@cursor/sdk@1.0.15`:
+
+- The local `run.stream()` yields only `status` and `assistant` messages — **no `turnEnded`/`usage` event** (that update is streamed only by the *cloud* runtime).
+- The `run.wait()` outcome (`{ id, status, result, model, durationMs }`) carries wall-clock but **no tokens**.
+- The cloud `getRun → V1Run` (`{ id, agentId, status, createdAt, updatedAt, durationMs?, result?, git? }`), `RunResultMetadata`, and the `analytics` surface (emit-only `trackSdkRun*`; props carry `turn_count`/latency/`end_reason`) all carry **no token counts**.
+
+### Impact on this harness
+
+For local runs, `tokens` is `null` (with a manifest note), and **`wall_clock_ms` (the outcome's `durationMs`) is the primary efficiency metric.** `accumulateUsage` remains wired, so usage flows automatically if a cloud run (which does stream `turnEnded`) is used, or once a non-SDK local token source exists.
+
+### Suggested fix (upstream)
+
+Stream `turnEnded` (with `usage`) from the local runtime as the cloud runtime already does, or expose per-run token counts on the run outcome / a queryable usage API.
@@ -168,6 +168,19 @@ trace via the emitter. The spawned orchestrator self-instruments its Drive
 methodology events into `--trace-file` via `drive-record-traces`; the harness
 owns only the token manifest.
 
+**Tokens are unavailable for local runs.** The `@cursor/sdk` *local* runtime
+emits no per-turn usage events at all — no `turnEnded`, and nothing in the run
+outcome, the `getRun`/`V1Run` cloud query, or the `analytics` surface carries
+token counts (confirmed by the spike at
+`projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md`,
+and see KNOWN-ISSUES.md). So `tokens` is honestly `null` for local runs, with a
+note recorded on the manifest. **Wall-clock (`wall_clock_ms`, from the run
+outcome's `durationMs`) is therefore the primary Tier-2 efficiency metric.** A
+future token source would have to come from outside the SDK (a Cursor
+admin/usage API, or CLI-internal telemetry); `accumulateUsage` stays wired so the
+signal flows automatically if the cloud runtime (which *does* stream usage) is
+used, or once a local source exists.
+
 ## The LLM judge (`judge/`)
 
 A bespoke-minimal grader that turns the run's artifacts (diff + golden