2 changes: 1 addition & 1 deletion package.json
@@ -38,7 +38,7 @@
"lint:docs": "node scripts/validate-package-readmes.mjs",
"lint:manifests": "node scripts/validate-package-manifests.mjs",
"lint:workflows": "node scripts/lint-workflow-triggers.mjs",
"test:scripts": "node --test scripts/lint-workflow-triggers.test.mjs scripts/validate-skills.test.mjs scripts/determine-version-utils.test.ts scripts/check-upgrade-coverage.test.mjs scripts/set-version-utils.test.ts scripts/check-publish-deps-pn-pins.test.mjs scripts/publish-packages-utils.test.mjs scripts/check-clean-tree.test.mjs scripts/lint-casts.test.mjs scripts/sync-agent-rules.test.mjs skills-contrib/drive-diagnose-run/test/load.test.ts skills-contrib/drive-diagnose-run/test/metrics.test.ts skills-contrib/drive-diagnose-run/test/invariants.test.ts skills-contrib/drive-diagnose-run/test/cascade-brief.test.ts skills-contrib/drive-diagnose-run/test/report.test.ts skills-contrib/drive-diagnose-run/test/posthoc.test.ts skills-contrib/drive-diagnose-run/test/scorecard.test.ts skills-contrib/drive-record-traces/test/emit.test.ts skills-contrib/drive-judge-harness/test/usage.test.ts skills-contrib/drive-judge-harness/test/manifest.test.ts skills-contrib/drive-judge-harness/test/load-brief.test.ts skills-contrib/drive-judge-harness/test/run-one-brief.test.ts skills-contrib/drive-judge-harness/test/validate-parser.test.ts skills-contrib/drive-judge-harness/test/judge-model-sdk.test.ts skills-contrib/drive-judge-harness/test/rubric-correctness.test.ts skills-contrib/drive-judge-harness/test/classify-failure.test.ts skills-contrib/drive-judge-harness/test/classify-operator.test.ts skills-contrib/drive-judge-harness/test/emit-correctness.test.ts skills-contrib/drive-judge-harness/test/calibration.test.ts skills-contrib/drive-judge-harness/test/prepare-run.test.ts skills-contrib/drive-judge-harness/test/collect-run.test.ts skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts skills-contrib/drive-judge-harness/test/run-arm.test.ts",
"test:scripts": "node --test scripts/lint-workflow-triggers.test.mjs scripts/validate-skills.test.mjs scripts/determine-version-utils.test.ts scripts/check-upgrade-coverage.test.mjs scripts/set-version-utils.test.ts scripts/check-publish-deps-pn-pins.test.mjs scripts/publish-packages-utils.test.mjs scripts/check-clean-tree.test.mjs scripts/lint-casts.test.mjs scripts/sync-agent-rules.test.mjs skills-contrib/drive-diagnose-run/test/load.test.ts skills-contrib/drive-diagnose-run/test/metrics.test.ts skills-contrib/drive-diagnose-run/test/invariants.test.ts skills-contrib/drive-diagnose-run/test/cascade-brief.test.ts skills-contrib/drive-diagnose-run/test/report.test.ts skills-contrib/drive-diagnose-run/test/posthoc.test.ts skills-contrib/drive-diagnose-run/test/scorecard.test.ts skills-contrib/drive-record-traces/test/emit.test.ts skills-contrib/drive-judge-harness/test/usage.test.ts skills-contrib/drive-judge-harness/test/manifest.test.ts skills-contrib/drive-judge-harness/test/load-brief.test.ts skills-contrib/drive-judge-harness/test/run-one-brief.test.ts skills-contrib/drive-judge-harness/test/sdk-events.test.ts skills-contrib/drive-judge-harness/test/validate-parser.test.ts skills-contrib/drive-judge-harness/test/judge-model-sdk.test.ts skills-contrib/drive-judge-harness/test/rubric-correctness.test.ts skills-contrib/drive-judge-harness/test/classify-failure.test.ts skills-contrib/drive-judge-harness/test/classify-operator.test.ts skills-contrib/drive-judge-harness/test/emit-correctness.test.ts skills-contrib/drive-judge-harness/test/calibration.test.ts skills-contrib/drive-judge-harness/test/prepare-run.test.ts skills-contrib/drive-judge-harness/test/collect-run.test.ts skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts skills-contrib/drive-judge-harness/test/run-arm.test.ts",
"drive:diagnose": "node skills-contrib/drive-diagnose-run/cli.ts",
"drive:emit": "node skills-contrib/drive-record-traces/emit.ts",
"drive:run-brief": "node skills-contrib/drive-judge-harness/run-one-brief.ts",
37 changes: 37 additions & 0 deletions projects/drive-judge-harness/slices/run-fidelity/plan.md
@@ -0,0 +1,37 @@
# Plan: run-fidelity (TML-2757)

Test-first throughout. The live SDK is reached only via `sdk-adapter.ts`'s dynamic import; all new logic lives in no-SDK-import modules so it's unit-testable with `@cursor/sdk` absent. Spike `2026-05-31-sdk-token-usage-retrieval.md` is committed in dispatch 1.

## Dispatches

### D1 — `sdk-events.ts`: pure mappers + real-shape extraction (test-first)
- **Outcome:** message/outcome mapping lives in a no-SDK module, with `agent_id` and `durationMs` extracted from the **real captured shapes**.
- Move `extractText` / `toStreamEvent` / `adaptOutcome` (and the now-dead `extractUsage`) out of `sdk-adapter.ts` into new `sdk-events.ts` (imports nothing from the SDK; operates over `unknown`). Add `agentIdFromMessage`, `outcomeFromResult` (→ `{status,runId,durationMs}`), `streamEventFromMessage`.
- Tests (`test/sdk-events.test.ts`): feed the real `status`/`assistant`/outcome fixtures from the spike; assert `agent_id`, `durationMs`, stream mapping. Runs with the SDK uninstalled.
- `sdk-adapter.ts` imports the mappers (no behaviour change).
- Commit the spike artifact here.
- **Builds on:** merged run-setup. **Hands to:** D2.

### D2 — capture agent_id + wall-clock end-to-end (test-first)
- **Outcome:** a finished run records the real `agent_id` and `wall_clock_ms`.
- `run-one-brief.ts`: `RunOutcome` gains `durationMs: number | null`; adapter captures `agent_id` from the first stream message carrying one and returns it from `wait()`.
- `manifest.ts`: add `wall_clock_ms`; add the token-unavailable note when `tokens` is null on a finished live run. `run-arm.ts` threads `wall_clock_ms` into the enriched manifest.
- Tests: outcome→manifest mapping populates `agent_id` + `wall_clock_ms`; null-token note present.
- **Builds on:** D1. **Hands to:** D3.

### D3 — `collect-run` run-scoping (test-first)
- **Outcome:** `collectRun` returns only traces emitted during the run.
- `prepare-run.ts`: snapshot `*.jsonl` under `runDir` after the baseline commit → `PreparedRun.preexistingTracePaths`.
- `collect-run.ts`: exclude `preexistingTracePaths`; `agent_id` match over the remainder.
- Tests: baseline-committed trace + run-emitted trace → only the latter returned (cover a gitignored-path trace).
- **Builds on:** D2. **Hands to:** D4.

### D4 — docs + gates + PR
- **Outcome:** token gap documented; suite green; PR open.
- SKILL.md / KNOWN-ISSUES: token gap (link spike) + wall-clock-as-primary note.
- Wire new tests into `test:scripts`; run `pnpm -w typecheck`, `pnpm -w lint`, `pnpm -w test:scripts`; fix fallout.
- Stage explicitly, sign off, push to `tml-2757-run-fidelity`, open PR (create-pr skill).
- **Builds on:** D3.

## Sequencing
Serial: D1 unlocks testability; D2 consumes the extractors; D3 is logically independent of D2 but touches the same manifest code, so it is sequenced after D2 to avoid conflicts; D4 closes. Target: 4 dispatches.
76 changes: 76 additions & 0 deletions projects/drive-judge-harness/slices/run-fidelity/spec.md
@@ -0,0 +1,76 @@
# Slice: run-fidelity

_Parent project `projects/drive-judge-harness/`. Outcome this slice contributes: the harness records a **faithful** run — correct `agent_id`, a real wall-clock signal, and a trace set scoped to what the run actually emitted — so the corpus the judge calibrates against and the A/B engine ranks on isn't polluted or blank. Fixes the three fidelity defects the first live `run-arm` exposed._

## At a glance

The first live run (composer-2.5, i12-halt) proved the pipeline but mis-recorded the run: `agent_id: null`, `tokens` all-zero, and `collected_trace_paths` containing 5 pre-existing committed traces from the base checkout plus 1 real one. This slice fixes the recordable defects and honestly documents the one that isn't recordable:

- **`agent_id`** is read from the stream `status` message (where the local runtime actually puts it), not the `wait()` outcome.
- **Wall-clock** (`durationMs` from the outcome) is captured as `wall_clock_ms` — the primary Tier-2 efficiency metric, since tokens are unavailable.
- **`collect-run`** returns only traces *emitted during the run*, not every schema-valid `.jsonl` in the checkout.
- **Tokens** stay `null` for local runs with an explicit note + documented SDK limitation (spike `2026-05-31-sdk-token-usage-retrieval.md`).

## Chosen design

Ground-truth shapes from the spike probe (`@cursor/sdk@1.0.15`, local runtime):
- stream `status` → `{ type:"status", agent_id, run_id, status }`
- stream `assistant` → `{ type:"assistant", agent_id, run_id, message }`
- outcome (`wait()`) → `{ id, status, result, model, durationMs }` (no `agent_id`, no tokens)

### 1. `sdk-events.ts` — extract the pure mappers (no SDK import)

Today the message/outcome mappers (`extractUsage`, `extractText`, `adaptOutcome`, `toStreamEvent`) live inside `sdk-adapter.ts`, which `import`s `@cursor/sdk` at module top — so they can't be unit-tested without the SDK installed. Move them into a new **`sdk-events.ts`** that imports nothing from the SDK and operates over `unknown`. `sdk-adapter.ts` imports them. This is what lets the fixes be test-first while preserving the live-execution gate (SDK reached only via `sdk-adapter.ts`'s dynamic import).

`sdk-events.ts` exports pure functions, unit-tested against the **real captured shapes**:
- `streamEventFromMessage(msg) -> RunStreamEvent` — maps `status`/`assistant` (real shapes) and keeps the `turn-ended` branch for the cloud runtime (still valid if ever used).
- `agentIdFromMessage(msg) -> string | null` — reads snake_case `agent_id`.
- `outcomeFromResult(raw) -> { status, runId, durationMs }` — reads `id`→runId, `status`, `durationMs` (number|null).
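
A minimal sketch of how two of these mappers might look, assuming only the field names from the spike-captured shapes listed above. The `isRecord` helper, the `Outcome` type name, and the null handling for missing fields are illustrative assumptions, not the module's actual code.

```typescript
// Sketch only: field names follow the spike-captured shapes; helper names and
// null handling are assumptions made for illustration.

type Outcome = {
  status: string | null;
  runId: string | null;
  durationMs: number | null;
};

// Narrow `unknown` to a plain object without importing or casting to SDK types.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Reads the snake_case `agent_id` carried by stream `status`/`assistant` messages.
function agentIdFromMessage(msg: unknown): string | null {
  if (!isRecord(msg)) return null;
  const id = msg["agent_id"];
  return typeof id === "string" && id.length > 0 ? id : null;
}

// Maps the raw `wait()` outcome: `id` becomes `runId`, and `durationMs` is kept
// only when it is a finite number (otherwise null).
function outcomeFromResult(raw: unknown): Outcome {
  if (!isRecord(raw)) return { status: null, runId: null, durationMs: null };
  const id = raw["id"];
  const status = raw["status"];
  const durationMs = raw["durationMs"];
  return {
    status: typeof status === "string" ? status : null,
    runId: typeof id === "string" ? id : null,
    durationMs:
      typeof durationMs === "number" && Number.isFinite(durationMs) ? durationMs : null,
  };
}
```

Because both functions accept `unknown` and import nothing, a test can feed them plain object fixtures with `@cursor/sdk` absent.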

### 2. `run-one-brief.ts` — capture agent_id + wall-clock

`RunOutcome` gains `durationMs: number | null`. The adapter captures `agent_id` from the **first stream message that carries one** (run-one-brief drains the stream before `wait()`, so it's available), and `wait()` returns it as `agentId`. `durationMs` flows from `outcomeFromResult`. No behaviour change to the dry-run/gate paths.
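
The "first stream message that carries one" rule could be sketched as a small capture object that the adapter's stream loop feeds. `createAgentIdCapture` and `readAgentId` are hypothetical names introduced here for illustration; only the `agent_id` field name comes from the captured shapes.

```typescript
// Sketch only: `createAgentIdCapture` is a hypothetical helper, not part of the
// harness or the SDK.

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function readAgentId(msg: unknown): string | null {
  if (!isRecord(msg)) return null;
  const id = msg["agent_id"];
  return typeof id === "string" && id.length > 0 ? id : null;
}

function createAgentIdCapture() {
  let captured: string | null = null;
  return {
    // Call once per stream message; later agent_id values never overwrite the first.
    observe(msg: unknown): void {
      if (captured === null) captured = readAgentId(msg);
    },
    // Read after the stream has drained, e.g. when assembling the wait() result.
    get agentId(): string | null {
      return captured;
    },
  };
}
```

Since the stream is drained before `wait()`, reading `agentId` at that point sees whatever the stream provided, or `null` if no message carried an id.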

### 3. `manifest.ts` — wall-clock + honest token note

Add `wall_clock_ms: number | null` (from `durationMs`). When `tokens` is `null` on a *finished live* run, append a note: `"tokens unavailable: @cursor/sdk local runtime emits no usage events (see spike 2026-05-31)"`. The `tokens` field stays (`null` for local runs).
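
As a rough illustration of the rule above, the following sketch builds the affected manifest fields. `ManifestInput`, `buildManifestFields`, the `live`/`finished` flags, and the `notes` array are assumptions made for this example; the note text is the one quoted above and the token field names are taken from the usage schema recorded in the spike.

```typescript
// Sketch only: type and function names are hypothetical, as is the `notes` field.

type TokenUsage = {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
};

interface ManifestInput {
  agentId: string | null;
  durationMs: number | null;
  tokens: TokenUsage | null;
  live: boolean; // false for dry runs
  finished: boolean; // true once the run reached a terminal outcome
}

const TOKENS_UNAVAILABLE_NOTE =
  "tokens unavailable: @cursor/sdk local runtime emits no usage events (see spike 2026-05-31)";

function buildManifestFields(input: ManifestInput) {
  const notes: string[] = [];
  // The note is added only for a finished live run that has no token data.
  if (input.tokens === null && input.live && input.finished) {
    notes.push(TOKENS_UNAVAILABLE_NOTE);
  }
  return {
    agent_id: input.agentId,
    wall_clock_ms: input.durationMs,
    tokens: input.tokens,
    notes,
  };
}
```

Keeping `tokens` as `null` rather than a zero-filled object distinguishes "no signal available" from "the run used zero tokens", which is the confusion the first live run produced.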

### 4. `collect-run.ts` — scope to run-emitted traces

`PreparedRun` gains `preexistingTracePaths: string[]` — the set of `*.jsonl` present under `runDir` immediately after `prepareRun`'s baseline commit (the base checkout's committed traces). `collectRun` excludes that set, so `tracePaths` contains only traces the run produced. This is deterministic (no mtime/clock reliance) and robust to gitignored trace locations (e.g. `wip/drive-trace/`, where the real trace landed). `agent_id` matching then runs over the run-emitted set only.
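
The snapshot-and-exclude idea can be sketched over plain path lists. Both function names below are hypothetical, and the caller is assumed to supply the file listings; the real modules would scan `runDir` on disk.

```typescript
// Sketch only: function names are hypothetical and paths are plain strings
// supplied by the caller rather than read from disk.

// prepare-run side: record the `.jsonl` paths present right after the baseline commit.
function snapshotTracePaths(pathsAfterBaseline: readonly string[]): string[] {
  return pathsAfterBaseline.filter((p) => p.endsWith(".jsonl")).sort();
}

// collect-run side: keep only `.jsonl` paths that were not in the pre-run snapshot.
function runEmittedTraces(
  pathsAfterRun: readonly string[],
  preexistingTracePaths: readonly string[],
): string[] {
  const preexisting = new Set(preexistingTracePaths);
  return pathsAfterRun
    .filter((p) => p.endsWith(".jsonl") && !preexisting.has(p))
    .sort();
}
```

The exclusion is a set difference on paths, so it depends on neither file timestamps nor git status, which is why a trace written under a gitignored directory is still picked up.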

## Coherence rationale

One reviewer holds it in one sitting: every change serves "record the run faithfully," and they're entangled — the `agent_id` fix is what makes `collect-run`'s matching work, the mapper extraction is what makes both testable, and the wall-clock capture is the efficiency metric that stands in for the tokens the SDK won't give us. Rolls back as one unit (one new pure module + additive manifest/outcome fields + a `collect-run` scoping change). Touches no production package.

## Scope

**In:** new `sdk-events.ts` (+ tests with real-shape fixtures); `sdk-adapter.ts` (import the mappers, capture stream `agent_id`); `run-one-brief.ts` (`RunOutcome.durationMs`, agent_id wiring); `manifest.ts` (`wall_clock_ms` + token note); `collect-run.ts` + `prepare-run.ts` (`preexistingTracePaths` snapshot + exclusion); `run-arm.ts` (thread `wall_clock_ms` into the enriched manifest); the spike artifact; SKILL.md / KNOWN-ISSUES note on the token gap; new tests wired into `test:scripts`.

**Out:** a non-SDK token source (Cursor admin/usage API, CLI telemetry) — deferred, out of scope (spike decision). The k=N A/B loop, aggregation, CI gate — TML-2737. The judge — TML-2736.

## Pre-investigated edge cases

| Edge case | Disposition | Notes |
|---|---|---|
| Local runtime emits no usage event | Documented, not fixed | Confirmed by spike; `tokens: null` + note is the honest record. |
| Real trace landed in gitignored `wip/drive-trace/` | Drove the design | Snapshot-exclusion (not git-diff) is why scoping works for gitignored traces. |
| `agent_id` present on stream but not outcome | Core of the fix | Capture from the stream message, not `wait()`. |
| Multiple run-emitted traces remain after exclusion | Matching handles it | `agent_id` match, else newest, over the run-emitted set. |
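
For the last row, one way the "`agent_id` match, else newest" rule might look is sketched below. The `TraceCandidate` shape, the `pickTrace` name, and the numeric `order` field standing in for "newest" are all assumptions; the spec does not say how recency is measured.

```typescript
// Sketch only: every name here is hypothetical, and `order` is a stand-in for
// however the harness decides which trace is newest.

interface TraceCandidate {
  path: string;
  agentIds: string[]; // agent_id values found in the trace's events (assumed)
  order: number; // larger means newer
}

function pickTrace(
  candidates: readonly TraceCandidate[],
  agentId: string | null,
): TraceCandidate | null {
  if (candidates.length === 0) return null;
  // Copy before sorting so the caller's array is not mutated.
  const newestFirst = [...candidates].sort((a, b) => b.order - a.order);
  if (agentId !== null) {
    const match = newestFirst.find((c) => c.agentIds.includes(agentId));
    if (match !== undefined) return match;
  }
  // No agent_id to match on, or no trace mentions it: fall back to the newest.
  return newestFirst[0] ?? null;
}
```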

## Slice-specific done conditions

- [ ] A test feeds the **real captured** `status`/`assistant`/outcome shapes (from the spike) through `sdk-events.ts` and asserts `agent_id` + `durationMs` extraction — with `@cursor/sdk` not installed.
- [ ] A `collect-run` test with a baseline-committed trace + a run-emitted trace asserts only the latter is returned.

## Open Questions

1. **Snapshot `preexistingTracePaths` in `prepare-run` vs re-scan in `collect-run`?** Working position: snapshot in `prepare-run` (deterministic, captures the exact pre-run state) and pass it through `PreparedRun`. Re-scanning in `collect-run` would race any late base writes.

## References

- Parent project: `projects/drive-judge-harness/spec.md`
- Spike: `projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md`
- Linear: [TML-2757](https://linear.app/prisma-company/issue/TML-2757) (blocks TML-2737)
- Surfaces: `skills-contrib/drive-judge-harness/{sdk-adapter,run-one-brief,manifest,collect-run,prepare-run,run-arm}.ts`
- First-run evidence: manifest at `run-arm-i12-…/run-manifest.json` (agent_id null, tokens 0, polluted trace list)
@@ -0,0 +1,26 @@
# Spike: can per-run token usage be retrieved from `@cursor/sdk` for a local run?

**Date:** 2026-05-31 · **Trigger:** the first live `run-arm` (composer-2.5, i12-halt) returned `tokens: {all zero}`. **Question:** is the token signal — our stated #1 efficiency metric after correctness — obtainable from the SDK for a *local-runtime* run, via the stream, the run outcome, the `analytics` surface, or the cloud-API `getRun`?

## Answer

**No. Token usage is not retrievable via the `@cursor/sdk` public surface for local runs, by any path.** Wall-clock (`durationMs`) is available and becomes the primary efficiency metric; `tokens` is honestly `null` from the runtime.

## Evidence (`@cursor/sdk@1.0.15`)

A throwaway probe spawned a trivial local run and dumped every stream message + the `wait()` outcome:

- **Stream messages** — only two types across the whole run: `status` `{ type, agent_id, run_id, status }` and `assistant` `{ type, agent_id, run_id, message }`. **No `turnEnded` / `usage` event is emitted by the local runtime.** (The SDK *does* define a `usage: { inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens }` schema, but it rides on a `turnEnded` update that only the **cloud** runtime streams.)
- **Run outcome** (`wait()`) — `{ id, status, result, model, durationMs }`. Carries wall-clock (`durationMs`), no tokens. `agent_id` is **not** here; it is on the stream messages.
- **`analytics.d.ts`** — emit-only outbound telemetry (`trackSdkRunCreated/Completed/SendLatency`, `flushSdkAnalytics`). No read-back API. The event props (`SdkRunCreatedProps`, `SdkRunCompletedProps`, `SdkRunSendLatencyProps`) carry `turn_count`, latency, `end_reason` — **no token counts**.
- **`cloud-api-client` `getRun({agentId,runId}) → V1Run`** — `{ id, agentId, status, createdAt, updatedAt, durationMs?, result?, git? }`. **No tokens.** `RunResultMetadata` and `executor-types.d.ts` have zero token/usage/cost fields. (Also a cloud-agent query; a local run is not necessarily registered there.)

## Decision (re-route)

Proceed on **option (d)**:

- Capture `durationMs` (wall-clock) from the run outcome → the primary Tier-2 efficiency metric for local runs.
- `tokens` stays `null` for local runs, with an explicit manifest note + a documented SDK limitation (consumption gotcha). Not a bug in our extraction — there is nothing to extract.
- A future token source must come from **outside** the SDK (a Cursor admin/usage API, or CLI-internal telemetry). Out of scope for the fix slice.

Companion clean fixes (same slice): capture `agent_id` from the stream `status` message; scope `collect-run` to traces emitted *during* the run (exclude baseline-committed traces).
18 changes: 18 additions & 0 deletions skills-contrib/drive-judge-harness/KNOWN-ISSUES.md
@@ -52,3 +52,21 @@ The token-usage signal this harness needs comes from `TurnEndedUpdate.usage`, wh
- read the per-turn `usage` field through a small, explicitly-bounded structural view in `sdk-adapter.ts` (guarded at runtime; no bare casts) rather than a fabricated full mirror of the SDK's types.

When upstream ships resolvable types, replace that structural view with the real `TurnEndedUpdate` import and delete the workaround.
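
A runtime-guarded structural view of this kind might look like the sketch below. The four counter names are the ones recorded in the spike for the SDK's usage schema; the `UsageView` and `usageFromUpdate` names, and the requirement that all four counters be present and numeric, are assumptions for illustration rather than the adapter's actual code.

```typescript
// Sketch only: a bounded structural view over an `unknown` update, with no
// import of (or cast to) the SDK's own types.

type UsageView = {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
};

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isCount(value: unknown): value is number {
  return typeof value === "number" && Number.isFinite(value) && value >= 0;
}

// Returns the usage counters if the update carries a well-formed `usage`
// object, otherwise null (for example for `status` or `assistant` messages).
function usageFromUpdate(update: unknown): UsageView | null {
  if (!isRecord(update)) return null;
  const usage = update["usage"];
  if (!isRecord(usage)) return null;
  const inputTokens = usage["inputTokens"];
  const outputTokens = usage["outputTokens"];
  const cacheReadTokens = usage["cacheReadTokens"];
  const cacheWriteTokens = usage["cacheWriteTokens"];
  if (
    !isCount(inputTokens) ||
    !isCount(outputTokens) ||
    !isCount(cacheReadTokens) ||
    !isCount(cacheWriteTokens)
  ) {
    return null;
  }
  return { inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens };
}
```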

## 2. The local runtime emits no token-usage signal at all

Distinct from (and more fundamental than) the type-resolution gap above: even at **runtime**, the `@cursor/sdk` *local* runtime never emits a usage signal, so there is nothing to read regardless of types.

Confirmed by a probe (spike `projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md`) against `@cursor/sdk@1.0.15`:

- The local `run.stream()` yields only `status` and `assistant` messages — **no `turnEnded`/`usage` event** (that update is streamed only by the *cloud* runtime).
- The `run.wait()` outcome (`{ id, status, result, model, durationMs }`) carries wall-clock but **no tokens**.
- The cloud `getRun → V1Run` (`{ id, agentId, status, createdAt, updatedAt, durationMs?, result?, git? }`), `RunResultMetadata`, and the `analytics` surface (emit-only `trackSdkRun*`; props carry `turn_count`/latency/`end_reason`) all carry **no token counts**.

### Impact on this harness

For local runs, `tokens` is `null` (with a manifest note), and **`wall_clock_ms` (the outcome's `durationMs`) is the primary efficiency metric.** `accumulateUsage` remains wired, so usage flows automatically if a cloud run (which does stream `turnEnded`) is used, or once a non-SDK local token source exists.
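
The behaviour described here, totals when usage arrives and `null` when it never does, can be illustrated with a small accumulator. This is a hypothetical `sumUsage` function written for this note, not the harness's `accumulateUsage`, whose signature is not shown in this document.

```typescript
// Sketch only: `sumUsage` is a hypothetical stand-in, not the real accumulator.

type Usage = {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
};

// Sums per-turn usage. Returns null when no turn reported usage, so a local run
// (which emits none) records `tokens: null` rather than a misleading all-zero total.
function sumUsage(perTurn: ReadonlyArray<Usage | null>): Usage | null {
  let total: Usage | null = null;
  for (const turn of perTurn) {
    if (turn === null) continue;
    if (total === null) {
      total = { ...turn };
    } else {
      total = {
        inputTokens: total.inputTokens + turn.inputTokens,
        outputTokens: total.outputTokens + turn.outputTokens,
        cacheReadTokens: total.cacheReadTokens + turn.cacheReadTokens,
        cacheWriteTokens: total.cacheWriteTokens + turn.cacheWriteTokens,
      };
    }
  }
  return total;
}
```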

### Suggested fix (upstream)

Stream `turnEnded` (with `usage`) from the local runtime as the cloud runtime already does, or expose per-run token counts on the run outcome / a queryable usage API.
13 changes: 13 additions & 0 deletions skills-contrib/drive-judge-harness/SKILL.md
@@ -168,6 +168,19 @@ trace via the emitter. The spawned orchestrator self-instruments its Drive
methodology events into `--trace-file` via `drive-record-traces`; the harness
owns only the token manifest.

**Tokens are unavailable for local runs.** The `@cursor/sdk` *local* runtime
emits no per-turn usage events at all — no `turnEnded`, and nothing in the run
outcome, the `getRun`/`V1Run` cloud query, or the `analytics` surface carries
token counts (confirmed by the spike at
`projects/drive-judge-harness/spikes/2026-05-31-sdk-token-usage-retrieval.md`,
and see KNOWN-ISSUES.md). So `tokens` is honestly `null` for local runs, with a
note recorded on the manifest. **Wall-clock (`wall_clock_ms`, from the run
outcome's `durationMs`) is therefore the primary Tier-2 efficiency metric.** A
future token source would have to come from outside the SDK (a Cursor
admin/usage API, or CLI-internal telemetry); `accumulateUsage` stays wired so the
signal flows automatically if the cloud runtime (which *does* stream usage) is
used, or once a local source exists.

## The LLM judge (`judge/`)

A bespoke-minimal grader that turns the run's artifacts (diff + golden