TML-2755: Pin the skill bundle under test (run-setup: prepare / collect / run-arm) by wmadden-electric · Pull Request #656 · prisma/prisma-next

wmadden-electric · 2026-05-31T13:03:09Z

At a glance

A Drive run's quality is a function of the skill version that drove it — so to measure a skill change, the run input must include which skills ran. This slice makes the skill bundle a first-class, pinned run input alongside model and (brief, base). The harness can now isolate a checkout, inject a specified skill bundle, run the orchestrator there, and collect the run's trace + agent-only diff:

pnpm drive:run-arm \
  --case projects/drive-judge-harness/assets/golden/i12-halt-storage-assumption \
  --model composer-2.5 \
  --base-ref main --skill-bundle-ref main \
  --run-dir /tmp/arm-1 --live

prepare-run isolates + injects + materializes; run-one-brief spawns the orchestrator in that checkout; collect-run harvests the emitted trace and the agent's diff. The run manifest records base_sha, skill_bundle_sha, run_dir, the collected trace paths, and the diff stat.
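The composition described above can be sketched as follows. This is a minimal illustration, not the code in run-arm.ts: the manifest field names are taken from the PR description, while the type names, the `deps` signatures, and `runArmSketch` itself are assumptions made for the example.

```typescript
// Hypothetical shapes: only the snake_case manifest field names come from the PR.
interface DiffStat { files: number; insertions: number; deletions: number }

interface ArmConfig { baseRef: string; skillBundleRef: string; model: string; runDir: string }

interface Prepared { baseSha: string; skillBundleSha: string; prepareCommit: string; materialized: boolean }

interface Collected { tracePaths: string[]; diffStat: DiffStat }

// Injected dependencies so the pipeline can be exercised without git or an SDK.
interface ArmDeps {
  prepareRun(config: ArmConfig): Promise<Prepared>;
  runOneBrief(opts: { runDir: string; model: string }): Promise<void>;
  collectRun(opts: { runDir: string; baselineCommit: string }): Promise<Collected>;
}

interface ArmManifest {
  base_ref: string;
  base_sha: string;
  skill_bundle_ref: string;
  skill_bundle_sha: string;
  run_dir: string;
  collected_trace_paths: string[];
  diff_stat: DiffStat;
  materialized: boolean;
}

async function runArmSketch(config: ArmConfig, deps: ArmDeps): Promise<ArmManifest> {
  // 1. Isolate the checkout, inject the skill bundle, materialize, commit a baseline.
  const prepared = await deps.prepareRun(config);
  // 2. Spawn the orchestrator inside the prepared checkout.
  await deps.runOneBrief({ runDir: config.runDir, model: config.model });
  // 3. Harvest traces and the diff relative to the baseline commit.
  const collected = await deps.collectRun({
    runDir: config.runDir,
    baselineCommit: prepared.prepareCommit,
  });
  return {
    base_ref: config.baseRef,
    base_sha: prepared.baseSha,
    skill_bundle_ref: config.skillBundleRef,
    skill_bundle_sha: prepared.skillBundleSha,
    run_dir: config.runDir,
    collected_trace_paths: collected.tracePaths,
    diff_stat: collected.diffStat,
    materialized: prepared.materialized,
  };
}
```

Passing the three stages in as `deps` mirrors the PR's stated goal of exercising everything with the SDK absent: a test can supply fakes and assert only on ordering and on the manifest.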

The decision

The materialized skill trees (.cursor/, .claude/, .agents/skills/) are gitignored, so a clean checkout has no skills until something runs the prepare hook. That means every meaningful run needs a setup step — even one against current main — not just A/B runs. We make that step explicit and reproducible:

- A skill bundle is a git ref, not an ad-hoc file copy: a commit/branch/tag whose canonical homes (skills-contrib/, .agents/rules/, AGENTS.md/CLAUDE.md) define it. Materialization reuses the repo's own prepare hook (skills add + sync-agent-rules); the harness does not reinvent it.
- repo-under-test and skill-bundle-under-test are named apart even though they're the same repo today, so the eventual move of the drive skills to their own host repo costs no rework.
- The agent's diff is cut at a baseline commit. prepare-run finalizes with a git commit --allow-empty after overlay + materialize; collect-run diffs against that commit, so the injected skill overlay never pollutes the agent's changes. (The --allow-empty matters: when base == bundle the overlay is byte-identical and stages nothing; this was caught and fixed in review.)
- Traces are collected post-hoc by globbing the checkout for schema-valid *.jsonl (matched by orchestrator_agent_id, else newest), leaving emit.ts and the emission protocol untouched.
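The trace-selection rule in the last point (match by orchestrator_agent_id, else newest) can be illustrated with a pure function over already-discovered files. This is a sketch under assumptions: `TraceCandidate`, `readOrchestratorId`, and `selectTraces` are hypothetical names, and the JSON check below is only a stand-in for the real schema validation, whose details are not shown in this PR.

```typescript
// One discovered *.jsonl file: its path, its first line, and its modification time.
interface TraceCandidate { path: string; firstLine: string; mtimeMs: number }

// Stand-in for schema validation (assumption): a first line counts as valid
// when it is a JSON object carrying a string orchestrator_agent_id.
function readOrchestratorId(firstLine: string): string | undefined {
  try {
    const parsed: unknown = JSON.parse(firstLine);
    if (typeof parsed === "object" && parsed !== null) {
      const id = (parsed as Record<string, unknown>)["orchestrator_agent_id"];
      return typeof id === "string" ? id : undefined;
    }
  } catch {
    // Not JSON at all: treated as not schema-valid.
  }
  return undefined;
}

function selectTraces(candidates: TraceCandidate[], orchestratorAgentId?: string): string[] {
  // Keep only candidates whose first line passes the stand-in validation.
  const valid = candidates
    .map((c) => ({ c, id: readOrchestratorId(c.firstLine) }))
    .filter((x): x is { c: TraceCandidate; id: string } => x.id !== undefined);
  if (valid.length === 0) return [];

  // Prefer traces emitted by the orchestrator that was actually spawned.
  if (orchestratorAgentId !== undefined) {
    const matched = valid.filter((v) => v.id === orchestratorAgentId);
    if (matched.length > 0) return matched.map((v) => v.c.path);
  }

  // Fallback: the most recently modified valid trace.
  const newest = valid.reduce((a, b) => (b.c.mtimeMs > a.c.mtimeMs ? b : a));
  return [newest.c.path];
}
```

Keeping selection separate from the filesystem walk means the matching rule can be tested with plain objects, independent of how the glob is performed.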

This is the run-production half of the always-anticipated experiment-engine split (project plan § Sequencing rationale): it unblocks live corpus generation on its own and is the prerequisite for the k=N A/B engine (TML-2737), where an arm is exactly (brief+base, model, skill-bundle) with one axis varied.
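To make "one axis varied" concrete, here is a small sketch of how a pair of arms could be checked. The `Arm` shape and both function names are hypothetical; the A/B engine itself (TML-2737) is out of scope for this PR and may define these differently.

```typescript
// Hypothetical arm description: (brief+base, model, skill-bundle).
interface Arm { brief: string; baseRef: string; model: string; skillBundleRef: string }

type Axis = "brief+base" | "model" | "skill-bundle";

// Lists the axes on which two arms differ.
function differingAxes(a: Arm, b: Arm): Axis[] {
  const axes: Axis[] = [];
  // brief and base are treated as one axis, as in the PR description.
  if (a.brief !== b.brief || a.baseRef !== b.baseRef) axes.push("brief+base");
  if (a.model !== b.model) axes.push("model");
  if (a.skillBundleRef !== b.skillBundleRef) axes.push("skill-bundle");
  return axes;
}

// A pair is a clean A/B comparison only when exactly one axis differs.
function isSingleAxisPair(a: Arm, b: Arm): boolean {
  return differingAxes(a, b).length === 1;
}
```

This sketch compares ref names as given; comparing the resolved SHAs recorded in the manifest would be stricter, since two different ref names can point at the same commit.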

What's in the diff

New: prepare-run.ts, collect-run.ts, run-arm.ts (+ 4 test suites). Additive RunManifest fields. A cwd thread-through in run-one-brief.ts / sdk-adapter.ts (the spawn was hard-pinned to process.cwd()).
Planning: the slice spec + plan; design-notes.md § Run setup; plan.md records the executed split (now 5 slices, with the project-boundary rationale).
Invariant preserved: @cursor/sdk is still reached only via sdk-adapter.ts's dynamic import on the --live path — none of the new modules or their tests import it. Everything is exercised with the SDK absent and no CURSOR_API_KEY (git + fs + mocked materialize/agent).

Gates

node --test 570/570 · pnpm lint:deps pass · pnpm lint:casts pass (delta=0) · pnpm typecheck unchanged (pre-existing target-postgres failures only; skills-contrib is not in the typecheck graph).

Scope (deliberately out)

The k=N A/B loop, cross-run aggregation, dashboard, and CI regression gate (TML-2737); the env-pinned trace destination (recorded escape hatch); admitting @cursor/sdk into the lockfile + an actual live run (operator-gated).

Linear: TML-2755 (blocks TML-2737)

Summary by CodeRabbit

New Features
- Introduced pinned skill-bundle pipeline for reproducible run workflows with automated trace metadata collection and git diff statistics reporting
- Enhanced run manifest enrichment with baseline commit tracking and materialization status
Documentation
- Expanded harness documentation with detailed guides on run preparation, artifact collection, and CLI command examples for dry-run and live execution modes
Tests
- Added comprehensive test coverage for trace collection, run preparation, and end-to-end workflow validation

Capture the skill-bundle-under-test as a first-class pinned run input (design-notes § Run setup) and decompose the run-production capability into the run-setup slice: prepare-run (isolate + inject + materialize + baseline commit), collect-run (post-hoc trace/diff harvest), and a run-arm wrapper. This is the run-production half of the always-anticipated experiment-engine split; it lands ahead of the A/B loop and unblocks live corpus generation. (TML-2755, blocks TML-2737) Signed-off-by: Will Madden <madden@prisma.io>

…n-arm)

Implements the run-setup slice (TML-2755) so the skill bundle under test becomes a first-class, pinned run input:

- prepare-run.ts: isolates a detached git worktree at baseRef, overlays the skill bundle's canonical home dirs (skills-contrib, .agents/rules, AGENTS.md, CLAUDE.md) via git archive | tar -x, materializes via the repo's own prepare hook, and finalizes a baseline commit so the agent's diff is cleanly separable from the injected skills.
- collect-run.ts: post-hoc collection. Globs *.jsonl under runDir, keeps those whose first line validates against Slice1TraceEvent, matches by orchestrator_agent_id (falling back to newest), and computes diff/diffStat against the baseline commit (not baseRef), so injected skill files are excluded from the collected diff.
- run-arm.ts: thin CLI + runArm(config, deps) composing the pipeline prepareRun to runOneBrief({runDir}) to collectRun and writing an enriched manifest that carries base_ref, base_sha, skill_bundle_ref, skill_bundle_sha, run_dir, collected_trace_paths, diff_stat, and materialized.
- run-one-brief.ts / sdk-adapter.ts: thread runDir through as cwd so the orchestrator spawns inside the prepared checkout rather than the harness process working directory.
- manifest.ts: additive optional fields for the pinned-input metadata.

Tests (568 total, all passing):

- prepare-run.test.ts (8 tests) with real git fixture and mocked materialize
- collect-run.test.ts (9 tests) including required diff-exclusion proof
- run-one-brief-cwd.test.ts (2 tests) asserting cwd === runDir is threaded
- run-arm.test.ts (4 tests) pipeline composition with manifest round-trip

No @cursor/sdk import on any of these paths. All gates green:

- pnpm lint:deps: pass
- pnpm lint:casts: pass (delta=0)
- node --test all new + existing harness suites: 568 pass, 0 fail
- pnpm typecheck: pre-existing failures unchanged (target-postgres/mongo/cli only; skills-contrib is not a turbo package)

Signed-off-by: Will Madden <madden@prisma.io>
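As an illustration of the diffStat step, the sketch below turns `git diff --numstat` output into file and line counts. The `DiffStat` field names and `parseNumstat` are assumptions for the example; the actual type in manifest.ts may differ. The input format (added, deleted, and path separated by tabs, with `-` for binary files) is git's documented numstat format.

```typescript
// Hypothetical DiffStat shape: the PR only says it holds file and insertion/deletion counts.
interface DiffStat { files: number; insertions: number; deletions: number }

// Parses the output of `git diff --numstat <baseline>`.
// Each line is "<added>\t<deleted>\t<path>"; binary files report "-" for both counts.
function parseNumstat(output: string): DiffStat {
  const stat: DiffStat = { files: 0, insertions: 0, deletions: 0 };
  const toCount = (field: string): number => (field === "-" ? 0 : parseInt(field, 10));

  for (const line of output.split("\n")) {
    const fields = line.split("\t");
    if (fields.length < 3) continue; // blank or malformed line
    const added = fields[0];
    const deleted = fields[1];
    if (added === undefined || deleted === undefined) continue;
    stat.files += 1; // binary files still count as changed files
    stat.insertions += toCount(added);
    stat.deletions += toCount(deleted);
  }
  return stat;
}
```

Because the diff is taken against the baseline commit rather than baseRef, the overlay's own files never appear in this output, so no path filtering is needed at the parsing stage.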

When baseRef === skillBundle.ref (prepare a run against current main with the current skill bundle), the overlay is byte-identical to the base checkout and the materialized trees are gitignored, so git add -A stages nothing and the baseline commit exited non-zero. Use git commit --allow-empty so prepareCommit is always a real commit usable as the collect-run diff cut point, regardless of whether the overlay introduced changes. Adds two tests: the no-op-overlay case yields a 40-char prepareCommit without throwing, and an end-to-end prepareRun + collectRun proving the cut point holds with an empty overlay (zero changes before agent work, picks up a post-baseline change). Signed-off-by: Will Madden <madden@prisma.io>
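The fix can be sketched as a small function that always yields a commit to diff against. `GitRunner` and `finalizeBaseline` are hypothetical names, and the commit message is made up; only the `git add -A`, `git commit --allow-empty`, and `git rev-parse HEAD` invocations are real git commands.

```typescript
// Hypothetical injected runner: executes `git <args>` in `cwd` and returns stdout.
type GitRunner = (args: string[], cwd: string) => string;

// Stages whatever the overlay changed and records a baseline commit.
// --allow-empty keeps the commit from failing when nothing was staged,
// e.g. when the base ref and the skill-bundle ref are the same.
function finalizeBaseline(runDir: string, git: GitRunner): string {
  git(["add", "-A"], runDir);
  git(["commit", "--allow-empty", "-m", "harness: baseline after skill overlay"], runDir);
  // The resulting SHA is the cut point that the later diff is taken against.
  return git(["rev-parse", "HEAD"], runDir).trim();
}
```

In a real harness the runner would presumably wrap a child-process call; injecting it keeps the sketch testable without a git binary.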

…satisfied) Signed-off-by: Will Madden <madden@prisma.io>

coderabbitai · 2026-05-31T13:04:34Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 8f60aae3-b368-4189-83b9-0634b3bfff3f

📥 Commits

Reviewing files that changed from the base of the PR and between 2e120e5 and 3ff3e94.

⛔ Files ignored due to path filters (5)

projects/drive-judge-harness/design-notes.md is excluded by !projects/**
projects/drive-judge-harness/plan.md is excluded by !projects/**
projects/drive-judge-harness/slices/run-setup/plan.md is excluded by !projects/**
projects/drive-judge-harness/slices/run-setup/spec.md is excluded by !projects/**
projects/drive-judge-harness/slices/run-setup/trace.jsonl is excluded by !projects/**

📒 Files selected for processing (13)

package.json
skills-contrib/drive-judge-harness/SKILL.md
skills-contrib/drive-judge-harness/collect-run.ts
skills-contrib/drive-judge-harness/manifest.ts
skills-contrib/drive-judge-harness/prepare-run.ts
skills-contrib/drive-judge-harness/run-arm.ts
skills-contrib/drive-judge-harness/run-one-brief.ts
skills-contrib/drive-judge-harness/sdk-adapter.ts
skills-contrib/drive-judge-harness/test/collect-run.test.ts
skills-contrib/drive-judge-harness/test/prepare-run.test.ts
skills-contrib/drive-judge-harness/test/run-arm.test.ts
skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts
skills-contrib/drive-judge-harness/test/run-one-brief.test.ts

📝 Walkthrough

This PR adds a complete "prepare → run → collect → manifest" workflow to the drive-judge-harness. It introduces modules for isolated run directory preparation with skill-bundle overlay, trace/diff collection, and orchestration, while integrating configurable working directories into the existing run-one-brief harness through a new cwd parameter.

Changes

Drive Judge Harness: Run-Arm Workflow and Integration

Layer / File(s)	Summary
Manifest data model extensions `skills-contrib/drive-judge-harness/manifest.ts`	Extends `RunManifest` with fields for pinned input refs/SHAs (`base_ref`, `skill_bundle_ref`, etc.), run outputs (`collected_trace_paths`, `diff_stat`), and `materialized` flag. Introduces `DiffStat` type for file and insertion/deletion counts.
Run preparation pipeline `skills-contrib/drive-judge-harness/prepare-run.ts`, `skills-contrib/drive-judge-harness/test/prepare-run.test.ts`	Prepares an isolated run directory by computing commit SHAs via `git rev-parse`, creating a detached worktree, overlaying skill-bundle files using `git archive` and `tar`, optionally materializing dependencies (`pnpm install`), and creating a baseline commit. Tests validate SHA resolution, file overlay with base file preservation, 40-character commit generation, and materialization status tracking.
Run collection and diff analysis `skills-contrib/drive-judge-harness/collect-run.ts`, `skills-contrib/drive-judge-harness/test/collect-run.test.ts`	Collects trace metadata from JSONL files with schema validation, selects traces by `orchestrator_agent_id` or newest mtime, and computes diffs/statistics relative to the baseline commit using `git diff --numstat`. Tests cover trace discovery, schema conformance, agent ID matching, and diff exclusion of injected skill files (`skill.md`, `AGENTS.md`).
Run orchestration and CLI `skills-contrib/drive-judge-harness/run-arm.ts`, `skills-contrib/drive-judge-harness/test/run-arm.test.ts`	Orchestrates the full workflow: prepares, runs one brief, collects results, and writes an enriched manifest. Includes CLI with flag parsing (`--repo`, `--base-ref`, `--bundle`, `--run-dir`, etc.), defaults computation, and exit code mapping. Tests validate manifest enrichment with pinned metadata, round-trip persistence, and dry-run status preservation.
Working directory integration with run-one-brief `skills-contrib/drive-judge-harness/run-one-brief.ts`, `skills-contrib/drive-judge-harness/sdk-adapter.ts`, `skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts`, `skills-contrib/drive-judge-harness/test/run-one-brief.test.ts`	Adds `runDir` field to `RunOneBriefConfig` and `cwd` parameter to `CreateAgent` callback. Live-path invocation passes `cwd: config.runDir` to agent creation, while `createCursorAgent` passes `cwd` to `Agent.create` instead of hardcoded `process.cwd()`. New test suite verifies cwd threading in live mode; updated existing tests wire `runDir` parameter across all invocations.
Documentation and test wiring `skills-contrib/drive-judge-harness/SKILL.md`, `package.json`	Updates SKILL.md to document run-arm workflow scope, pinned skill-bundle pipeline steps, and CLI usage examples for both dry-run and live execution with `CURSOR_API_KEY` requirement. Updates `package.json` test script to include new harness test files.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

prisma/prisma-next#641: Extends run-one-brief with runDir/cwd support in the same harness module with near-identical signature changes.
prisma/prisma-next#613: Both expand package.json test:scripts to include additional skills-contrib test files.

Suggested reviewers

aqrln

Poem

🐰 Behold the judge who runs so fleet,
Prepares the ground, collects the beat,
With traces traced and diffs so clean,
A harness born to orchestrate the scene.
Run, collect, and write the scroll—
The rabbit's work now makes us whole! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 8.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main objective of the PR: pinning the skill bundle as a run input and introducing the core pipeline components (prepare, collect, run-arm) for this feature.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

pkg-pr-new · 2026-05-31T13:05:27Z

@prisma-next/extension-author-tools

npm i https://pkg.pr.new/@prisma-next/extension-author-tools@656

@prisma-next/mongo-runtime

npm i https://pkg.pr.new/@prisma-next/mongo-runtime@656

@prisma-next/family-mongo

npm i https://pkg.pr.new/@prisma-next/family-mongo@656

@prisma-next/sql-runtime

npm i https://pkg.pr.new/@prisma-next/sql-runtime@656

@prisma-next/family-sql

npm i https://pkg.pr.new/@prisma-next/family-sql@656

@prisma-next/extension-arktype-json

npm i https://pkg.pr.new/@prisma-next/extension-arktype-json@656

@prisma-next/middleware-cache

npm i https://pkg.pr.new/@prisma-next/middleware-cache@656

@prisma-next/mongo

npm i https://pkg.pr.new/@prisma-next/mongo@656

@prisma-next/extension-paradedb

npm i https://pkg.pr.new/@prisma-next/extension-paradedb@656

@prisma-next/extension-pgvector

npm i https://pkg.pr.new/@prisma-next/extension-pgvector@656

@prisma-next/extension-postgis

npm i https://pkg.pr.new/@prisma-next/extension-postgis@656

@prisma-next/postgres

npm i https://pkg.pr.new/@prisma-next/postgres@656

@prisma-next/sql-orm-client

npm i https://pkg.pr.new/@prisma-next/sql-orm-client@656

@prisma-next/sqlite

npm i https://pkg.pr.new/@prisma-next/sqlite@656

@prisma-next/target-mongo

npm i https://pkg.pr.new/@prisma-next/target-mongo@656

@prisma-next/adapter-mongo

npm i https://pkg.pr.new/@prisma-next/adapter-mongo@656

@prisma-next/driver-mongo

npm i https://pkg.pr.new/@prisma-next/driver-mongo@656

@prisma-next/contract

npm i https://pkg.pr.new/@prisma-next/contract@656

@prisma-next/utils

npm i https://pkg.pr.new/@prisma-next/utils@656

@prisma-next/config

npm i https://pkg.pr.new/@prisma-next/config@656

@prisma-next/errors

npm i https://pkg.pr.new/@prisma-next/errors@656

@prisma-next/framework-components

npm i https://pkg.pr.new/@prisma-next/framework-components@656

@prisma-next/operations

npm i https://pkg.pr.new/@prisma-next/operations@656

@prisma-next/ts-render

npm i https://pkg.pr.new/@prisma-next/ts-render@656

@prisma-next/contract-authoring

npm i https://pkg.pr.new/@prisma-next/contract-authoring@656

@prisma-next/ids

npm i https://pkg.pr.new/@prisma-next/ids@656

@prisma-next/psl-parser

npm i https://pkg.pr.new/@prisma-next/psl-parser@656

@prisma-next/psl-printer

npm i https://pkg.pr.new/@prisma-next/psl-printer@656

@prisma-next/cli

npm i https://pkg.pr.new/@prisma-next/cli@656

@prisma-next/cli-telemetry

npm i https://pkg.pr.new/@prisma-next/cli-telemetry@656

@prisma-next/emitter

npm i https://pkg.pr.new/@prisma-next/emitter@656

@prisma-next/migration-tools

npm i https://pkg.pr.new/@prisma-next/migration-tools@656

prisma-next

npm i https://pkg.pr.new/prisma-next@656

@prisma-next/vite-plugin-contract-emit

npm i https://pkg.pr.new/@prisma-next/vite-plugin-contract-emit@656

@prisma-next/mongo-codec

npm i https://pkg.pr.new/@prisma-next/mongo-codec@656

@prisma-next/mongo-contract

npm i https://pkg.pr.new/@prisma-next/mongo-contract@656

@prisma-next/mongo-value

npm i https://pkg.pr.new/@prisma-next/mongo-value@656

@prisma-next/mongo-contract-psl

npm i https://pkg.pr.new/@prisma-next/mongo-contract-psl@656

@prisma-next/mongo-contract-ts

npm i https://pkg.pr.new/@prisma-next/mongo-contract-ts@656

@prisma-next/mongo-emitter

npm i https://pkg.pr.new/@prisma-next/mongo-emitter@656

@prisma-next/mongo-schema-ir

npm i https://pkg.pr.new/@prisma-next/mongo-schema-ir@656

@prisma-next/mongo-query-ast

npm i https://pkg.pr.new/@prisma-next/mongo-query-ast@656

@prisma-next/mongo-orm

npm i https://pkg.pr.new/@prisma-next/mongo-orm@656

@prisma-next/mongo-query-builder

npm i https://pkg.pr.new/@prisma-next/mongo-query-builder@656

@prisma-next/mongo-lowering

npm i https://pkg.pr.new/@prisma-next/mongo-lowering@656

@prisma-next/mongo-wire

npm i https://pkg.pr.new/@prisma-next/mongo-wire@656

@prisma-next/sql-contract

npm i https://pkg.pr.new/@prisma-next/sql-contract@656

@prisma-next/sql-errors

npm i https://pkg.pr.new/@prisma-next/sql-errors@656

@prisma-next/sql-operations

npm i https://pkg.pr.new/@prisma-next/sql-operations@656

@prisma-next/sql-schema-ir

npm i https://pkg.pr.new/@prisma-next/sql-schema-ir@656

@prisma-next/sql-contract-psl

npm i https://pkg.pr.new/@prisma-next/sql-contract-psl@656

@prisma-next/sql-contract-ts

npm i https://pkg.pr.new/@prisma-next/sql-contract-ts@656

@prisma-next/sql-contract-emitter

npm i https://pkg.pr.new/@prisma-next/sql-contract-emitter@656

@prisma-next/sql-lane-query-builder

npm i https://pkg.pr.new/@prisma-next/sql-lane-query-builder@656

@prisma-next/sql-relational-core

npm i https://pkg.pr.new/@prisma-next/sql-relational-core@656

@prisma-next/sql-builder

npm i https://pkg.pr.new/@prisma-next/sql-builder@656

@prisma-next/target-postgres

npm i https://pkg.pr.new/@prisma-next/target-postgres@656

@prisma-next/target-sqlite

npm i https://pkg.pr.new/@prisma-next/target-sqlite@656

@prisma-next/adapter-postgres

npm i https://pkg.pr.new/@prisma-next/adapter-postgres@656

@prisma-next/adapter-sqlite

npm i https://pkg.pr.new/@prisma-next/adapter-sqlite@656

@prisma-next/driver-postgres

npm i https://pkg.pr.new/@prisma-next/driver-postgres@656

@prisma-next/driver-sqlite

npm i https://pkg.pr.new/@prisma-next/driver-sqlite@656

commit: 3ff3e94

github-actions · 2026-05-31T13:05:33Z

size-limit report 📦

Path	Size
postgres / no-emit	135.37 KB (0%)
postgres / emit	125.16 KB (0%)
mongo / no-emit	75.14 KB (0%)
mongo / emit	70.15 KB (0%)

wmadden added 4 commits May 31, 2026 14:30

chore(drive-judge-harness): record run-setup dispatch trace (round 2 …

3ff3e94

…satisfied) Signed-off-by: Will Madden <madden@prisma.io>

wmadden-electric requested a review from a team as a code owner May 31, 2026 13:03

wmadden approved these changes May 31, 2026

wmadden-electric merged commit f779815 into main May 31, 2026
21 checks passed

wmadden-electric deleted the tml-2755-run-setup branch May 31, 2026 15:03

wmadden-electric mentioned this pull request May 31, 2026

TML-2759: run the harness on the Claude Agent SDK by default (Cursor decoupled) + faithful run recording #657

Open

5 tasks

2 participants
