TML-2755: Pin the skill bundle under test (run-setup: prepare / collect / run-arm)#656

Merged: wmadden-electric merged 4 commits into main from tml-2755-run-setup on May 31, 2026

@wmadden-electric wmadden-electric commented May 31, 2026

At a glance

A Drive run's quality is a function of the skill version that drove it — so to measure a skill change, the run input must include which skills ran. This slice makes the skill bundle a first-class, pinned run input alongside model and (brief, base). The harness can now isolate a checkout, inject a specified skill bundle, run the orchestrator there, and collect the run's trace + agent-only diff:

pnpm drive:run-arm \
  --case projects/drive-judge-harness/assets/golden/i12-halt-storage-assumption \
  --model composer-2.5 \
  --base-ref main --skill-bundle-ref main \
  --run-dir /tmp/arm-1 --live

prepare-run isolates + injects + materializes; run-one-brief spawns the orchestrator in that checkout; collect-run harvests the emitted trace and the agent's diff. The run manifest records base_sha, skill_bundle_sha, run_dir, the collected trace paths, and the diff stat.
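
As a rough illustration of how the manifest might be enriched once prepare and collect have finished, here is a minimal sketch. The snake_case field names follow the ones listed in this PR (base_ref, base_sha, skill_bundle_ref, skill_bundle_sha, run_dir, collected_trace_paths, diff_stat, materialized). Everything else is an assumption made for illustration: the value types, the DiffStat member names, the BaseManifest stand-in, the PreparedRun / CollectedRun shapes, and the enrichManifest helper are hypothetical and are not taken from manifest.ts or run-arm.ts.

```typescript
// Field names follow the PR text; every type below is an assumption.
interface DiffStat {
  files: number;
  insertions: number;
  deletions: number;
}

// Hypothetical stand-in for the manifest fields that predate this slice.
interface BaseManifest {
  model: string;
  status: string;
}

// The additive, optional pinned-input fields.
interface PinnedRunFields {
  base_ref?: string;
  base_sha?: string;
  skill_bundle_ref?: string;
  skill_bundle_sha?: string;
  run_dir?: string;
  collected_trace_paths?: string[];
  diff_stat?: DiffStat;
  materialized?: boolean;
}

type RunManifest = BaseManifest & PinnedRunFields;

// Hypothetical result shapes for the prepare and collect steps.
interface PreparedRun {
  baseRef: string;
  baseSha: string;
  skillBundleRef: string;
  skillBundleSha: string;
  runDir: string;
  materialized: boolean;
}

interface CollectedRun {
  tracePaths: string[];
  diffStat: DiffStat;
}

// Merge what prepare and collect learned into the manifest produced by the run.
function enrichManifest(
  base: BaseManifest,
  prepared: PreparedRun,
  collected: CollectedRun
): RunManifest {
  return {
    ...base,
    base_ref: prepared.baseRef,
    base_sha: prepared.baseSha,
    skill_bundle_ref: prepared.skillBundleRef,
    skill_bundle_sha: prepared.skillBundleSha,
    run_dir: prepared.runDir,
    collected_trace_paths: collected.tracePaths,
    diff_stat: collected.diffStat,
    materialized: prepared.materialized,
  };
}
```

Keeping the new fields optional matches the "additive" framing: a manifest written before this slice would still satisfy the type.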

The decision

The materialized skill trees (.cursor/, .claude/, .agents/skills/) are gitignored, so a clean checkout has no skills until something runs the prepare hook. That means every meaningful run needs a setup step — even one against current main — not just A/B runs. We make that step explicit and reproducible:

  • A skill bundle is a git ref, not an ad-hoc file copy — a commit/branch/tag whose canonical homes (skills-contrib/, .agents/rules/, AGENTS.md/CLAUDE.md) define it. Materialization reuses the repo's own prepare hook (skills add + sync-agent-rules); the harness does not reinvent it.
  • repo-under-test and skill-bundle-under-test are named apart even though they're the same repo today, so the eventual move of the drive skills to their own host repo costs no rework.
  • The agent's diff is cut at a baseline commit. prepare-run finalizes with a git commit --allow-empty after overlay + materialize; collect-run diffs against that commit, so the injected skill overlay never pollutes the agent's changes. (The --allow-empty matters: when base == bundle the overlay is byte-identical and stages nothing — this was caught and fixed in review.)
  • Traces are collected post-hoc by globbing the checkout for schema-valid *.jsonl (matched by orchestrator_agent_id, else newest), leaving emit.ts and the emission protocol untouched.
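
The "matched by orchestrator_agent_id, else newest" rule can be sketched as a small pure function. This is only an illustration of the selection rule as described here, not the code in collect-run.ts: the TraceCandidate shape, the selectTraces name, and the choice to return every matching trace (versus a single newest one on fallback) are assumptions.

```typescript
// Hypothetical record for one schema-valid *.jsonl file found under the checkout.
interface TraceCandidate {
  path: string;
  orchestratorAgentId: string; // assumed to be read from the trace's first event
  mtimeMs: number; // file modification time in milliseconds
}

// Prefer traces whose orchestrator id matches; otherwise fall back to the newest file.
function selectTraces(candidates: TraceCandidate[], agentId?: string): TraceCandidate[] {
  if (agentId !== undefined) {
    const matched = candidates.filter((c) => c.orchestratorAgentId === agentId);
    if (matched.length > 0) return matched;
  }
  if (candidates.length === 0) return [];
  // No id given, or no match: keep only the most recently modified trace.
  const newest = candidates.reduce((a, b) => (b.mtimeMs > a.mtimeMs ? b : a));
  return [newest];
}
```

Because the selection works only on files already on disk, it fits the post-hoc design: nothing in the emitter has to know where the harness will look.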

This is the run-production half of the always-anticipated experiment-engine split (project plan § Sequencing rationale): it unblocks live corpus generation on its own and is the prerequisite for the k=N A/B engine (TML-2737), where an arm is exactly (brief+base, model, skill-bundle) with one axis varied.
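
To make "one axis varied" concrete, here is a hedged sketch of how two arms could be compared. The A/B engine itself is out of scope for this PR, so the Arm shape, the Axis labels, and both helper names are hypothetical. The sketch compares resolved SHAs rather than refs, on the assumption that pinned inputs are what a comparison should key on.

```typescript
// Hypothetical arm shape: (brief+base, model, skill-bundle), per the PR text.
interface Arm {
  casePath: string; // the brief
  baseSha: string; // resolved base commit the brief runs against
  model: string;
  skillBundleSha: string; // resolved skill-bundle commit
}

type Axis = "brief+base" | "model" | "skill-bundle";

// List the axes on which two arms differ.
function variedAxes(a: Arm, b: Arm): Axis[] {
  const axes: Axis[] = [];
  if (a.casePath !== b.casePath || a.baseSha !== b.baseSha) axes.push("brief+base");
  if (a.model !== b.model) axes.push("model");
  if (a.skillBundleSha !== b.skillBundleSha) axes.push("skill-bundle");
  return axes;
}

// A comparison isolates one variable only when exactly one axis differs.
function isSingleAxisComparison(a: Arm, b: Arm): boolean {
  return variedAxes(a, b).length === 1;
}
```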

What's in the diff

  • New: prepare-run.ts, collect-run.ts, run-arm.ts (+ 4 test suites). Additive RunManifest fields. A cwd thread-through in run-one-brief.ts / sdk-adapter.ts (the spawn was hard-pinned to process.cwd()).
  • Planning: the slice spec + plan; design-notes.md § Run setup; plan.md records the executed split (now 5 slices, with the project-boundary rationale).
  • Invariant preserved: @cursor/sdk is still reached only via sdk-adapter.ts's dynamic import on the --live path — none of the new modules or their tests import it. Everything is exercised with the SDK absent and no CURSOR_API_KEY (git + fs + mocked materialize/agent).

Gates

node --test 570/570 · pnpm lint:deps pass · pnpm lint:casts pass (delta=0) · pnpm typecheck unchanged (pre-existing target-postgres failures only; skills-contrib is not in the typecheck graph).

Scope (deliberately out)

The k=N A/B loop, cross-run aggregation, dashboard, and CI regression gate (TML-2737); the env-pinned trace destination (recorded escape hatch); admitting @cursor/sdk into the lockfile + an actual live run (operator-gated).

Linear: TML-2755 (blocks TML-2737)

Summary by CodeRabbit

  • New Features

    • Introduced pinned skill-bundle pipeline for reproducible run workflows with automated trace metadata collection and git diff statistics reporting
    • Enhanced run manifest enrichment with baseline commit tracking and materialization status
  • Documentation

    • Expanded harness documentation with detailed guides on run preparation, artifact collection, and CLI command examples for dry-run and live execution modes
  • Tests

    • Added comprehensive test coverage for trace collection, run preparation, and end-to-end workflow validation

wmadden added 4 commits May 31, 2026 14:30
Capture the skill-bundle-under-test as a first-class pinned run input
(design-notes § Run setup) and decompose the run-production capability
into the run-setup slice: prepare-run (isolate + inject + materialize +
baseline commit), collect-run (post-hoc trace/diff harvest), and a
run-arm wrapper. This is the run-production half of the always-anticipated
experiment-engine split; it lands ahead of the A/B loop and unblocks live
corpus generation. (TML-2755, blocks TML-2737)

Signed-off-by: Will Madden <madden@prisma.io>

…n-arm)

Implements the run-setup slice (TML-2755) so the skill bundle under test
becomes a first-class, pinned run input:

- prepare-run.ts: isolates a detached git worktree at baseRef, overlays the
  skill bundle's canonical home dirs (skills-contrib, .agents/rules, AGENTS.md,
  CLAUDE.md) via git archive | tar -x, materializes via the repo's own prepare
  hook, and finalizes a baseline commit so the agent's diff is cleanly separable
  from the injected skills.

- collect-run.ts: post-hoc collection — globs *.jsonl under runDir, keeps those
  whose first line validates against Slice1TraceEvent, matches by
  orchestrator_agent_id (falling back to newest), and computes diff/diffStat
  against the baseline commit (not baseRef), so injected skill files are excluded
  from the collected diff.

- run-arm.ts: thin CLI + runArm(config, deps) composing the pipeline prepareRun
  to runOneBrief({runDir}) to collectRun and writing an enriched manifest that
  carries base_ref, base_sha, skill_bundle_ref, skill_bundle_sha, run_dir,
  collected_trace_paths, diff_stat, and materialized.

- run-one-brief.ts / sdk-adapter.ts: thread runDir through as cwd so the
  orchestrator spawns inside the prepared checkout rather than the harness
  process working directory.

- manifest.ts: additive optional fields for the pinned-input metadata.

Tests (568 total, all passing):
- prepare-run.test.ts (8 tests) with real git fixture and mocked materialize
- collect-run.test.ts (9 tests) including required diff-exclusion proof
- run-one-brief-cwd.test.ts (2 tests) asserting cwd === runDir is threaded
- run-arm.test.ts (4 tests) pipeline composition with manifest round-trip

No @cursor/sdk import on any of these paths. All gates green:
- pnpm lint:deps: pass
- pnpm lint:casts: pass (delta=0)
- node --test all new + existing harness suites: 568 pass, 0 fail
- pnpm typecheck: pre-existing failures unchanged (target-postgres/mongo/cli only;
  skills-contrib is not a turbo package)

Signed-off-by: Will Madden <madden@prisma.io>

When baseRef === skillBundle.ref (prepare a run against current main with the
current skill bundle), the overlay is byte-identical to the base checkout and
the materialized trees are gitignored, so git add -A stages nothing and the
baseline commit exited non-zero. Use git commit --allow-empty so prepareCommit
is always a real commit usable as the collect-run diff cut point, regardless of
whether the overlay introduced changes.

Adds two tests: the no-op-overlay case yields a 40-char prepareCommit without
throwing, and an end-to-end prepareRun + collectRun proving the cut point holds
with an empty overlay (zero changes before agent work, picks up a post-baseline
change).
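
A minimal sketch of the baseline step described in this commit, assuming git is reached through an injected runner so the logic can be exercised without a real repository. The GitRunner type, the finalizeBaseline name, and the commit message text are hypothetical; the git subcommands and flags used (add -A, commit --allow-empty -m, rev-parse HEAD) are standard git.

```typescript
// Hypothetical runner: executes `git <args>` in the run directory and returns trimmed stdout.
type GitRunner = (args: string[]) => string;

// Stage everything, commit even if nothing was staged, and return the commit SHA.
function finalizeBaseline(git: GitRunner): string {
  git(["add", "-A"]);
  // --allow-empty keeps the commit from failing when the overlay changed nothing.
  git(["commit", "--allow-empty", "-m", "harness baseline"]);
  // The resulting SHA is the cut point that a later diff is taken against.
  return git(["rev-parse", "HEAD"]);
}
```

With a real runner, the returned SHA plays the role of prepareCommit: the agent's diff is everything after it, whether or not the overlay itself produced changes.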

Signed-off-by: Will Madden <madden@prisma.io>

…satisfied)

Signed-off-by: Will Madden <madden@prisma.io>
@wmadden-electric wmadden-electric requested a review from a team as a code owner May 31, 2026 13:03

coderabbitai Bot commented May 31, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 8f60aae3-b368-4189-83b9-0634b3bfff3f

📥 Commits

Reviewing files that changed from the base of the PR and between 2e120e5 and 3ff3e94.

⛔ Files ignored due to path filters (5)
  • projects/drive-judge-harness/design-notes.md is excluded by !projects/**
  • projects/drive-judge-harness/plan.md is excluded by !projects/**
  • projects/drive-judge-harness/slices/run-setup/plan.md is excluded by !projects/**
  • projects/drive-judge-harness/slices/run-setup/spec.md is excluded by !projects/**
  • projects/drive-judge-harness/slices/run-setup/trace.jsonl is excluded by !projects/**
📒 Files selected for processing (13)
  • package.json
  • skills-contrib/drive-judge-harness/SKILL.md
  • skills-contrib/drive-judge-harness/collect-run.ts
  • skills-contrib/drive-judge-harness/manifest.ts
  • skills-contrib/drive-judge-harness/prepare-run.ts
  • skills-contrib/drive-judge-harness/run-arm.ts
  • skills-contrib/drive-judge-harness/run-one-brief.ts
  • skills-contrib/drive-judge-harness/sdk-adapter.ts
  • skills-contrib/drive-judge-harness/test/collect-run.test.ts
  • skills-contrib/drive-judge-harness/test/prepare-run.test.ts
  • skills-contrib/drive-judge-harness/test/run-arm.test.ts
  • skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts
  • skills-contrib/drive-judge-harness/test/run-one-brief.test.ts

📝 Walkthrough

This PR adds a complete "prepare → run → collect → manifest" workflow to the drive-judge-harness. It introduces modules for isolated run directory preparation with skill-bundle overlay, trace/diff collection, and orchestration, while integrating configurable working directories into the existing run-one-brief harness through a new cwd parameter.

Changes

Drive Judge Harness: Run-Arm Workflow and Integration

| Layer | File(s) | Summary |
| --- | --- | --- |
| Manifest data model extensions | skills-contrib/drive-judge-harness/manifest.ts | Extends RunManifest with fields for pinned input refs/SHAs (base_ref, skill_bundle_ref, etc.), run outputs (collected_trace_paths, diff_stat), and materialized flag. Introduces DiffStat type for file and insertion/deletion counts. |
| Run preparation pipeline | skills-contrib/drive-judge-harness/prepare-run.ts, skills-contrib/drive-judge-harness/test/prepare-run.test.ts | Prepares an isolated run directory by computing commit SHAs via git rev-parse, creating a detached worktree, overlaying skill-bundle files using git archive and tar, optionally materializing dependencies (pnpm install), and creating a baseline commit. Tests validate SHA resolution, file overlay with base file preservation, 40-character commit generation, and materialization status tracking. |
| Run collection and diff analysis | skills-contrib/drive-judge-harness/collect-run.ts, skills-contrib/drive-judge-harness/test/collect-run.test.ts | Collects trace metadata from JSONL files with schema validation, selects traces by orchestrator_agent_id or newest mtime, and computes diffs/statistics relative to the baseline commit using git diff --numstat. Tests cover trace discovery, schema conformance, agent ID matching, and diff exclusion of injected skill files (skill.md, AGENTS.md). |
| Run orchestration and CLI | skills-contrib/drive-judge-harness/run-arm.ts, skills-contrib/drive-judge-harness/test/run-arm.test.ts | Orchestrates the full workflow: prepares, runs one brief, collects results, and writes an enriched manifest. Includes CLI with flag parsing (--repo, --base-ref, --bundle, --run-dir, etc.), defaults computation, and exit code mapping. Tests validate manifest enrichment with pinned metadata, round-trip persistence, and dry-run status preservation. |
| Working directory integration with run-one-brief | skills-contrib/drive-judge-harness/run-one-brief.ts, skills-contrib/drive-judge-harness/sdk-adapter.ts, skills-contrib/drive-judge-harness/test/run-one-brief-cwd.test.ts, skills-contrib/drive-judge-harness/test/run-one-brief.test.ts | Adds runDir field to RunOneBriefConfig and cwd parameter to CreateAgent callback. Live-path invocation passes cwd: config.runDir to agent creation, while createCursorAgent passes cwd to Agent.create instead of hardcoded process.cwd(). New test suite verifies cwd threading in live mode; updated existing tests wire runDir parameter across all invocations. |
| Documentation and test wiring | skills-contrib/drive-judge-harness/SKILL.md, package.json | Updates SKILL.md to document run-arm workflow scope, pinned skill-bundle pipeline steps, and CLI usage examples for both dry-run and live execution with CURSOR_API_KEY requirement. Updates package.json test script to include new harness test files. |
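
For readers unfamiliar with git diff --numstat, the sketch below shows one way its output could be folded into file and insertion/deletion counts. The output format is standard git: one line per file with added, deleted, and path separated by tabs, and "-" in both count columns for binary files. The DiffStat member names and the parseNumstat helper are assumptions for illustration, not the code in collect-run.ts.

```typescript
// Assumed shape: the walkthrough only says DiffStat holds file and insertion/deletion counts.
interface DiffStat {
  files: number;
  insertions: number;
  deletions: number;
}

// Fold `git diff --numstat` output ("<added>\t<deleted>\t<path>" per line) into totals.
function parseNumstat(output: string): DiffStat {
  const stat: DiffStat = { files: 0, insertions: 0, deletions: 0 };
  for (const line of output.split("\n")) {
    const fields = line.split("\t");
    if (fields.length < 3) continue; // skips blank or malformed lines
    const added = fields[0] ?? "-";
    const deleted = fields[1] ?? "-";
    stat.files += 1;
    // Binary files report "-" for both counts; count the file but no lines.
    stat.insertions += added === "-" ? 0 : parseInt(added, 10);
    stat.deletions += deleted === "-" ? 0 : parseInt(deleted, 10);
  }
  return stat;
}
```

Run against the baseline commit rather than baseRef, such a summary would cover only what the agent changed, which is the separation the baseline commit exists to provide.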

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • prisma/prisma-next#641: Extends run-one-brief with runDir/cwd support in the same harness module with near-identical signature changes.
  • prisma/prisma-next#613: Both expand package.json test:scripts to include additional skills-contrib test files.

Suggested reviewers

  • aqrln

Poem

🐰 Behold the judge who runs so fleet,
Prepares the ground, collects the beat,
With traces traced and diffs so clean,
A harness born to orchestrate the scene.
Run, collect, and write the scroll—
The rabbit's work now makes us whole! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective of the PR: pinning the skill bundle as a run input and introducing the core pipeline components (prepare, collect, run-arm) for this feature. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.




pkg-pr-new Bot commented May 31, 2026

@prisma-next/extension-author-tools

npm i https://pkg.pr.new/@prisma-next/extension-author-tools@656

@prisma-next/mongo-runtime

npm i https://pkg.pr.new/@prisma-next/mongo-runtime@656

@prisma-next/family-mongo

npm i https://pkg.pr.new/@prisma-next/family-mongo@656

@prisma-next/sql-runtime

npm i https://pkg.pr.new/@prisma-next/sql-runtime@656

@prisma-next/family-sql

npm i https://pkg.pr.new/@prisma-next/family-sql@656

@prisma-next/extension-arktype-json

npm i https://pkg.pr.new/@prisma-next/extension-arktype-json@656

@prisma-next/middleware-cache

npm i https://pkg.pr.new/@prisma-next/middleware-cache@656

@prisma-next/mongo

npm i https://pkg.pr.new/@prisma-next/mongo@656

@prisma-next/extension-paradedb

npm i https://pkg.pr.new/@prisma-next/extension-paradedb@656

@prisma-next/extension-pgvector

npm i https://pkg.pr.new/@prisma-next/extension-pgvector@656

@prisma-next/extension-postgis

npm i https://pkg.pr.new/@prisma-next/extension-postgis@656

@prisma-next/postgres

npm i https://pkg.pr.new/@prisma-next/postgres@656

@prisma-next/sql-orm-client

npm i https://pkg.pr.new/@prisma-next/sql-orm-client@656

@prisma-next/sqlite

npm i https://pkg.pr.new/@prisma-next/sqlite@656

@prisma-next/target-mongo

npm i https://pkg.pr.new/@prisma-next/target-mongo@656

@prisma-next/adapter-mongo

npm i https://pkg.pr.new/@prisma-next/adapter-mongo@656

@prisma-next/driver-mongo

npm i https://pkg.pr.new/@prisma-next/driver-mongo@656

@prisma-next/contract

npm i https://pkg.pr.new/@prisma-next/contract@656

@prisma-next/utils

npm i https://pkg.pr.new/@prisma-next/utils@656

@prisma-next/config

npm i https://pkg.pr.new/@prisma-next/config@656

@prisma-next/errors

npm i https://pkg.pr.new/@prisma-next/errors@656

@prisma-next/framework-components

npm i https://pkg.pr.new/@prisma-next/framework-components@656

@prisma-next/operations

npm i https://pkg.pr.new/@prisma-next/operations@656

@prisma-next/ts-render

npm i https://pkg.pr.new/@prisma-next/ts-render@656

@prisma-next/contract-authoring

npm i https://pkg.pr.new/@prisma-next/contract-authoring@656

@prisma-next/ids

npm i https://pkg.pr.new/@prisma-next/ids@656

@prisma-next/psl-parser

npm i https://pkg.pr.new/@prisma-next/psl-parser@656

@prisma-next/psl-printer

npm i https://pkg.pr.new/@prisma-next/psl-printer@656

@prisma-next/cli

npm i https://pkg.pr.new/@prisma-next/cli@656

@prisma-next/cli-telemetry

npm i https://pkg.pr.new/@prisma-next/cli-telemetry@656

@prisma-next/emitter

npm i https://pkg.pr.new/@prisma-next/emitter@656

@prisma-next/migration-tools

npm i https://pkg.pr.new/@prisma-next/migration-tools@656

prisma-next

npm i https://pkg.pr.new/prisma-next@656

@prisma-next/vite-plugin-contract-emit

npm i https://pkg.pr.new/@prisma-next/vite-plugin-contract-emit@656

@prisma-next/mongo-codec

npm i https://pkg.pr.new/@prisma-next/mongo-codec@656

@prisma-next/mongo-contract

npm i https://pkg.pr.new/@prisma-next/mongo-contract@656

@prisma-next/mongo-value

npm i https://pkg.pr.new/@prisma-next/mongo-value@656

@prisma-next/mongo-contract-psl

npm i https://pkg.pr.new/@prisma-next/mongo-contract-psl@656

@prisma-next/mongo-contract-ts

npm i https://pkg.pr.new/@prisma-next/mongo-contract-ts@656

@prisma-next/mongo-emitter

npm i https://pkg.pr.new/@prisma-next/mongo-emitter@656

@prisma-next/mongo-schema-ir

npm i https://pkg.pr.new/@prisma-next/mongo-schema-ir@656

@prisma-next/mongo-query-ast

npm i https://pkg.pr.new/@prisma-next/mongo-query-ast@656

@prisma-next/mongo-orm

npm i https://pkg.pr.new/@prisma-next/mongo-orm@656

@prisma-next/mongo-query-builder

npm i https://pkg.pr.new/@prisma-next/mongo-query-builder@656

@prisma-next/mongo-lowering

npm i https://pkg.pr.new/@prisma-next/mongo-lowering@656

@prisma-next/mongo-wire

npm i https://pkg.pr.new/@prisma-next/mongo-wire@656

@prisma-next/sql-contract

npm i https://pkg.pr.new/@prisma-next/sql-contract@656

@prisma-next/sql-errors

npm i https://pkg.pr.new/@prisma-next/sql-errors@656

@prisma-next/sql-operations

npm i https://pkg.pr.new/@prisma-next/sql-operations@656

@prisma-next/sql-schema-ir

npm i https://pkg.pr.new/@prisma-next/sql-schema-ir@656

@prisma-next/sql-contract-psl

npm i https://pkg.pr.new/@prisma-next/sql-contract-psl@656

@prisma-next/sql-contract-ts

npm i https://pkg.pr.new/@prisma-next/sql-contract-ts@656

@prisma-next/sql-contract-emitter

npm i https://pkg.pr.new/@prisma-next/sql-contract-emitter@656

@prisma-next/sql-lane-query-builder

npm i https://pkg.pr.new/@prisma-next/sql-lane-query-builder@656

@prisma-next/sql-relational-core

npm i https://pkg.pr.new/@prisma-next/sql-relational-core@656

@prisma-next/sql-builder

npm i https://pkg.pr.new/@prisma-next/sql-builder@656

@prisma-next/target-postgres

npm i https://pkg.pr.new/@prisma-next/target-postgres@656

@prisma-next/target-sqlite

npm i https://pkg.pr.new/@prisma-next/target-sqlite@656

@prisma-next/adapter-postgres

npm i https://pkg.pr.new/@prisma-next/adapter-postgres@656

@prisma-next/adapter-sqlite

npm i https://pkg.pr.new/@prisma-next/adapter-sqlite@656

@prisma-next/driver-postgres

npm i https://pkg.pr.new/@prisma-next/driver-postgres@656

@prisma-next/driver-sqlite

npm i https://pkg.pr.new/@prisma-next/driver-sqlite@656

commit: 3ff3e94

@github-actions

size-limit report 📦

| Path | Size |
| --- | --- |
| postgres / no-emit | 135.37 KB (0%) |
| postgres / emit | 125.16 KB (0%) |
| mongo / no-emit | 75.14 KB (0%) |
| mongo / emit | 70.15 KB (0%) |

@wmadden-electric wmadden-electric merged commit f779815 into main May 31, 2026
21 checks passed
@wmadden-electric wmadden-electric deleted the tml-2755-run-setup branch May 31, 2026 15:03