TML-2755: Pin the skill bundle under test (run-setup: prepare / collect / run-arm) #656
Capture the skill-bundle-under-test as a first-class pinned run input (design-notes § Run setup) and decompose the run-production capability into the run-setup slice: prepare-run (isolate + inject + materialize + baseline commit), collect-run (post-hoc trace/diff harvest), and a run-arm wrapper. This is the run-production half of the always-anticipated experiment-engine split; it lands ahead of the A/B loop and unblocks live corpus generation. (TML-2755, blocks TML-2737) Signed-off-by: Will Madden <madden@prisma.io>
…n-arm)
Implements the run-setup slice (TML-2755) so the skill bundle under test
becomes a first-class, pinned run input:
- prepare-run.ts: isolates a detached git worktree at baseRef, overlays the
skill bundle's canonical home dirs (skills-contrib, .agents/rules, AGENTS.md,
CLAUDE.md) via git archive | tar -x, materializes via the repo's own prepare
hook, and finalizes a baseline commit so the agent's diff is cleanly separable
from the injected skills.
- collect-run.ts: post-hoc collection — globs *.jsonl under runDir, keeps those
whose first line validates against Slice1TraceEvent, matches by
orchestrator_agent_id (falling back to newest), and computes diff/diffStat
against the baseline commit (not baseRef), so injected skill files are excluded
from the collected diff.
- run-arm.ts: thin CLI + runArm(config, deps) composing the pipeline prepareRun
to runOneBrief({runDir}) to collectRun and writing an enriched manifest that
carries base_ref, base_sha, skill_bundle_ref, skill_bundle_sha, run_dir,
collected_trace_paths, diff_stat, and materialized.
- run-one-brief.ts / sdk-adapter.ts: thread runDir through as cwd so the
orchestrator spawns inside the prepared checkout rather than the harness
process working directory.
- manifest.ts: additive optional fields for the pinned-input metadata.
Tests (568 total, all passing):
- prepare-run.test.ts (8 tests) with real git fixture and mocked materialize
- collect-run.test.ts (9 tests) including required diff-exclusion proof
- run-one-brief-cwd.test.ts (2 tests) asserting cwd === runDir is threaded
- run-arm.test.ts (4 tests) pipeline composition with manifest round-trip
No @cursor/sdk import on any of these paths. All gates green:
- pnpm lint:deps: pass
- pnpm lint:casts: pass (delta=0)
- node --test all new + existing harness suites: 568 pass, 0 fail
- pnpm typecheck: pre-existing failures unchanged (target-postgres/mongo/cli only;
skills-contrib is not a turbo package)
Signed-off-by: Will Madden <madden@prisma.io>
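The composition described for run-arm.ts above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: the ArmConfig, PreparedRun, CollectedRun, and ArmDeps shapes are hypothetical, the stages are synchronous for brevity, and the function returns the manifest instead of writing it to disk. Only the manifest field names are taken from the commit message.

```typescript
// Hypothetical input and stage-result shapes (assumptions for illustration).
interface ArmConfig {
  baseRef: string;
  skillBundleRef: string;
  brief: string;
}

interface PreparedRun {
  runDir: string;
  baseSha: string;
  skillBundleSha: string;
  prepareCommit: string; // baseline commit created after overlay + materialize
  materialized: boolean;
}

interface CollectedRun {
  tracePaths: string[];
  diffStat: string;
}

// Field names follow the commit message; the types are guesses.
interface ArmManifest {
  base_ref: string;
  base_sha: string;
  skill_bundle_ref: string;
  skill_bundle_sha: string;
  run_dir: string;
  collected_trace_paths: string[];
  diff_stat: string;
  materialized: boolean;
}

// Stages are injected so the composition can be exercised with fakes.
interface ArmDeps {
  prepareRun(config: ArmConfig): PreparedRun;
  runOneBrief(args: { runDir: string; brief: string }): void;
  collectRun(args: { runDir: string; baselineCommit: string }): CollectedRun;
}

function runArm(config: ArmConfig, deps: ArmDeps): ArmManifest {
  // 1. Isolate a checkout, inject the skill bundle, commit the baseline.
  const prepared = deps.prepareRun(config);
  // 2. Run the orchestrator inside the prepared checkout (cwd = runDir).
  deps.runOneBrief({ runDir: prepared.runDir, brief: config.brief });
  // 3. Harvest traces and diff against the baseline commit, not baseRef.
  const collected = deps.collectRun({
    runDir: prepared.runDir,
    baselineCommit: prepared.prepareCommit,
  });
  return {
    base_ref: config.baseRef,
    base_sha: prepared.baseSha,
    skill_bundle_ref: config.skillBundleRef,
    skill_bundle_sha: prepared.skillBundleSha,
    run_dir: prepared.runDir,
    collected_trace_paths: collected.tracePaths,
    diff_stat: collected.diffStat,
    materialized: prepared.materialized,
  };
}
```

Passing the stages in as deps is what lets a pipeline like this be tested without a live agent or a real checkout.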
When baseRef === skillBundle.ref (preparing a run against current main with the current skill bundle), the overlay is byte-identical to the base checkout and the materialized trees are gitignored, so git add -A staged nothing and the baseline commit exited non-zero. Use git commit --allow-empty so that prepareCommit is always a real commit usable as the collect-run diff cut point, regardless of whether the overlay introduced changes. Adds two tests: a no-op-overlay case that yields a 40-char prepareCommit without throwing, and an end-to-end prepareRun + collectRun case proving the cut point holds with an empty overlay (zero changes before agent work; a post-baseline change is picked up). Signed-off-by: Will Madden <madden@prisma.io>
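The baseline-commit step this fix describes might look something like the sketch below. It assumes a hypothetical GitRunner callback (git arguments in, stdout out) in place of whatever process wrapper prepare-run.ts actually uses, and finalizeBaseline is an illustrative name rather than the PR's function.

```typescript
// Hypothetical runner: takes git arguments (without the leading "git") and
// returns stdout. Assumed to throw when git exits non-zero.
type GitRunner = (args: string[]) => string;

function finalizeBaseline(git: GitRunner, message: string): string {
  // Stage whatever the overlay changed; this may legitimately stage nothing.
  git(["add", "-A"]);
  // --allow-empty keeps the commit from failing when nothing is staged,
  // so a baseline commit exists even for a no-op overlay.
  git(["commit", "--allow-empty", "-m", message]);
  const sha = git(["rev-parse", "HEAD"]).trim();
  if (!/^[0-9a-f]{40}$/.test(sha)) {
    throw new Error("unexpected rev-parse output: " + sha);
  }
  return sha;
}
```

In a real harness the runner could wrap Node's execFileSync("git", args, { cwd: runDir, encoding: "utf8" }) from node:child_process; keeping it injectable means the command sequence can be checked without a git binary.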
…satisfied) Signed-off-by: Will Madden <madden@prisma.io>
Walkthrough
This PR adds a complete "prepare → run → collect → manifest" workflow to the drive-judge-harness. It introduces modules for isolated run directory preparation with skill-bundle overlay, trace/diff collection, and orchestration, while integrating configurable working directories into the existing run-one-brief harness.
At a glance
A Drive run's quality is a function of the skill version that drove it, so to measure a skill change, the run input must include which skills ran. This slice makes the skill bundle a first-class, pinned run input alongside `model` and `(brief, base)`. The harness can now isolate a checkout, inject a specified skill bundle, run the orchestrator there, and collect the run's trace + agent-only diff: `prepare-run` isolates + injects + materializes; `run-one-brief` spawns the orchestrator in that checkout; `collect-run` harvests the emitted trace and the agent's diff. The run manifest records `base_sha`, `skill_bundle_sha`, `run_dir`, the collected trace paths, and the diff stat.

The decision

The materialized skill trees (`.cursor/`, `.claude/`, `.agents/skills/`) are gitignored, so a clean checkout has no skills until something runs the `prepare` hook. That means every meaningful run needs a setup step, even one against current `main`, not just A/B runs. We make that step explicit and reproducible:

- The skill bundle is a pinned input: its canonical home dirs (`skills-contrib/`, `.agents/rules/`, `AGENTS.md`/`CLAUDE.md`) define it. Materialization reuses the repo's own `prepare` hook (`skills add` + `sync-agent-rules`); the harness does not reinvent it.
- `prepare-run` finalizes with a `git commit --allow-empty` after overlay + materialize; `collect-run` diffs against that commit, so the injected skill overlay never pollutes the agent's changes. (The `--allow-empty` matters: when base == bundle the overlay is byte-identical and stages nothing; this was caught and fixed in review.)
- `collect-run` gathers traces post hoc from `*.jsonl` (matched by `orchestrator_agent_id`, else newest), leaving `emit.ts` and the emission protocol untouched.

This is the run-production half of the always-anticipated `experiment-engine` split (project plan § Sequencing rationale): it unblocks live corpus generation on its own and is the prerequisite for the k=N A/B engine (TML-2737), where an arm is exactly `(brief+base, model, skill-bundle)` with one axis varied.

What's in the diff

- `prepare-run.ts`, `collect-run.ts`, `run-arm.ts` (+ 4 test suites). Additive `RunManifest` fields. A `cwd` thread-through in `run-one-brief.ts` / `sdk-adapter.ts` (the spawn was hard-pinned to `process.cwd()`).
- `design-notes.md` § Run setup; `plan.md` records the executed split (now 5 slices, with the project-boundary rationale).
- `@cursor/sdk` is still reached only via `sdk-adapter.ts`'s dynamic import on the `--live` path; none of the new modules or their tests import it. Everything is exercised with the SDK absent and no `CURSOR_API_KEY` (git + fs + mocked materialize/agent).

Gates

`node --test` 570/570 · `pnpm lint:deps` pass · `pnpm lint:casts` pass (delta=0) · `pnpm typecheck` unchanged (pre-existing `target-postgres` failures only; `skills-contrib` is not in the typecheck graph).

Scope (deliberately out)

The k=N A/B loop, cross-run aggregation, dashboard, and CI regression gate (TML-2737); the env-pinned trace destination (recorded escape hatch); admitting `@cursor/sdk` into the lockfile + an actual live run (operator-gated).

Linear: TML-2755 (blocks TML-2737)
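The trace-selection rule under "The decision" (keep `*.jsonl` files whose first line validates, match by `orchestrator_agent_id`, else fall back to newest) could be sketched as below. This is an illustration under assumptions, not the code in `collect-run.ts`: `TraceCandidate`, `parseFirstEvent`, and `selectTraces` are hypothetical names, the validator is a stand-in for the real `Slice1TraceEvent` check (whose schema is not shown in this PR), and returning a single newest file on fallback is a guess.

```typescript
// Hypothetical view of one *.jsonl file found under runDir.
interface TraceCandidate {
  path: string;
  firstLine: string;
  mtimeMs: number;
}

interface FirstEvent {
  orchestrator_agent_id?: string;
}

// Stand-in for validating against Slice1TraceEvent: accepts any JSON object
// and extracts orchestrator_agent_id when it is a string.
function parseFirstEvent(line: string): FirstEvent | null {
  try {
    const parsed: unknown = JSON.parse(line);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      return null;
    }
    const id = (parsed as Record<string, unknown>)["orchestrator_agent_id"];
    return typeof id === "string" ? { orchestrator_agent_id: id } : {};
  } catch {
    return null; // not JSON, so not a trace file
  }
}

function selectTraces(candidates: TraceCandidate[], agentId?: string): string[] {
  // Keep only files whose first line passes validation.
  const valid: Array<{ candidate: TraceCandidate; event: FirstEvent }> = [];
  for (const candidate of candidates) {
    const event = parseFirstEvent(candidate.firstLine);
    if (event !== null) valid.push({ candidate, event });
  }
  // Prefer traces emitted by the orchestrator agent of this run.
  if (agentId !== undefined) {
    const matched = valid.filter((v) => v.event.orchestrator_agent_id === agentId);
    if (matched.length > 0) return matched.map((v) => v.candidate.path);
  }
  // Otherwise fall back to the most recently modified valid trace.
  let newest: TraceCandidate | undefined;
  for (const v of valid) {
    if (newest === undefined || v.candidate.mtimeMs > newest.mtimeMs) {
      newest = v.candidate;
    }
  }
  return newest === undefined ? [] : [newest.path];
}
```

Selecting from files after the run, rather than steering where the agent writes them, fits the stated goal of leaving `emit.ts` and the emission protocol untouched.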