Lift drill into superpowers as evals/ harness by obra · Pull Request #1488 · obra/superpowers

obra · 2026-05-06T20:51:40Z

What problem are you trying to solve?

Drill — the standalone Python skill-compliance benchmark at obra/drill — is already the de facto eval harness for superpowers. The PRI-1397 commit series lifted ~22 bash tests into drill scenarios, and the most recent superpowers commit on dev (a2292c5) explicitly removed a redundant bash test with the message "replaced by drill behavioral coverage". Drill is a sibling repo today, requiring contributors to clone two checkouts and set SUPERPOWERS_ROOT manually. This PR completes the migration: drill becomes superpowers/evals/.

What does this PR change?

Lifts the obra/drill repo (at SHA 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b) into superpowers as evals/, with explicit rsync excludes (.git, .venv, results, .env, __pycache__, *.egg-info, .private-journal).
Adds a _set_superpowers_root_default() helper to drill/cli.py so SUPERPOWERS_ROOT defaults to the parent of evals/ — no manual env-var setup.
Drops SUPERPOWERS_ROOT from required_env in codex.yaml/gemini.yaml (the helper supplies it). claude*.yaml keep it because they interpolate ${SUPERPOWERS_ROOT} into --plugin-dir args.
Deletes redundant bash tests under tests/skill-triggering/, tests/explicit-skill-requests/, tests/subagent-driven-dev/, and tests/claude-code/ — gated per-file by a subagent that compared each bash test's assertions to its drill scenario's verify block. Anything not 100% covered was kept and annotated (3 tests retained: test-worktree-native-preference.sh, test-subagent-driven-development-integration.sh, test-subagent-driven-development.sh).
docs/testing.md split into Plugin tests + Skill behavior evals.
README.md Contributing section and CLAUDE.md gain pointers to evals/.

Is this change appropriate for the core library?

Yes. Cross-runtime evaluation is core to superpowers, the migration to drill scenarios was already underway in this repo, and the eval harness needs to live in-tree to be discoverable.

What alternatives did you consider?

Vendored copy + sync script (drill repo continues independently). Rejected: divergence risk; single-source-of-truth wins.
git subtree merge (preserves drill history in-tree). Rejected: superpowers' git history grows by 50+ commits, the merge commit is ugly, subtrees are operationally heavy.
Keep drill as a sibling repo and just polish docs. Rejected: doesn't solve the discoverability problem.

Does this PR contain multiple unrelated changes?

No — every change supports "drill is now evals/ inside superpowers". Multiple commits for atomicity (verbatim copy, env helper, YAML updates, per-file deletion gates, docs) but one direction.

Existing PRs

I have reviewed all open AND closed PRs for duplicates or prior art
Related PRs: Cross-platform skill compatibility: agent-neutral prose, source-verified per-runtime tool refs #1486 (cross-platform skill rewrites — independent; no shared file changes besides README, which has no overlap)

Environment tested

Harness	Version	Model	Model ID
Claude Code	local install	Opus 4.7 (1M)	claude-opus-4-7

Drill's own pytest suite (122 tests) passes from the new location with SUPERPOWERS_ROOT unset (one upstream-drill flaky tmux test excluded; pre-existing in the lifted SHA).
triggering-test-driven-development drill scenario passes from evals/ after the path-default changes.
tests/brainstorm-server/server.test.js (regression gate for non-LLM tests): 25/25 pass.
Larger drill sweep deferred to release-cadence runs per the spec's deferred-CI policy.

Evaluation

Initial prompt: see linked spec at docs/superpowers/specs/2026-05-06-lift-drill-into-evals-design.md.
Drill's own pytest suite passes.
One drill scenario re-run from the new location end-to-end (proves the SUPERPOWERS_ROOT default works).
Per-deleted-file subagent verification recorded in each deletion commit's message.
Two-reviewer adversarial pass after implementation; four legitimate findings addressed in e4191c3:
- uv sync --dev → uv sync --extra dev (drill's pyproject uses [project.optional-dependencies])
- Stale test-requesting-code-review.sh reference in tests/claude-code/run-skill-tests.sh
- Stale test-requesting-code-review.sh section in tests/claude-code/README.md
- Removed unsupported "Copilot CLI" claim from docs/testing.md and CLAUDE.md (no copilot.yaml backend exists)

Rigor

If this is a skills change: this is not a skills change; it's a tooling/infrastructure migration. No behavior-shaping content modified.
Adversarial pressure-tested: two parallel reviewers on the spec; final adversarial pre-PR review on the implementation; spec already corrected for findings before implementation began.
Did not modify carefully-tuned content.

Human review

A human has reviewed the COMPLETE proposed diff before submission

Action items after merge

Archive obra/drill on GitHub (mark read-only, add README pointer to obra/superpowers/evals/).
The spec lists CI integration, scenario co-location with skills, and Python package rename as deferred work. Open issues for any of these you want tracked.

Records scope, branching, architecture, deletion gate, verification protocol, path/config edits, migration ordering, and post-implementation verification. Frames CI integration, scenario co-location, and Python package rename as deferred work. Per-file deletion of bash tests under superpowers/tests/ is gated by a subagent that compares each bash assertion to its drill scenario's verify block. Default keeps the bash test if any assertion is unmatched. Branching: independent off dev (f/evals-lift), not stacked on f/cross-platform.
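The gate's decision rule can be sketched as below. This is a minimal illustration, not the verification subagent itself: the assertion-to-scenario comparison in this PR was done by a subagent reading each bash test and its drill scenario, and the names `GateDecision` and `deletion_gate` are hypothetical. The sketch only encodes the stated default, "keep the bash test if any assertion is unmatched".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    action: str                  # "delete" or "keep"
    unmatched: tuple[str, ...]   # bash assertions with no drill counterpart


def deletion_gate(bash_assertions: list[str], covered_by_drill: set[str]) -> GateDecision:
    """Allow deletion only when every bash assertion is covered by the drill scenario."""
    unmatched = tuple(a for a in bash_assertions if a not in covered_by_drill)
    # Default is conservative: a single unmatched assertion keeps the bash test.
    # A test with no assertions has nothing left uncovered, so it may be deleted.
    action = "keep" if unmatched else "delete"
    return GateDecision(action=action, unmatched=unmatched)
```

For example, a bash test asserting both "skill invoked" and ">=3 commits" against a scenario that only covers "skill invoked" comes back as `keep`, with `(">=3 commits",)` as the recorded gap.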

Two parallel reviewers raised legitimate issues against the lift-drill-into-evals spec. Updates:
- Coverage map for tests/explicit-skill-requests/ corrected: 6 run-*.sh scripts + prompts, not "2 scenarios cover all". Several scripts (Haiku, multi-turn, please-use-brainstorming, use-systematic-debugging) have no drill counterpart and stay.
- tests/claude-code/test-subagent-driven-development.sh marked as meta/documentation test (asks agent to describe SDD); no drill scenario covers description tests; defaults to keep.
- Path-defaults section now shows verified evidence: PROJECT_ROOT resolves to evals/ post-move; only claude*.yaml substitute ${SUPERPOWERS_ROOT} in args (codex/gemini use it via os.environ in pre-run hooks); helper invocation order specified (after load_dotenv, before click definitions).
- Step 2 copy uses explicit rsync excludes (.git, .venv, results, .env, __pycache__, *.egg-info, .private-journal); checksum-level verification rather than file-count.
- Drill SHA recorded at copy time in commit message and evals/.drill-source-sha for divergence detection.
- evals/tests/ pytest suite added to verification protocol.
- Reference scrub list expanded: RELEASE-NOTES.md, docs/superpowers/plans/, .codex-plugin/ (corrected from .codex/), lefthook.yml. Excluded dirs called out (node_modules/, .venv/, evals/).
- Historical plan docs / RELEASE-NOTES handling: annotate, don't rewrite.
- evals/lefthook.yml move documented (drill ships its own; contributors run cd evals && lefthook run pre-commit manually).
- PR description checklist includes archival action item for obra/drill post-merge.
False finding rejected: svelte-todo fixture is complete on disk (design.md + plan.md + scaffold.sh present); reviewer #1 #3 dropped.

15-task implementation plan derived from the design spec at docs/superpowers/specs/2026-05-06-lift-drill-into-evals-design.md. Each task is bite-sized (2-5 min steps) with exact commands, exact file paths, and exact code where required. Subagent verification gates per the spec are written out as concrete prompt templates. Self-review:
- Spec coverage: every spec section maps to a task
- Placeholder scan: no TBD/TODO/placeholder/fill-in-later language
- Type consistency: helper named _set_superpowers_root_default consistently; drill SHA recorded in evals/.drill-source-sha consistently

rsync of obra/drill@013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b into superpowers/evals/, excluding .git/, .venv/, results/, .env/, __pycache__/, *.egg-info/, .private-journal/. The drill repo is unaffected by this commit; archival is a separate manual step after this PR merges. Source SHA recorded at evals/.drill-source-sha for divergence detection.
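The spec calls for checksum-level verification of the copy rather than a file count. One way that check might look is sketched below; it is an assumption about the approach, not the script used in this PR. `tree_digests` and `diff_trees` are hypothetical names, and the exclude matching only approximates rsync's `--exclude` semantics by testing each path component against the patterns.

```python
from __future__ import annotations

import fnmatch
import hashlib
from pathlib import Path

# Same exclude list as the rsync step in this commit.
EXCLUDES = (".git", ".venv", "results", ".env", "__pycache__", "*.egg-info", ".private-journal")


def _excluded(rel: Path) -> bool:
    # Approximation: a path is excluded if any of its components matches a pattern.
    return any(fnmatch.fnmatch(part, pat) for part in rel.parts for pat in EXCLUDES)


def tree_digests(root: Path) -> dict[str, str]:
    """Map each non-excluded file (relative POSIX path) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if path.is_file() and not _excluded(rel):
            digests[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_trees(src: Path, dst: Path) -> dict[str, list[str]]:
    """Report files missing from dst, extra in dst, or differing in content."""
    a, b = tree_digests(src), tree_digests(dst)
    return {
        "missing": sorted(a.keys() - b.keys()),
        "extra": sorted(b.keys() - a.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

A clean copy yields three empty lists. Unlike a file count, this also catches a file whose content was altered in transit while the counts still match.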

Adds _set_superpowers_root_default() to drill/cli.py, called at module import after load_dotenv(). PROJECT_ROOT resolves to evals/ post-lift; its parent is the superpowers repo root, which is the correct value for SUPERPOWERS_ROOT. Existing env values are respected as overrides via os.environ.setdefault. Tests:
- helper sets default when var is unset
- helper does not override when var is already set

These backends only read SUPERPOWERS_ROOT via engine.py/setup.py's os.environ access, which the new cli.py default helper supplies automatically. claude*.yaml keep SUPERPOWERS_ROOT in required_env because they interpolate ${SUPERPOWERS_ROOT} into --plugin-dir args.
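The distinction between the two groups of backends can be illustrated as follows. This is a sketch under assumptions: the functions `missing_env` and `interpolate_args` are hypothetical, the argument list is an invented example of a `--plugin-dir` entry, and drill's real config loader may validate and substitute differently.

```python
from collections.abc import Mapping
from string import Template


def missing_env(required_env: list[str], environ: Mapping[str, str]) -> list[str]:
    """Names listed in required_env that are unset or empty in environ."""
    return [name for name in required_env if not environ.get(name)]


def interpolate_args(args: list[str], environ: Mapping[str, str]) -> list[str]:
    """Expand ${VAR} references in backend args; raises KeyError if a VAR is absent."""
    return [Template(arg).substitute(environ) for arg in args]
```

A backend whose args contain `${SUPERPOWERS_ROOT}` fails at substitution time if the variable is absent, so declaring it in required_env gives an earlier, clearer error. A backend that only reads the variable through os.environ after the cli.py default has run has nothing to declare.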

The cli.py helper now defaults the env var. Mention as override only.

…ing-* scenarios) Subagent verification confirmed each prompt's intent matches its corresponding drill scenario's turns[].intent verbatim, and each scenario has both a deterministic skill-called assertion and a semantic LLM criterion confirming the matching skill was loaded (actually a stronger check than the bash test, which only confirms the skill fires anywhere in the stream). All 6 prompts deleted. The runner had no remaining prompts to drive, so run-test.sh and run-all.sh deleted as well.

…rsation-skill-invocation) Subagent verification: every bash assertion (Skill tool invoked + specific skill name 'subagent-driven-development' loaded after the agent describes it conversationally in turn 1) maps to the drill scenario's skill-called assertion + criteria paragraph requiring the skill to fire in direct response to the second user message. Drill additionally asserts tool-called Agent (subagent dispatch) which is stricter than the bash test. Other runners in tests/explicit-skill-requests/ (haiku, multiturn, extended-multiturn) and their prompt files are preserved — they have no drill coverage and exercise different behaviors.

…ractals + sdd-svelte-todo) The bash test had ZERO output assertions; it just ran claude -p and printed token usage. Drill's scenarios are strictly more rigorous:
- go-fractals: skill-called SDD + tool-called Agent + go test ./... passes + cmd/fractals/main.go exists + >=4 commits + LLM criteria verifying real SDD workflow.
- svelte-todo: skill-called SDD + tool-called Agent + npm test passes + playwright e2e passes + package.json + svelte.config.js or vite.config.ts + >=4 commits + LLM criteria.
design.md and plan.md are byte-identical between bash fixtures and drill fixtures (evals/fixtures/sdd-{go-fractals,svelte-todo}/). Drill's setup helper (scaffold_sdd_*) forces git init -b main (stricter than bash's reliance on init.defaultBranch). The .claude/settings.local.json from bash scaffold.sh is unnecessary for drill since permissions are managed via backend YAML. Subagent verification: SAFE TO DELETE for both.

…eviewer-catches-planted-flaws) Subagent verification: every bash assertion (TODO in Requirements section flagged, "specified later" deferral flagged, Issues section present, did-not-approve verdict) maps to drill verify.criteria entries. Setup parity covered by setup.assertions (test-feature-design.md exists with TODO + 'specified later' content). Drill is stricter: asserts tool-called Agent (subagent dispatch) which the bash test did not check.

…eview-catches-planted-bugs) Subagent verification: every bash assertion (skill invocation, subagent dispatch, SQL injection flagged, credential handling flagged, no merge approval) maps to drill verify checks. Drill is stricter: bundles severity (Critical/Important) into the same criteria as the finding itself (bash split severity into a separate test). Setup parity covered (src/db.js with string concat + identity hash, two commits). The drill scenario header explicitly says it is the "cross-harness, semantically-judged replacement for the bash test."

- test-worktree-native-preference.sh: drill covers PRESSURE phase only; RED + GREEN baselines have no drill counterpart and are kept so the RED-GREEN-REFACTOR validation remains rerunnable end-to-end.
- test-subagent-driven-development-integration.sh: drill covers the YAGNI subset (forbidden exports + reviewer-as-gate). Bash adds >=3 commits, >=2 subagent dispatches, TodoWrite usage, test file existence check, and token-budget telemetry. Kept until drill scenario covers those or they are retired.
- test-subagent-driven-development.sh: tests agent's ability to *describe* SDD (string matches against expected keywords). Drill scenarios test behavior, not description-recall. Kept by design.
Subagent verification recorded in commit messages of subsequent deletions; gap analyses driving these annotations are also in the verification subagent reports for the gating sweep.

- RELEASE-NOTES.md: note that test-requesting-code-review.sh and test-document-review-system.sh were lifted into drill scenarios on 2026-05-06; references are preserved as dated artifacts.
- docs/superpowers/plans/2026-03-23-codex-app-compatibility.md: note that tests/skill-triggering/ was lifted into drill scenarios on 2026-05-06; the run-all.sh reference is a dated artifact.
Subagent second-pass scrub confirmed no other active references in the tree (excluding evals/ and the spec/plan for this work itself).

- docs/testing.md split into Plugin tests + Skill behavior evals. Plugin tests section enumerates the bash tests that survive (kept by drill-coverage analysis or as describe-skill tests).
- CLAUDE.md adds Eval harness section pointing at evals/.
- README.md Contributing section mentions evals/ alongside tests/.
- .gitignore adds evals/{results,.venv,.env} as belt-and-suspenders (evals/.gitignore covers these locally; root-level entries help tooling that does not recurse into nested ignore files).

- evals/README.md, evals/CLAUDE.md: fix uv install command from 'uv sync --dev' to 'uv sync --extra dev'. Drill's pyproject.toml uses [project.optional-dependencies], so --dev is a no-op for pytest/ruff/ty; --extra dev is the correct invocation.
- tests/claude-code/run-skill-tests.sh: drop test-requesting-code-review.sh from integration_tests array (file deleted earlier in this branch).
- tests/claude-code/README.md: replace test-requesting-code-review.sh section with test-worktree-native-preference.sh (the worktree test is kept; the code-review test was lifted into drill).
- docs/testing.md, CLAUDE.md: remove "Copilot CLI" from the harness list. evals/backends/ has claude*, codex, gemini configs but no copilot.yaml, so the claim was unsupported.
Adversarial review credit: reviewer #2 found four legitimate issues (uv-sync, run-skill-tests stale ref, README stale ref via #1, and Copilot CLI fabrication); reviewer #1 found two distinct issues (run-skill-tests + tests/claude-code/README.md). Reviewer #2 wins this round.

arittr

LGTM

Jesse Vincent added 16 commits May 6, 2026 11:54

evals: drop SUPERPOWERS_ROOT setup step from README/CLAUDE

a94d2cc

The cli.py helper now defaults the env var. Mention as override only.

obra requested a review from arittr May 6, 2026 20:51

arittr added 4 commits May 6, 2026 14:43

evals: remove unreleased wave scenarios

af465f9

evals: drop drill source marker

2d4cdea

evals: add Gemini 2.5 Flash backend

ec9b96a

evals: use pre-commit hooks

bad4708

arittr approved these changes May 6, 2026

arittr merged commit 7f02ccd into dev May 6, 2026

arittr merged 20 commits into dev from f/evals-lift
