Merged
20 commits
cf34cef
Spec: lift drill into superpowers as evals/
May 6, 2026
cf5914a
Spec: address adversarial review findings
May 6, 2026
895bb73
Plan: lift drill into superpowers as evals/
May 6, 2026
3c046f5
Lift drill into evals/ at 013fcb8b7dbefd6d3fa4653493e5d2ec8e7f985b
May 6, 2026
b3817bb
evals: default SUPERPOWERS_ROOT to parent of evals/ if unset
May 6, 2026
dcffaa0
evals: drop SUPERPOWERS_ROOT from codex/gemini required_env
May 6, 2026
a94d2cc
evals: drop SUPERPOWERS_ROOT setup step from README/CLAUDE
May 6, 2026
3177c87
tests: remove skill-triggering bash prompts (covered by drill trigger…
May 6, 2026
6fe9cf7
tests: remove run-claude-describes-sdd.sh (covered by drill mid-conve…
May 6, 2026
d337f4a
tests: remove subagent-driven-dev fixtures (covered by drill sdd-go-f…
May 6, 2026
dc62552
tests: remove test-document-review-system.sh (covered by drill spec-r…
May 6, 2026
051bff6
tests: remove test-requesting-code-review.sh (covered by drill code-r…
May 6, 2026
11d5db1
tests: annotate three kept bash tests with drill coverage notes
May 6, 2026
b43d14f
docs: annotate dated artifacts referencing lifted bash tests
May 6, 2026
d545612
docs: introduce evals/ as the canonical skill-behavior eval harness
May 6, 2026
e4191c3
Address adversarial review findings
May 6, 2026
af465f9
evals: remove unreleased wave scenarios
arittr May 6, 2026
2d4cdea
evals: drop drill source marker
arittr May 6, 2026
ec9b96a
evals: add Gemini 2.5 Flash backend
arittr May 6, 2026
bad4708
evals: use pre-commit hooks
arittr May 6, 2026
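Commit b3817bb describes defaulting `SUPERPOWERS_ROOT` to the parent of `evals/` when the variable is unset. That commit's diff is not shown in this view, so the following is only a minimal sketch of what such a fallback could look like; the function name `resolve_superpowers_root` and its parameters are hypothetical, not taken from the harness.

```python
import os
from pathlib import Path


def resolve_superpowers_root(evals_dir, env=None):
    """Hypothetical helper: pick the repo root the eval harness should use.

    Assumption: an explicitly set SUPERPOWERS_ROOT wins; otherwise fall back
    to the directory that contains evals/.
    """
    env = os.environ if env is None else env
    configured = env.get("SUPERPOWERS_ROOT")
    if configured:
        # Respect an explicit override from the environment.
        return Path(configured)
    # Unset or empty: default to the parent of the evals/ directory.
    return Path(evals_dir).resolve().parent


if __name__ == "__main__":
    # With no override, the parent of the given evals/ directory is returned.
    print(resolve_superpowers_root("evals", env={}))
```

A fallback of this kind would be consistent with the follow-up commits dcffaa0 and a94d2cc, which drop the variable from `required_env` and from the setup docs.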
6 changes: 6 additions & 0 deletions .gitignore
@@ -5,3 +5,9 @@
node_modules/
inspo
triage/

# Eval harness — drill ships its own gitignore at evals/.gitignore;
# these are belt-and-suspenders entries for tools that don't recurse.
evals/results/
evals/.venv/
evals/.env
21 changes: 21 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,21 @@
repos:
  - repo: local
    hooks:
      - id: evals-ruff-check
        name: evals ruff check
        entry: uv --project evals run ruff check
        language: system
        files: ^evals/.*\.py$

      - id: evals-ruff-format-check
        name: evals ruff format --check
        entry: uv --project evals run ruff format --check
        language: system
        files: ^evals/.*\.py$

      - id: evals-ty-check
        name: evals ty check
        entry: uv --directory evals run ty check
        language: system
        pass_filenames: false
        files: ^evals/.*\.py$
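All three hooks share the same `files` pattern, so they only fire for Python sources under `evals/`. pre-commit documents `files` as a Python regular expression matched with `re.search`; the sketch below reproduces just that filtering step to show which paths would qualify. The sample paths are made up for illustration, and real pre-commit applies further filters (such as `exclude` and `types`) that are not modeled here.

```python
import re

# Pattern copied from the `files:` key of the hooks above.
FILES_PATTERN = re.compile(r"^evals/.*\.py$")


def select_hook_files(staged_paths):
    """Return the paths that match the hooks' `files` pattern."""
    return [path for path in staged_paths if FILES_PATTERN.search(path)]


if __name__ == "__main__":
    # Hypothetical staged files; only the first is a .py file under evals/.
    staged = [
        "evals/drill/cli.py",
        "evals/README.md",
        "tests/run.py",
        "evals/scenarios/a.yaml",
    ]
    print(select_hook_files(staged))  # prints ['evals/drill/cli.py']
```

Because `evals-ty-check` sets `pass_filenames: false`, a match only triggers that hook; the matched paths are not appended to its command line, unlike the two ruff hooks.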
4 changes: 4 additions & 0 deletions CLAUDE.md
@@ -94,6 +94,10 @@ Skills are not prose — they are code that shapes agent behavior. If you modify
- Show before/after eval results in your PR
- Do not modify carefully-tuned content (Red Flags tables, rationalization lists, "human partner" language) without evidence the change is an improvement

## Eval harness

Skill-behavior evals live at `evals/` — see `evals/README.md`. Drill (the harness) drives real tmux sessions of Claude Code / Codex / Gemini CLI and judges skill compliance with an LLM verifier. Plugin-infrastructure tests still live at `tests/`.

## Understand the Project Before Contributing

Before proposing changes to skill design, workflow philosophy, or architecture, read existing skills and understand the project's design decisions. Superpowers has its own tested philosophy about skill design, agent behavior shaping, and terminology (e.g., "your human partner" is deliberate, not interchangeable with "the user"). Changes that rewrite the project's voice or restructure its approach without understanding why it exists will be rejected.
2 changes: 2 additions & 0 deletions README.md
@@ -214,6 +214,8 @@ The general contribution process for Superpowers is below. Keep in mind that we
4. Follow the `writing-skills` skill for creating and testing new and modified skills
5. Submit a PR, being sure to fill in the pull request template.

Skill-behavior tests use the eval harness at `evals/`. See `evals/README.md` for setup. Plugin-infrastructure tests live at `tests/` and run via the relevant `run-*.sh` or `npm test`.

See `skills/writing-skills/SKILL.md` for the complete guide.

## Updating
2 changes: 2 additions & 0 deletions RELEASE-NOTES.md
@@ -50,6 +50,8 @@ New `sync-to-codex-plugin` script mirrors superpowers into the OpenAI Codex plug
- **Single source of truth** — the persona/checklist that previously lived in both `agents/code-reviewer.md` and the skill's placeholder template (and drifted independently) is now one file.
- **`subagent-driven-development` follows suit** — its `code-quality-reviewer-prompt.md` now dispatches `Task (general-purpose)` instead of the named agent.
- **Behavioral test added** — `tests/claude-code/test-requesting-code-review.sh` plants real bugs (SQL injection, plaintext password handling, credential logging) into a tiny project and asserts the dispatched reviewer flags every planted issue at Critical/Important severity and refuses to approve the diff.

> Note: `tests/claude-code/test-requesting-code-review.sh` and `tests/claude-code/test-document-review-system.sh` (mentioned later in this document) were lifted into drill scenarios on 2026-05-06 and removed from `tests/`. See `evals/scenarios/code-review-catches-planted-bugs.yaml` and `evals/scenarios/spec-reviewer-catches-planted-flaws.yaml`. The references above and below are preserved as dated artifacts of the work this section describes.
- **Codex and Copilot workaround docs trimmed** — the "Named agent dispatch" sections in `references/codex-tools.md` and `references/copilot-tools.md` documented how to flatten a named agent into a generic dispatch. With no named agents shipping, the workaround is unnecessary; both sections were dropped.

### Subagent-Driven Development
2 changes: 2 additions & 0 deletions docs/superpowers/plans/2026-03-23-codex-app-compatibility.md
@@ -555,6 +555,8 @@ Should show exactly 6 files changed (5 skill files + 1 test file). No other file
If test runner exists:
```bash
# Run skill-triggering tests
# Note: tests/skill-triggering/ was lifted into drill scenarios on 2026-05-06.
# See evals/scenarios/triggering-*.yaml. The reference below is a dated artifact.
./tests/skill-triggering/run-all.sh 2>/dev/null || echo "Skill triggering tests not available in this environment"

# Run SDD integration test