
fix(brainstorming): ground recommendations in named comparison dimensions#1512

Draft
j7an wants to merge 1 commit into obra:dev from j7an:fix/brainstorming-grounded-recommendations

Conversation


j7an commented May 9, 2026

What problem are you trying to solve?

Issue #1266 describes a reproducible failure mode in the brainstorming skill: the first-pass recommendation is often surface-level rather than grounded in real analysis of the options, and the recommendation flips when the user asks "can you analyze these options in more depth?" The instability is the diagnostic — if the answer changes under deeper questioning, the original wasn't based on an in-depth comparison.

Concrete user experience the issue describes: initial recommendation X → user asks for detailed analysis → recommendation changes to Y, which is clearly better reasoned. Users who don't think to push back act on a recommendation the agent itself would revise under scrutiny.

This is distinct from already-tracked concerns; the nearby issues and PRs are covered under "What alternatives did you consider?" and "Existing PRs" below.

What does this PR change?

Three coordinated single-file edits to skills/brainstorming/SKILL.md:

  1. The "Exploring approaches" block is replaced with a stability-test framing ("if your recommendation would flip under deeper questioning, the comparison wasn't deep enough — that's the bug") plus four explicit comparison dimensions: What it assumes, Where it breaks down, What would rule it out, What evidence supports it. The recommendation must lead and be tied to that comparison.
  2. Checklist item Option/config to disable git workflow #4 is rephrased to match: "compare in detail, recommend with reasoning tied to the comparison."
  3. The "Explore alternatives" key principle is rephrased to "Always compare 2-3 approaches in detail before recommending" (em-dash also substituted; intentional, so future style cleanups don't undo it).

Diff stats: 1 file changed, 14 insertions, 5 deletions.

Is this change appropriate for the core library?

Yes. The brainstorming skill ships in core and is one of the project's most heavily-used behavior-shaping skills. The change is harness-neutral Markdown — no new tool calls, no third-party dependencies, no domain-specific or project-specific content. The four dimensions (assumptions, failure modes, disqualifying constraints, evidence) are the ones spelled out in issue #1266's "Proposed mechanism" and apply to brainstorming any kind of decision, not a specific stack or tool.

What alternatives did you consider?

From issue #1266 itself plus design exploration:

  1. Force users to manually ask for deeper analysis — works, but shifts the burden to the user and assumes they know to ask. The whole point of the skill is to do this without the user having to push back.
  2. Run a separate critique skill (#1013, "[Feature Request] Add critique skill for nine-dimension self-review of AI responses") after every brainstorm — post-hoc; the recommendation has already been presented and anchored. Doesn't address the first-pass instability.
  3. Option Zero check (#834, "Suggestion: brainstorming skill — add 'Option Zero' forcing question before proposing solutions") alone — solves the "simplest-option" case but not the general case of shallow comparison across real alternatives.
  4. Add new sections / a new sub-skill / a new process step — rejected as scope creep. The smallest wording change that addresses the observed bug is best per superpowers:writing-skills discipline.

The chosen approach (minimal, coordinated wording changes that name the failure mode and the four dimensions) is what survived comparison against all four alternatives.

Does this PR contain multiple unrelated changes?

No. One file, three edits, all in service of a single goal: making first-pass brainstorming recommendations stable under "can you analyze in more depth?" pushback. The checklist item and Key Principles bullet exist solely to keep voice and terminology aligned with the new "Exploring approaches" block — splitting them across separate PRs would leave the skill internally contradictory.

Existing PRs

  • I have reviewed all open AND closed PRs for duplicates or prior art.

Searched obra/superpowers open AND closed PRs for "brainstorming" (50 results). None duplicate this change. The substantively-near PRs:

  • #1168 OPEN — "Enhance brainstorming skill with Iron Law enforcement": a broader rewrite focused on enforcing what the skill calls the Iron Law. Different shape (whole-skill restructure) and different intervention point (gate-style enforcement, not depth-of-comparison wording). Not a duplicate.
  • #541 CLOSED — "Add proactive assumption challenging to brainstorming skill": tries to make the skill challenge the user's request itself, not each candidate option's grounding. Different intervention point. My approach is narrower and targets a specific named failure mode (recommendation flips under pushback) rather than challenging the user.
  • #386 OPEN — "feat(brainstorming): add research existing solutions step": adds a research-existing-solutions step. Different concern entirely (prior-art discovery vs depth of comparison among already-identified options).

Other open fix(brainstorming): PRs in the search results address distinct problems and are not duplicates: #1170 (section overview), #1169 (assumptions/unknowns at handoff), #1037 (visual companion gating), #1097 (worktree handoff), #829 (worktree creation step), #759 (auto-open visual companion), #632 (worktree-before-spec ordering).

One post-2026-05-09 PR matched the "brainstorming" search but is unrelated: #1507 ("Migrate superpowers to Pi platform with bilingual support and tests") — wholesale platform-migration PR, no overlap with the depth-of-comparison concern.

Related: #1266 (the issue this PR fixes). No prior PR addresses #1266.

Environment tested

| Harness (e.g. Claude Code, Cursor) | Harness version | Model | Model version/ID |
| --- | --- | --- | --- |
| Pending | Pending | Pending | Pending |

Adversarial eval sessions have not yet been run; see Evaluation and Rigor below.

New harness support (required if this PR adds a new harness)

N/A. This PR does not add support for a new harness. It modifies a skill that ships in core and is loaded by all currently-supported harnesses.

Clean-session transcript for "Let's make a react todo list"

Not applicable to this change (no new harness).

Evaluation

  • What was the initial prompt that led to this change? Issue #1266 itself ("brainstorming: recommendations should be grounded in evidence, not surface-level guesses") — the issue's reporter described a reproducible flip in the brainstorming recommendation under the follow-up "can you analyze these options in more depth?" The same prompt is the locked adversarial probe used by the planned eval design (see below).
  • How many eval sessions did you run AFTER making the change? Zero — adversarial pressure-testing evals are not yet run. The plan locks three scenario prompts (Trivial / Complex / Dominant) and the adversarial follow-up, designed to be run in 6 fresh Claude Code sessions (3 baseline against dev, 3 against this branch). Those sessions have not been completed at the time of this draft. This PR is intentionally a DRAFT until those evals are run and pasted in.
  • How did outcomes change compared to before the change? Cannot honestly claim outcomes yet — see above. The expected pattern (per the design): instability on dev for at least the Complex scenario, stability on this branch in all three. Trivial and Dominant scenarios may legitimately not flip on dev (small or one-sided decisions can be stable without depth) — those will be documented honestly rather than treated as failures.

The locked scenario prompts that will produce the eval evidence:

Trivial: I'm adding a new env var for per-request HTTP timeout. Should I name it REQUEST_TIMEOUT_MS, HTTP_TIMEOUT_MS, or API_TIMEOUT_MS?

Complex: I want to add background job processing to a small Flask app deployed on a single VM. Should I use Celery + Redis, Dramatiq + RabbitMQ, or apscheduler with a SQLite backend?

Dominant: For a brand-new internal tool's auth, should I roll my own session cookies, integrate Auth0, or use the company's existing SSO via SAML?

Adversarial follow-up (sent after each first-pass response): Can you analyze these options in more depth?
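
Once those 6 sessions are run, the plan is to paste results into a simple before/after table; the layout below is illustrative only and intentionally contains no data yet:

| Session | Branch | Scenario | First-pass recommendation | After "more depth?" follow-up | Flipped? |
| --- | --- | --- | --- | --- | --- |
| 1 | dev (baseline) | Trivial | pending | pending | pending |
| 2 | dev (baseline) | Complex | pending | pending | pending |
| 3 | dev (baseline) | Dominant | pending | pending | pending |
| 4 | this branch | Trivial | pending | pending | pending |
| 5 | this branch | Complex | pending | pending | pending |
| 6 | this branch | Dominant | pending | pending | pending |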

Rigor

  • If this is a skills change: I used superpowers:writing-skills and completed adversarial pressure testing (paste results below)
  • This change was tested adversarially, not just on the happy path
  • I did not modify carefully-tuned content (Red Flags table, rationalizations, "human partner" language) without extensive evals showing the change is an improvement.

The first two boxes are intentionally unchecked: both require completed eval sessions, which have not yet been run, and marking them without the data would be fabrication. The third box is honest: this PR does not touch any Red Flags table, rationalization list, or "human partner" wording — those carefully-tuned regions are left intact.

The superpowers:writing-skills skill was invoked during implementation; its discipline is what produced the minimal, scope-limited, three-edit change rather than a wider rewrite. What is still missing is the eval evidence that completes the rigor checklist.

This PR is a draft specifically so the maintainers do not have to triage an under-rigored skill change. Promotion to ready-for-review is gated on completing the 6 eval sessions and pasting before/after results into this body.

Human review

  • A human has reviewed the COMPLETE proposed diff before submission.

The complete diff is one file (skills/brainstorming/SKILL.md), 14 insertions / 5 deletions, three localized hunks. My human partner is reviewing the diff in the GitHub UI as part of opening this draft.

Fixes #1266.

fix(brainstorming): ground recommendations in named comparison dimensions

Replace the 3-bullet "Exploring approaches" block with a stability-test
framing plus four explicit dimensions (what it assumes, where it breaks
down, what would rule it out, what evidence supports it). Update
checklist item obra#4 and the "Explore alternatives" key principle to
match.

Refs obra#1266
