improve: deepen all four skill guides — coverage, output quality & trigger accuracy

hyhmrright · claude · hyhmrright · commit 8c0087b6ac4c · 2026-04-15T19:18:22.000+08:00
- decay-risks.md: promote Hyrum's Law and Orthogonality to named symptoms in R2
- pr-review-guide: add PR size calibration table, split Step 6 into 6a/6b,
  reframe severity block as tiebreaker referencing per-risk guides
- architecture-guide: add Step 5 Testability Seam Assessment (Feathers seam model),
  simplify Mermaid Phase A/B into linear instruction with explicit Phase B reminder,
  add Conway's Law calibration examples; renumber Conway's Law to Step 6
- debt-guide: add concrete Pain×Spread calibration examples, add Step 2b for
  intentional vs accidental debt classification (Cunningham definition),
  replace unmeasurable date-based criterion with observable payback-plan check
- test-guide: split Step 2 into 2a (Test Brittleness) + 2b (Mock Abuse) with
  merged single-pass sampling, add Characterization Test template (Feathers Ch.8),
  structure test performance guidance into three severity tiers
- all SKILL.md: rewrite trigger descriptions in natural language, add explicit
  DO NOT trigger guards (fixes brooks-debt false-triggering on HTTP health checks)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
diff --git a/skills/_shared/decay-risks.md b/skills/_shared/decay-risks.md
@@ -78,8 +78,13 @@ and multiplies regression risk on every change.
 - A method uses more data from another class than from its own class
 - Two classes know each other's internal state directly
 - Changing one module requires recompiling or retesting many unrelated modules
-- Any observable behavior (including internal implementation details) is depended upon
-  by callers, creating a de facto interface beyond the declared API
+- **Hyrum's Law**: with sufficient callers, every observable behavior — including
+  implementation details, error message text, coincidental call ordering, and undocumented
+  side effects — becomes an implicit contract that callers depend on, even though it was
+  never guaranteed by the declared API
+- **Orthogonality violation**: changing one dimension of a feature forces edits in
+  unrelated dimensions — adding a new payment type should not require touching logging,
+  caching, or notification code, but in a non-orthogonal design it does
 - Information Leakage: a design decision (e.g., a file format, protocol detail, or data
   shape) is encoded in more than one module, so changing it requires coordinated edits
   in multiple places even though only one module "owns" the concept
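The Hyrum's Law symptom added above can be illustrated with a minimal Python sketch. All names here (`InsufficientFunds`, `withdraw`, `shortfall_fragile`, `shortfall_robust`) are hypothetical and are not part of this repository. The sketch assumes a library whose declared contract is the exception type and its `shortfall` attribute, while the message wording is an implementation detail.

```python
class InsufficientFunds(Exception):
    """Declared contract: the exception type and its `shortfall` attribute."""

    def __init__(self, shortfall: int) -> None:
        # The message wording is an implementation detail, never guaranteed.
        super().__init__(f"insufficient funds: short by {shortfall}")
        self.shortfall = shortfall


def withdraw(balance: int, amount: int) -> int:
    """Return the new balance, or raise InsufficientFunds."""
    if amount > balance:
        raise InsufficientFunds(amount - balance)
    return balance - amount


def shortfall_fragile(balance: int, amount: int) -> int:
    """Hyrum's Law in action: this caller parses the error message text."""
    try:
        withdraw(balance, amount)
        return 0
    except InsufficientFunds as exc:
        # Stops working as soon as the message wording changes.
        return int(str(exc).rsplit(" ", 1)[-1])


def shortfall_robust(balance: int, amount: int) -> int:
    """Depends only on the declared attribute, so rewording is harmless."""
    try:
        withdraw(balance, amount)
        return 0
    except InsufficientFunds as exc:
        return exc.shortfall
```

Both callers agree today, but only `shortfall_fragile` turns the message format into an implicit contract: rewording the message would break it without any change to the declared API.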
diff --git a/skills/brooks-audit/SKILL.md b/skills/brooks-audit/SKILL.md
@@ -6,10 +6,15 @@ description: >
   Domain-Driven Design, A Philosophy of Software Design, Software Engineering at Google,
   xUnit Test Patterns, The Art of Unit Testing, Working Effectively with Legacy Code,
   and How Google Tests Software.
-  Triggers when: user asks to audit architecture, review module structure,
-  check system design, or assess project organization.
+  Triggers when: user asks to audit architecture, review module or folder structure,
+  check system design, understand how the codebase is organized, assess project layout,
+  or asks "is this a good design?", "where should I put X?", or "why does everything
+  depend on everything?".
   Also triggers when user mentions: clean architecture / dependency inversion /
-  hexagonal architecture / bounded contexts / module coupling / package structure.
+  hexagonal architecture / bounded contexts / module coupling / package structure /
+  tangled dependencies / circular imports / spaghetti code / directory layout.
+  Do NOT trigger for: PR-level code review (use brooks-review) or line-level refactoring
+  questions — this skill analyzes structural/module-level concerns, not individual functions.
   Use this skill proactively when project structure or module dependencies are discussed.
 ---
 
@@ -25,9 +30,10 @@ description: >
 ## Process
 
 1. Draw the module dependency graph as a Mermaid diagram (Step 1 of the guide)
-2. Scan for each decay risk in the order specified in the guide
-3. Assign node colors in the Mermaid diagram based on findings (red/yellow/green)
-4. Run the Conway's Law check
-5. Output using the Report Template from common.md — Mermaid graph FIRST, then Findings
+2. Scan for each decay risk in the order specified in the guide (Steps 2–4)
+3. Assign node colors in the Mermaid diagram based on findings (red/yellow/green) — do this after Step 4
+4. Run the Testability Seam Assessment (Step 5)
+5. Run the Conway's Law check (Step 6)
+6. Output using the Report Template from common.md — Mermaid graph FIRST, then Findings
 
 **Mode line in report:** `Architecture Audit`
diff --git a/skills/brooks-audit/architecture-guide.md b/skills/brooks-audit/architecture-guide.md
@@ -87,11 +87,11 @@ graph TD
   class Database,MessageQueue,AuthService,WebApp,MobileApp clean
 ````
 
-**Phase A (during Step 1):** Generate the graph structure only — nodes, subgraphs, and edges.
-Do NOT add `classDef` or `class` lines yet. You need findings from Steps 2-4 before coloring.
+Draw the graph structure first — nodes, subgraphs, and edges — without any `classDef` or
+`class` lines. You cannot assign colors until you have completed the risk scan in Steps 2–4.
 
-**Phase B (after Step 4):** Add `classDef` definitions and `class` assignments based on findings.
-The example above shows the final output after both phases.
+**After completing Step 4**, return to this graph and add the `classDef` and `class` lines
+based on findings. The example above shows the final colored output.
 
 Rules:
 1. **Nodes** — Use top-level directories or services as nodes, not individual files
@@ -142,7 +142,30 @@ Check each in turn:
 - Can the module responsibility of each module be stated in one sentence from its name alone?
 - Would a new developer know which module to add a new feature to?
 
-### Step 5: Conway's Law Check
+### Step 5: Testability Seam Assessment
+
+A *seam* is a place in the architecture where behavior can be altered without editing source
+code — typically an interface, a configuration point, or a dependency injection boundary.
+Seam density is a proxy for testability and evolvability.
+
+Scan for:
+- **No seam at the infrastructure boundary**: can you replace a real database, file system,
+  or HTTP client with a test double without editing the module under test? If not, the
+  architecture forces integration tests where unit tests would suffice.
+- **Seam collapse**: a module that was once testable in isolation has had its seams removed
+  (e.g., direct constructor instantiation replaced a dependency injection point, or a global
+  singleton replaced an injected collaborator).
+- **Missing seam in legacy areas**: modules without an obvious injection point or interface
+  boundary — any change requires touching the entire call stack to substitute behavior.
+
+If all modules have clear seams at their infrastructure boundaries → no finding.
+
+If seams are absent or collapsed: flag as 🟡 Warning with a Remedy pointing to the specific
+module and the injection point that needs to be restored or introduced.
+
+Source: Feathers — Working Effectively with Legacy Code, Ch. 4: The Seam Model
+
+### Step 6: Conway's Law Check
 
 After the six-risk scan, assess the relationship between architecture and team structure:
 
@@ -153,6 +176,14 @@ After the six-risk scan, assess the relationship between architecture and team s
 - A mismatch that is theoretical but not yet causing pain is 🟡 Warning.
 - If team structure is unknown, note this as context missing and skip the check.
 
+**Calibration examples:**
+- 🔴 Critical: the Payments module is owned by Team A but contains auth logic owned by Team B —
+  every Payments change requires a sync meeting with Team B
+- 🟡 Warning: two separate teams own the `utils/` and `helpers/` directories which do the same
+  things — theoretically painful but not yet causing release coordination issues
+- Not a finding: a single team owns a monorepo with multiple logical modules — Conway's Law
+  misalignment requires *separate teams* to be meaningful
+
 ---
 
 ## Applying the Iron Law
diff --git a/skills/brooks-debt/SKILL.md b/skills/brooks-debt/SKILL.md
@@ -6,10 +6,15 @@ description: >
   Domain-Driven Design, A Philosophy of Software Design, Software Engineering at Google,
   xUnit Test Patterns, The Art of Unit Testing, Working Effectively with Legacy Code,
   and How Google Tests Software.
-  Triggers when: user asks about tech debt, where to refactor, health check,
+  Triggers when: user asks about tech debt, where to refactor, what to clean up first,
+  codebase health (in the software quality sense — not server/HTTP health endpoints),
   or systemic maintainability questions.
-  Also triggers when user asks why the codebase is hard to maintain,
-  why adding developers isn't helping, or why complexity keeps growing.
+  Also triggers when user asks: why the codebase is hard to maintain, why it's a mess,
+  why adding developers isn't helping, why complexity keeps growing, what the worst part
+  of the codebase is, or where to start paying back debt.
+  Do NOT trigger for: server health checks, HTTP /health endpoints, infrastructure monitoring,
+  or questions about application uptime — "health check" in those contexts means something
+  different and this skill is not relevant.
   Use this skill proactively when maintainability or refactoring priorities are discussed.
 ---
 
diff --git a/skills/brooks-debt/debt-guide.md b/skills/brooks-debt/debt-guide.md
@@ -56,13 +56,19 @@ After listing all findings, score each one:
 
 **Pain score (1–3):** How much does this slow down development today?
 - 3: Developers actively avoid touching this area; it causes bugs on most changes
+  *(e.g., "nobody wants to touch the billing module because it always breaks something")*
 - 2: This area is noticeably slower to work in than the rest of the codebase
+  *(e.g., "adding a field takes 2–3x longer here than elsewhere")*
 - 1: This is a quality issue but not currently causing active pain
+  *(e.g., "inconsistent naming, but we always know what we mean")*
 
 **Spread score (1–3):** How many files, modules, or developers does this affect?
 - 3: Affects 5+ modules or all developers on the team
+  *(e.g., "every new feature touches the God class in core/")*
 - 2: Affects 2–4 modules or a subset of the team
+  *(e.g., "the auth and notification modules are tightly coupled")*
 - 1: Isolated to one module or one developer's area
+  *(e.g., "legacy parser that only one person maintains")*
 
 **Priority = Pain × Spread** (max 9)
 
@@ -72,6 +78,23 @@ After listing all findings, score each one:
 | 4–6 | Scheduled debt | Plan within quarter |
 | 1–3 | Monitored debt | Log and watch |
 
+### Step 2b: Classify Debt Intent
+
+After scoring, classify each finding as intentional or accidental:
+
+**Intentional debt** — a conscious shortcut taken to meet a deadline, with the expectation
+of paying it back. The team knows about it. It may be legitimate (a strategic prototype,
+a known temporary workaround during a migration).
+
+**Accidental debt** — degradation that accumulated without a deliberate decision: the team
+did not choose it and may not even know it exists. This is the kind Ward Cunningham's
+original definition warned against — not a tactical trade-off, but structural erosion.
+
+Mark each finding with `[intentional]` or `[accidental]` in the Debt Summary Table.
+Intentional debt with no visible payback plan — no linked ticket, no code comment, no
+documented decision — should be treated as accidental for prioritization purposes.
+Focus remediation energy on accidental debt first; intentional debt at least has an owner.
+
 ### Step 3: Group by Decay Risk
 
 Report findings grouped by risk type, not by file or module.
@@ -91,14 +114,14 @@ After the Findings section, append a Debt Summary Table:
 ```
 ## Debt Summary
 
-| Risk | Findings | Avg Priority | Dominant Classification |
-|------|----------|-------------|------------------------|
-| Cognitive Overload | N | X.X | Monitored / Scheduled / Critical |
-| Change Propagation | N | X.X | ... |
-| Knowledge Duplication | N | X.X | ... |
-| Accidental Complexity | N | X.X | ... |
-| Dependency Disorder | N | X.X | ... |
-| Domain Model Distortion | N | X.X | ... |
+| Risk | Findings | Avg Priority | Dominant Classification | Intent |
+|------|----------|-------------|------------------------|--------|
+| Cognitive Overload | N | X.X | Monitored / Scheduled / Critical | intentional / accidental |
+| Change Propagation | N | X.X | ... | ... |
+| Knowledge Duplication | N | X.X | ... | ... |
+| Accidental Complexity | N | X.X | ... | ... |
+| Dependency Disorder | N | X.X | ... | ... |
+| Domain Model Distortion | N | X.X | ... | ... |
 
 **Recommended focus:** [The one or two risks with the highest average priority — these are
 where investment will have the most impact.]
diff --git a/skills/brooks-review/SKILL.md b/skills/brooks-review/SKILL.md
@@ -7,11 +7,15 @@ description: >
   xUnit Test Patterns, The Art of Unit Testing, Working Effectively with Legacy Code,
   and How Google Tests Software.
   Triggers when: user asks to review code, check a PR, review a pull request,
-  or shares a diff for feedback.
-  Also triggers when user mentions: Brooks's Law / Mythical Man-Month / conceptual integrity /
-  second system effect / code smells / refactoring / clean architecture / DDD /
-  domain-driven design / SOLID principles / Hyrum's Law / deep modules / tactical programming.
-  Use this skill proactively whenever code, a diff, or a PR is shared for review.
+  shares a diff or pastes code inline asking "does this look right?" or "is this okay?",
+  or asks for feedback on a specific function, class, or file.
+  Also triggers when user mentions: code smells / refactoring / clean architecture /
+  DDD / domain-driven design / SOLID principles / Hyrum's Law / deep modules /
+  tactical programming / conceptual integrity / Brooks's Law / Mythical Man-Month /
+  second system effect.
+  Do NOT trigger for: questions about how to write code from scratch, language syntax
+  questions, or questions about tools and frameworks where no existing code is being reviewed.
+  Use this skill proactively whenever existing code, a diff, or a PR is shared for review.
 ---
 
 # Brooks-Lint — PR Review
diff --git a/skills/brooks-review/pr-review-guide.md b/skills/brooks-review/pr-review-guide.md
@@ -11,6 +11,16 @@ in the changed code. Every finding must follow the Iron Law: Symptom → Source
 ORM migrations, lock files, minified bundles), skip those files entirely. Generated code reflects
 tool choices, not developer decisions. Note in the report which files were skipped and why.
 
+**Scope calibration:** Adjust analysis depth based on PR size before starting.
+
+| PR Size | Approach |
+|---------|----------|
+| < 50 lines | Focus on Steps 1–3 only; run Step 6a only if imports changed; run Step 6b if any class, method, or variable was renamed or introduced |
+| 50–300 lines | Full process, all steps |
+| > 300 lines | Full process; note in the Scope line that review is sampled — cover the highest-risk areas rather than every file |
+
+For PRs > 500 lines: flag in the Summary that a PR this size is itself a Change Propagation signal. A change that cannot be reviewed in one pass suggests tangled responsibilities.
+
 ---
 
 ## Analysis Process
@@ -61,15 +71,20 @@ Look for:
 - Does this change add a class that only wraps another class or delegates everything?
 - Does this change add configuration options or extension points that serve no current requirement?
 
-### Step 6: Scan for Dependency Disorder and Domain Model Distortion
+### Step 6a: Scan for Dependency Disorder
 
-Look for Dependency Disorder:
 - Do any new imports create a dependency from a high-level module to a low-level one?
-- Do any new imports introduce a cycle?
+  (e.g., domain service now imports a database driver or HTTP client)
+- Do any new imports introduce a cycle between modules?
+- Does any new interface force callers to depend on methods they do not use?
+
+If no new imports and no structural changes → skip, no finding.
 
-Look for Domain Model Distortion:
-- Do new class or variable names match the language the business uses?
-- Does any new class hold only data with no behavior, where behavior was expected?
+### Step 6b: Scan for Domain Model Distortion
+
+- Do new class or variable names match the language the business uses for the same concept?
+- Does any new class hold only data with no behavior (pure data bag), where behavior was expected?
+- Does any new method put logic that belongs to the domain in a service or utility layer?
 
 ---
 
@@ -89,6 +104,16 @@ Do not write a finding that you cannot complete fully. If you can identify a sym
 cannot state a consequence, you have not understood the risk well enough — re-read
 `../_shared/decay-risks.md` for that risk before writing the finding.
 
+**Severity calibration:** Each risk in `../_shared/decay-risks.md` has its own Severity
+Guide with numeric thresholds — use those as the primary reference. When a finding sits
+on the boundary between two tiers, use this as a tiebreaker:
+- 🔴 Critical — actively breaking velocity or creating production risk *today*
+- 🟡 Warning — will break velocity or create production risk if left unaddressed through the next few features
+- 🟢 Suggestion — worth fixing when nearby, not urgent
+
+When multiple findings exist, list Critical items first. If there are more than 5 findings,
+add a one-line "Recommended fix order" at the end of the Findings section.
+
 ---
 
 ## Output
diff --git a/skills/brooks-test/SKILL.md b/skills/brooks-test/SKILL.md
@@ -4,10 +4,15 @@ description: >
   Test quality review drawing on twelve classic engineering books, with primary focus
   on xUnit Test Patterns, The Art of Unit Testing, How Google Tests Software,
   and Working Effectively with Legacy Code.
-  Triggers when: user asks about test quality, flaky tests, mock abuse,
-  test debt, legacy code testability, or shares test files for review.
+  Triggers when: user asks about test quality, shares test files for review,
+  or complains that tests keep breaking for no reason, tests are slow, tests are hard
+  to understand, test setup is complicated, or they can't tell what a test is testing.
   Also triggers when user mentions: test smells / characterization tests /
-  test pyramid / test doubles / over-mocking / brittle tests.
+  test pyramid / test doubles / over-mocking / brittle tests / flaky tests /
+  too many mocks / tests break on refactoring / slow test suite.
+  Do NOT trigger for: writing new tests from scratch (use the regular test-writing workflow)
+  or questions about testing frameworks and syntax — this skill reviews an existing test
+  suite for structural quality problems, not individual test authoring questions.
   Use this skill proactively whenever test files are shared for review.
 ---
 
diff --git a/skills/brooks-test/test-guide.md b/skills/brooks-test/test-guide.md