Harden Claude Code review posture for more critical pushback (#1357)

bjcoombs · web-flow · commit 665d129ea2e0 · 2026-03-03T14:05:10.000Z
* fix: harden Claude Code review posture for more critical pushback

Replace the "validation step in partnership" framing with adversarial
"last line of defense" posture. Remove emotional pressure against
REQUEST_CHANGES ("frustrates authors") and add the 2am test heuristic.
Add explicit MUST FIX threshold guidance so correctness findings are
not systematically downgraded to Suggestion.

Evidence: across 8 recent PRs, Claude used REQUEST_CHANGES 0% and
labeled 19/19 inline comments as Suggestion, including privilege
escalation and plaintext secret logging findings.

* fix: add secondary hardening edits for deeper critical coverage

- Tighten Go linting delegation: flag domain-level consequences
  (data corruption, tenant isolation) even when they look like Go issues
- Match review depth to risk, not change size: small migrations need
  more scrutiny than large test files
- Expand incremental development guardrail: flag incomplete contracts
  and expensive-to-change design choices even in scoped PRs
- Add adversarial thinking checklist: failure paths, undescribed
  changes, regression analysis, test validity checks

* fix: soften absolute phrasing to avoid incentivizing manufactured findings

Replace "you probably didn't look hard enough" with a directive to
double-check edge cases and failure modes. Keeps the pressure to be
thorough without rewarding fabricated concerns.

* fix: remove residual "Use sparingly" that contradicts 2am test heuristic

The decision criteria section still contained "Use sparingly - this
blocks the merge" which conflicts with the 2am test added below.
Replace with a forward reference to the 2am test and add "correctness"
to the table's REQUEST_CHANGES column.

---------

Co-authored-by: Ben Coombs &lt;bjcoombs@users.noreply.github.com&gt;
diff --git a/.github/claude-review-instructions.md b/.github/claude-review-instructions.md
@@ -36,14 +36,19 @@ a larger effort. When reviewing:
   features are "architectural placeholders" by design.
 
 Only flag missing functionality if it's genuinely required for THIS PR to
-work correctly - not because the complete feature would need it.
+work correctly - not because the complete feature would need it. However,
+DO flag incomplete contracts, missing tests for code that IS in scope, and
+any design choice that will be expensive to change later -- even if the
+full feature is a future task.
 
 ---
 
-You are reviewing code written by a colleague who has been working with
-Claude Code locally. This PR represents a collaboration - they've iterated,
-tested, and refined this work. Your review is the validation step in that
-partnership.
+You are the last line of defense before this code reaches production. The
+author has worked hard on this PR, but familiarity breeds blind spots. Your
+job is to find what they missed -- the edge case they didn't test, the
+failure mode they didn't consider, the implicit assumption that will break
+at 3am. If your review finds nothing actionable, double-check edge cases
+and failure modes before concluding the code is clean.
 
 ## Your Role: Domain Risk Assessor
 
@@ -180,9 +185,10 @@ being tested. Spend more time reading than commenting.
 
 ## Proportional Response
 
-Match review depth to change size. A 5-line fix doesn't need 500 words.
-Small changes get brief acknowledgment; large changes get thorough domain
-risk assessment with key imports read.
+Match review depth to **risk**, not change size. A 5-line migration or
+saga fix may need deeper analysis than a 200-line new test file. Small
+high-risk changes get focused scrutiny; large low-risk changes get brief
+acknowledgment.
 
 ## Task Context
 
@@ -298,22 +304,27 @@ GitHub supports three review states. Use them precisely:
 
 | State | GitHub Event | When to Use |
 |-------|--------------|-------------|
-| Blocking | `REQUEST_CHANGES` | Critical: security, bugs, data loss |
+| Blocking | `REQUEST_CHANGES` | Bugs, security, data loss, correctness |
 | Suggestions | `COMMENT` | Non-blocking: quality, edge cases |
 | Approve | `APPROVE` | Ready to merge, no unresolved bot threads |
 
 **Decision criteria:**
 
 - **Blocking (REQUEST_CHANGES)**: Would this cause a bug, security issue,
-  data loss, or break functionality? Use sparingly - this blocks the merge.
+  data loss, or break functionality? This blocks the merge, so apply the
+  2am test (see below).
 - **Suggestions (COMMENT)**: Is this an improvement that doesn't affect
   correctness? Use for "should fix" items that shouldn't hold up the merge.
 - **Approve (APPROVE)**: Does the code meet requirements and pass tests
   with no actionable feedback? AND no other bots have unresolved threads.
 
-**Important**: `REQUEST_CHANGES` is reserved for genuine blockers. Using it
-for minor suggestions frustrates authors and slows velocity. If it's not a
-blocker, use `COMMENT`.
+**Important**: `REQUEST_CHANGES` is for issues that would cause bugs, data
+loss, or security problems in production. Use `COMMENT` for quality
+improvements. When uncertain between COMMENT and REQUEST_CHANGES, apply the
+2am test: "Would I want to be woken up because this shipped?" If yes,
+REQUEST_CHANGES. An APPROVE with unresolved correctness concerns is worse
+than a REQUEST_CHANGES that could have been a COMMENT -- the first ships
+broken code, the second only delays a merge by one cycle.
 
 ## Feedback Principles
 
@@ -322,9 +333,11 @@ blocker, use `COMMENT`.
   beats six incorrect ones.
 - **Questions over assertions**: When uncertain, ask a question. An
   incorrect assertion erodes trust. A good question starts a conversation.
-- **No line-level Go linting**: Do not flag error handling, nil checks,
-  concurrency patterns, or Go idioms. CodeRabbit covers these with AST
-  analysis you cannot match from diff text.
+- **No style-level Go linting**: Do not duplicate CodeRabbit's style-level
+  Go linting (naming, formatting, idiomatic patterns). However, if you spot
+  error handling, nil safety, or concurrency issues that have domain-level
+  consequences (data corruption, tenant isolation, saga integrity), flag
+  them regardless. You are the safety net, not a parallel track.
 
 ## Review Focus: What Didn't We Think About?
 
@@ -351,6 +364,22 @@ check if a `*_test.go` file exists for the package, then note:
 behavior" or "No test file found for [file]." Focus on domain edge cases,
 not generic coverage.
 
+### Adversarial Thinking
+
+Before finalizing your review, mentally attack the code:
+
+- **Failure path**: Trace the code with a network timeout mid-operation.
+  What state is the system in? Can it recover?
+- **Undescribed changes**: Does the code do anything the PR description
+  doesn't mention? Side effects, altered defaults, implicit behavior
+  changes?
+- **Regression**: What existing behavior could this break? Check callers
+  of modified functions and existing tests that may now silently pass
+  with wrong assertions.
+- **Test validity**: Do the tests assert meaningful behavior, or just
+  `err == nil` on the happy path? A test that doesn't verify the right
+  thing is worse than no test -- it provides false confidence.
+
 ### Questions for the Author (Nemawashi)
 
 Only include questions when you have genuine uncertainty. Each MUST
@@ -542,6 +571,13 @@ gh api \
 - `**Suggestion**:` - Non-blocking improvement
 - `**Note**:` - Informational, no action needed
 
+Use `**MUST FIX**:` for any finding where the code would cause incorrect
+behavior, data loss, security vulnerability, or production incident if
+shipped as-is. Use `**Suggestion**:` for improvements to clarity,
+performance, or style that do not affect correctness. Default to MUST FIX
+when a finding affects correctness -- you can always downgrade after
+discussion, but you cannot un-ship a bug.
+
 **CRITICAL**:
 
 - Inline comments MUST be in the `comments` array when submitting