trying to make the reviewer more deterministic (#86)

derekmisler · web-flow · commit bc1819770f27 · 2026-03-12T05:54:39.000-04:00
diff --git a/review-pr/agents/evals/batched-verifier-security-1.json b/review-pr/agents/evals/batched-verifier-security-1.json
@@ -10,7 +10,7 @@
       "At least one finding flags the missing redirect_uri validation as a security concern",
       "The agent delegated all high and medium severity findings to the verifier in a single batch delegation, not multiple individual delegations",
       "The verifier returned a JSON response with a 'verdicts' array containing one verdict per finding",
-      "The agent decided to REQUEST_CHANGES because there is at least one high-severity confirmed or likely finding"
+      "The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"
     ]
   },
   "messages": [
diff --git a/review-pr/agents/evals/batched-verifier-security-2.json b/review-pr/agents/evals/batched-verifier-security-2.json
@@ -10,7 +10,7 @@
       "At least one finding flags the missing redirect_uri validation as a security concern",
       "The agent delegated all high and medium severity findings to the verifier in a single batch delegation, not multiple individual delegations",
       "The verifier returned a JSON response with a 'verdicts' array containing one verdict per finding",
-      "The agent decided to REQUEST_CHANGES because there is at least one high-severity confirmed or likely finding"
+      "The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"
     ]
   },
   "messages": [
diff --git a/review-pr/agents/evals/batched-verifier-security-3.json b/review-pr/agents/evals/batched-verifier-security-3.json
@@ -10,7 +10,7 @@
       "At least one finding flags the missing redirect_uri validation as a security concern",
       "The agent delegated all high and medium severity findings to the verifier in a single batch delegation, not multiple individual delegations",
       "The verifier returned a JSON response with a 'verdicts' array containing one verdict per finding",
-      "The agent decided to REQUEST_CHANGES because there is at least one high-severity confirmed or likely finding"
+      "The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"
     ]
   },
   "messages": [
diff --git a/review-pr/agents/evals/security-redirect-uri-1.json b/review-pr/agents/evals/security-redirect-uri-1.json
@@ -9,7 +9,7 @@
       "The drafter response is valid JSON containing a 'findings' array and a 'summary' field",
       "At least one finding flags the missing redirect_uri validation as a security concern in oidc.go or service.go",
       "The finding about redirect_uri validation has severity 'high' because accepting arbitrary redirect URIs without server-side validation is a well-known OAuth security vulnerability",
-      "The agent decided to REQUEST_CHANGES (not APPROVE and not COMMENT) because there is at least one high-severity finding"
+      "The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"
     ]
   },
   "messages": [
diff --git a/review-pr/agents/evals/security-redirect-uri-2.json b/review-pr/agents/evals/security-redirect-uri-2.json
@@ -9,7 +9,7 @@
       "The drafter response is valid JSON containing a 'findings' array and a 'summary' field",
       "At least one finding flags the missing redirect_uri validation as a security concern in oidc.go or service.go",
       "The finding about redirect_uri validation has severity 'high' because accepting arbitrary redirect URIs without server-side validation is a well-known OAuth security vulnerability",
-      "The agent decided to REQUEST_CHANGES (not APPROVE and not COMMENT) because there is at least one high-severity finding"
+      "The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"
     ]
   },
   "messages": [
diff --git a/review-pr/agents/evals/security-redirect-uri-3.json b/review-pr/agents/evals/security-redirect-uri-3.json
@@ -9,7 +9,7 @@
       "The drafter response is valid JSON containing a 'findings' array and a 'summary' field",
       "At least one finding flags the missing redirect_uri validation as a security concern in oidc.go or service.go",
       "The finding about redirect_uri validation has severity 'high' because accepting arbitrary redirect URIs without server-side validation is a well-known OAuth security vulnerability",
-      "The agent decided to REQUEST_CHANGES (not APPROVE and not COMMENT) because there is at least one high-severity finding"
+      "The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"
     ]
   },
   "messages": [
diff --git a/review-pr/agents/evals/success-1.json b/review-pr/agents/evals/success-1.json
@@ -6,10 +6,10 @@
     "relevance": [
       "The agent ran 'echo $GITHUB_ACTIONS' before performing the review to detect the output mode",
       "The agent output the review to the console as formatted markdown instead of posting via gh api",
-      "The agent decided to APPROVE the PR",
+      "The review assessment label is '🟢 APPROVE' or '🟡 NEEDS ATTENTION' — it is NOT '🔴 CRITICAL'",
       "The drafter response is valid JSON containing a 'findings' array and a 'summary' field",
       "No findings have severity 'high' with verdict 'CONFIRMED' or 'LIKELY'",
-      "The review summary does not request changes or flag critical issues"
+      "The review does not label any finding as CRITICAL or block-worthy"
     ]
   },
   "messages": [
diff --git a/review-pr/agents/evals/success-2.json b/review-pr/agents/evals/success-2.json
@@ -6,10 +6,10 @@
     "relevance": [
       "The agent ran 'echo $GITHUB_ACTIONS' before performing the review to detect the output mode",
       "The agent output the review to the console as formatted markdown instead of posting via gh api",
-      "The agent decided to APPROVE the PR",
+      "The review assessment label is '🟢 APPROVE' or '🟡 NEEDS ATTENTION' — it is NOT '🔴 CRITICAL'",
       "The drafter response is valid JSON containing a 'findings' array and a 'summary' field",
       "No findings have severity 'high' with verdict 'CONFIRMED' or 'LIKELY'",
-      "The review summary does not request changes or flag critical issues"
+      "The review does not label any finding as CRITICAL or block-worthy"
     ]
   },
   "messages": [
diff --git a/review-pr/agents/evals/success-3.json b/review-pr/agents/evals/success-3.json
@@ -6,10 +6,10 @@
     "relevance": [
       "The agent ran 'echo $GITHUB_ACTIONS' before performing the review to detect the output mode",
       "The agent output the review to the console as formatted markdown instead of posting via gh api",
-      "The agent decided to APPROVE the PR",
+      "The review assessment label is '🟢 APPROVE' or '🟡 NEEDS ATTENTION' — it is NOT '🔴 CRITICAL'",
       "The drafter response is valid JSON containing a 'findings' array and a 'summary' field",
       "No findings have severity 'high' with verdict 'CONFIRMED' or 'LIKELY'",
-      "The review summary does not request changes or flag critical issues"
+      "The review does not label any finding as CRITICAL or block-worthy"
     ]
   },
   "messages": [
diff --git a/review-pr/agents/pr-review.yaml b/review-pr/agents/pr-review.yaml
@@ -2,6 +2,7 @@ models:
   sonnet:
     provider: anthropic
     model: claude-sonnet-4-5
+    temperature: 0.0
     max_tokens: 8192
   haiku:
     provider: anthropic
@@ -346,17 +347,37 @@ agents:
         goroutine leaks, range var capture in closures, mutex copied by value,
         context not propagated, channel deadlocks, panic in library code
 
+      ## When in Doubt
+
+      Err on the side of reporting. A finding that a human reviewer dismisses costs them
+      seconds. A missed finding that reaches production can cost much more. When uncertain
+      about whether something is a real issue, report it at medium severity and note your
+      uncertainty in the `details` field.
+
+      However, "when in doubt" does not mean "invent scenarios." You must be able to describe
+      a concrete trigger path in production code. Do not flag:
+      - Test-only code paths or standard testing patterns (mocking, stubbing, test doubles)
+      - Variables that are only mutated in test files
+      - Hypothetical issues that require ignoring the Ignore list above
+
       ## Ignore
 
-      Style, formatting, naming, documentation, test files.
+      Style, formatting, naming, documentation, test files (files ending in `_test.go`,
+      `*.test.ts`, `*.spec.js`, `test_*.py`, or in `__tests__`/`tests`/`test` directories).
       Existing code not changed in this PR.
       Missing imports/undefined references unless confirmed missing via `read_file`.
+      Standard testing patterns: overriding package-level variables for test doubles,
+      monkey-patching, mocking, stubbing — even when the variable is declared in production
+      code, if it is only mutated in test files it is not a production concurrency bug.
 
       ## Severity
 
-      - **high**: WILL cause harm — data loss, security breach, crash, outage. Rare.
-      - **medium**: COULD cause issues — race conditions, resource leaks, edge cases. Default.
-      - **low**: Code smells. Rarely report.
+      - **high**: WILL cause harm or HAS no visible mitigation — data loss, security vulnerabilities,
+        crashes, outages. All `security` category findings are high unless the diff contains
+        explicit validation/sanitization. Do not assume external systems validate inputs.
+      - **medium**: COULD cause issues under specific conditions — race conditions, resource leaks,
+        edge cases, error handling gaps.
+      - **low**: Code smells, minor inefficiencies. Rarely report.
 
       ## Output
 
@@ -445,11 +466,26 @@ agents:
       etc.) before confirming that an API or language feature doesn't exist.
 
       Verify each one independently.
-      Your job is to filter out false positives. For each finding, check if:
+      Your job is to verify findings, not to filter them out. Default to LIKELY unless you have
+      concrete evidence to DISMISS. For each finding:
       - **THE CODE IS ACTUALLY CHANGED IN THIS PR** (if not, DISMISS immediately)
-      - The bug can actually happen given the surrounding code
-      - Existing safeguards already prevent it
-      - Tests cover this case
+      - Can you find explicit safeguards in the diff or source files that prevent the bug?
+        Vague reasoning like "the caller probably validates" is NOT grounds for dismissal.
+      - Do tests in the diff specifically cover this edge case? General test existence is not enough.
+
+      **DISMISS requires proof.** You must cite the specific code (file + line) that prevents
+      the bug. If you cannot point to concrete mitigation, the verdict is LIKELY at minimum.
+
+      **Security findings have a higher bar for dismissal.** Only DISMISS a security finding
+      if you can show the exact validation/sanitization code that mitigates it. Do not assume
+      that external systems, gateways, or callers provide validation you cannot see.
+
+      **DISMISS test-only patterns.** If a finding is about code in a test file, or if the
+      only "trigger" for the bug is test code (e.g., a variable reassigned only in tests,
+      monkey-patching, test doubles, mocking), DISMISS it. Standard testing patterns like
+      overriding a package-level function variable in a test with cleanup are not production
+      bugs. The drafter's Ignore list excludes test files, so these should not reach you —
+      but if they do, dismiss them.
 
       ## Reading Files for Context
 
@@ -481,9 +517,10 @@ agents:
       array. Return one verdict per finding you were given. For each verdict:
 
       - `verdict`: One of `"CONFIRMED"`, `"LIKELY"`, `"DISMISSED"`
-        - CONFIRMED: Definitely a bug in changed code
-        - LIKELY: Probably a bug in changed code
-        - DISMISSED: Not a bug OR not in changed code
+        - CONFIRMED: Bug verified — you found no mitigation in the source code
+        - LIKELY: Probable bug — you could not fully verify but found no evidence against it
+        - DISMISSED: Proven not a bug — you can cite the specific code that prevents it, OR
+          the finding is not in code changed by this PR
       - `file`: Preserve the file path from the drafter's finding
       - `line`: Preserve or correct the line number from the drafter's finding
       - `severity`: You may adjust severity from what the drafter assigned based on full context

Original file line number	Diff line number	Diff line change
`@@ -10,7 +10,7 @@`
`10`	`10`	`"At least one finding flags the missing redirect_uri validation as a security concern",`
`11`	`11`	`"The agent delegated all high and medium severity findings to the verifier in a single batch delegation, not multiple individual delegations",`
`12`	`12`	`"The verifier returned a JSON response with a 'verdicts' array containing one verdict per finding",`
`13`		`- "The agent decided to REQUEST_CHANGES because there is at least one high-severity confirmed or likely finding"`
	`13`	`+ "The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"`
`14`	`14`	`]`
`15`	`15`	`},`
`16`	`16`	`"messages": [`
Original file line number	Diff line number	Diff line change
`@@ -9,7 +9,7 @@`
`9`	`9`	`"The drafter response is valid JSON containing a 'findings' array and a 'summary' field",`
`10`	`10`	`"At least one finding flags the missing redirect_uri validation as a security concern in oidc.go or service.go",`
`11`	`11`	`"The finding about redirect_uri validation has severity 'high' because accepting arbitrary redirect URIs without server-side validation is a well-known OAuth security vulnerability",`
`12`		`- "The agent decided to REQUEST_CHANGES (not APPROVE and not COMMENT) because there is at least one high-severity finding"`
	`12`	`+ "The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"`
`13`	`13`	`]`
`14`	`14`	`},`
`15`	`15`	`"messages": [`