Skip to content

Commit bc18197

Browse files
authored
trying to make the reviewer more deterministic (#86)
1 parent 0635d86 commit bc18197

10 files changed

+60
-23
lines changed

review-pr/agents/evals/batched-verifier-security-1.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@
1010
"At least one finding flags the missing redirect_uri validation as a security concern",
1111
"The agent delegated all high and medium severity findings to the verifier in a single batch delegation, not multiple individual delegations",
1212
"The verifier returned a JSON response with a 'verdicts' array containing one verdict per finding",
13-
"The agent decided to REQUEST_CHANGES because there is at least one high-severity confirmed or likely finding"
13+
"The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"
1414
]
1515
},
1616
"messages": [

review-pr/agents/evals/batched-verifier-security-2.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@
1010
"At least one finding flags the missing redirect_uri validation as a security concern",
1111
"The agent delegated all high and medium severity findings to the verifier in a single batch delegation, not multiple individual delegations",
1212
"The verifier returned a JSON response with a 'verdicts' array containing one verdict per finding",
13-
"The agent decided to REQUEST_CHANGES because there is at least one high-severity confirmed or likely finding"
13+
"The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"
1414
]
1515
},
1616
"messages": [

review-pr/agents/evals/batched-verifier-security-3.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@
1010
"At least one finding flags the missing redirect_uri validation as a security concern",
1111
"The agent delegated all high and medium severity findings to the verifier in a single batch delegation, not multiple individual delegations",
1212
"The verifier returned a JSON response with a 'verdicts' array containing one verdict per finding",
13-
"The agent decided to REQUEST_CHANGES because there is at least one high-severity confirmed or likely finding"
13+
"The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"
1414
]
1515
},
1616
"messages": [

review-pr/agents/evals/security-redirect-uri-1.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@
99
"The drafter response is valid JSON containing a 'findings' array and a 'summary' field",
1010
"At least one finding flags the missing redirect_uri validation as a security concern in oidc.go or service.go",
1111
"The finding about redirect_uri validation has severity 'high' because accepting arbitrary redirect URIs without server-side validation is a well-known OAuth security vulnerability",
12-
"The agent decided to REQUEST_CHANGES (not APPROVE and not COMMENT) because there is at least one high-severity finding"
12+
"The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"
1313
]
1414
},
1515
"messages": [

review-pr/agents/evals/security-redirect-uri-2.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@
99
"The drafter response is valid JSON containing a 'findings' array and a 'summary' field",
1010
"At least one finding flags the missing redirect_uri validation as a security concern in oidc.go or service.go",
1111
"The finding about redirect_uri validation has severity 'high' because accepting arbitrary redirect URIs without server-side validation is a well-known OAuth security vulnerability",
12-
"The agent decided to REQUEST_CHANGES (not APPROVE and not COMMENT) because there is at least one high-severity finding"
12+
"The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"
1313
]
1414
},
1515
"messages": [

review-pr/agents/evals/security-redirect-uri-3.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@
99
"The drafter response is valid JSON containing a 'findings' array and a 'summary' field",
1010
"At least one finding flags the missing redirect_uri validation as a security concern in oidc.go or service.go",
1111
"The finding about redirect_uri validation has severity 'high' because accepting arbitrary redirect URIs without server-side validation is a well-known OAuth security vulnerability",
12-
"The agent decided to REQUEST_CHANGES (not APPROVE and not COMMENT) because there is at least one high-severity finding"
12+
"The review assessment label is '🔴 CRITICAL' because there is at least one high-severity confirmed or likely finding"
1313
]
1414
},
1515
"messages": [

review-pr/agents/evals/success-1.json

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,10 +6,10 @@
66
"relevance": [
77
"The agent ran 'echo $GITHUB_ACTIONS' before performing the review to detect the output mode",
88
"The agent output the review to the console as formatted markdown instead of posting via gh api",
9-
"The agent decided to APPROVE the PR",
9+
"The review assessment label is '🟢 APPROVE' or '🟡 NEEDS ATTENTION' — it is NOT '🔴 CRITICAL'",
1010
"The drafter response is valid JSON containing a 'findings' array and a 'summary' field",
1111
"No findings have severity 'high' with verdict 'CONFIRMED' or 'LIKELY'",
12-
"The review summary does not request changes or flag critical issues"
12+
"The review does not label any finding as CRITICAL or block-worthy"
1313
]
1414
},
1515
"messages": [

review-pr/agents/evals/success-2.json

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,10 +6,10 @@
66
"relevance": [
77
"The agent ran 'echo $GITHUB_ACTIONS' before performing the review to detect the output mode",
88
"The agent output the review to the console as formatted markdown instead of posting via gh api",
9-
"The agent decided to APPROVE the PR",
9+
"The review assessment label is '🟢 APPROVE' or '🟡 NEEDS ATTENTION' — it is NOT '🔴 CRITICAL'",
1010
"The drafter response is valid JSON containing a 'findings' array and a 'summary' field",
1111
"No findings have severity 'high' with verdict 'CONFIRMED' or 'LIKELY'",
12-
"The review summary does not request changes or flag critical issues"
12+
"The review does not label any finding as CRITICAL or block-worthy"
1313
]
1414
},
1515
"messages": [

review-pr/agents/evals/success-3.json

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,10 +6,10 @@
66
"relevance": [
77
"The agent ran 'echo $GITHUB_ACTIONS' before performing the review to detect the output mode",
88
"The agent output the review to the console as formatted markdown instead of posting via gh api",
9-
"The agent decided to APPROVE the PR",
9+
"The review assessment label is '🟢 APPROVE' or '🟡 NEEDS ATTENTION' — it is NOT '🔴 CRITICAL'",
1010
"The drafter response is valid JSON containing a 'findings' array and a 'summary' field",
1111
"No findings have severity 'high' with verdict 'CONFIRMED' or 'LIKELY'",
12-
"The review summary does not request changes or flag critical issues"
12+
"The review does not label any finding as CRITICAL or block-worthy"
1313
]
1414
},
1515
"messages": [

review-pr/agents/pr-review.yaml

Lines changed: 48 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@ models:
22
sonnet:
33
provider: anthropic
44
model: claude-sonnet-4-5
5+
temperature: 0.0
56
max_tokens: 8192
67
haiku:
78
provider: anthropic
@@ -346,17 +347,37 @@ agents:
346347
goroutine leaks, range var capture in closures, mutex copied by value,
347348
context not propagated, channel deadlocks, panic in library code
348349
350+
## When in Doubt
351+
352+
Err on the side of reporting. A finding that a human reviewer dismisses costs them
353+
seconds. A missed finding that reaches production can cost much more. When uncertain
354+
about whether something is a real issue, report it at medium severity and note your
355+
uncertainty in the `details` field.
356+
357+
However, "when in doubt" does not mean "invent scenarios." You must be able to describe
358+
a concrete trigger path in production code. Do not flag:
359+
- Test-only code paths or standard testing patterns (mocking, stubbing, test doubles)
360+
- Variables that are only mutated in test files
361+
- Hypothetical issues that require ignoring the Ignore list above
362+
349363
## Ignore
350364
351-
Style, formatting, naming, documentation, test files.
365+
Style, formatting, naming, documentation, test files (files ending in `_test.go`,
366+
`*.test.ts`, `*.spec.js`, `test_*.py`, or in `__tests__`/`tests`/`test` directories).
352367
Existing code not changed in this PR.
353368
Missing imports/undefined references unless confirmed missing via `read_file`.
369+
Standard testing patterns: overriding package-level variables for test doubles,
370+
monkey-patching, mocking, stubbing — even when the variable is declared in production
371+
code, if it is only mutated in test files it is not a production concurrency bug.
354372
355373
## Severity
356374
357-
- **high**: WILL cause harm — data loss, security breach, crash, outage. Rare.
358-
- **medium**: COULD cause issues — race conditions, resource leaks, edge cases. Default.
359-
- **low**: Code smells. Rarely report.
375+
- **high**: WILL cause harm or HAS no visible mitigation — data loss, security vulnerabilities,
376+
crashes, outages. All `security` category findings are high unless the diff contains
377+
explicit validation/sanitization. Do not assume external systems validate inputs.
378+
- **medium**: COULD cause issues under specific conditions — race conditions, resource leaks,
379+
edge cases, error handling gaps.
380+
- **low**: Code smells, minor inefficiencies. Rarely report.
360381
361382
## Output
362383
@@ -445,11 +466,26 @@ agents:
445466
etc.) before confirming that an API or language feature doesn't exist.
446467
447468
Verify each one independently.
448-
Your job is to filter out false positives. For each finding, check if:
469+
Your job is to verify findings, not to filter them out. Default to LIKELY unless you have
470+
concrete evidence to DISMISS. For each finding:
449471
- **THE CODE IS ACTUALLY CHANGED IN THIS PR** (if not, DISMISS immediately)
450-
- The bug can actually happen given the surrounding code
451-
- Existing safeguards already prevent it
452-
- Tests cover this case
472+
- Can you find explicit safeguards in the diff or source files that prevent the bug?
473+
Vague reasoning like "the caller probably validates" is NOT grounds for dismissal.
474+
- Do tests in the diff specifically cover this edge case? General test existence is not enough.
475+
476+
**DISMISS requires proof.** You must cite the specific code (file + line) that prevents
477+
the bug. If you cannot point to concrete mitigation, the verdict is LIKELY at minimum.
478+
479+
**Security findings have a higher bar for dismissal.** Only DISMISS a security finding
480+
if you can show the exact validation/sanitization code that mitigates it. Do not assume
481+
that external systems, gateways, or callers provide validation you cannot see.
482+
483+
**DISMISS test-only patterns.** If a finding is about code in a test file, or if the
484+
only "trigger" for the bug is test code (e.g., a variable reassigned only in tests,
485+
monkey-patching, test doubles, mocking), DISMISS it. Standard testing patterns like
486+
overriding a package-level function variable in a test with cleanup are not production
487+
bugs. The drafter's Ignore list excludes test files, so these should not reach you —
488+
but if they do, dismiss them.
453489
454490
## Reading Files for Context
455491
@@ -481,9 +517,10 @@ agents:
481517
array. Return one verdict per finding you were given. For each verdict:
482518
483519
- `verdict`: One of `"CONFIRMED"`, `"LIKELY"`, `"DISMISSED"`
484-
- CONFIRMED: Definitely a bug in changed code
485-
- LIKELY: Probably a bug in changed code
486-
- DISMISSED: Not a bug OR not in changed code
520+
- CONFIRMED: Bug verified — you found no mitigation in the source code
521+
- LIKELY: Probable bug — you could not fully verify but found no evidence against it
522+
- DISMISSED: Proven not a bug — you can cite the specific code that prevents it, OR
523+
the finding is not in code changed by this PR
487524
- `file`: Preserve the file path from the drafter's finding
488525
- `line`: Preserve or correct the line number from the drafter's finding
489526
- `severity`: You may adjust severity from what the drafter assigned based on full context

0 commit comments

Comments
 (0)