trying to make the reviewer more deterministic #86
Merged
derekmisler merged 1 commit into docker:main on Mar 12, 2026
Signed-off-by: Derek Misler <derek.misler@docker.com>
/describe
✅ PR description has been generated and updated!
Review Summary
Assessment: 🟢 APPROVE
This PR updates evaluation test configurations and agent documentation to align with the new comment-only review policy. The changes are configuration and documentation only, with no production code modifications.
Changes Overview
The PR makes the following changes:
- Evaluation files: Updates expected success criteria from checking specific GitHub review events (`REQUEST_CHANGES`, `APPROVE`) to checking assessment labels (`🔴 CRITICAL`, `🟢 APPROVE`, `🟡 NEEDS ATTENTION`)
- Agent configuration: Adds `temperature: 0.0` to the sonnet model configuration for more deterministic behavior
- Documentation: Expands guidance for the drafter and verifier agents, including clearer severity definitions, verification standards, and handling of test-only code patterns
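To make the shift from review events to assessment labels concrete, here is a minimal sketch of pulling the label out of a review comment. It assumes the label sits on an `Assessment:` line, as it does in this review; the function name `extract_assessment_label` is hypothetical, and the real eval criteria in this PR are natural-language strings, not this string check.

```python
import re
from typing import Optional

# Labels named in this PR. The "Assessment:" line format is an assumption
# based on the review comment in this conversation.
LABELS = ("🔴 CRITICAL", "🟡 NEEDS ATTENTION", "🟢 APPROVE")


def extract_assessment_label(review_text: str) -> Optional[str]:
    """Return the label from the first valid 'Assessment:' line, or None."""
    for line in review_text.splitlines():
        match = re.match(r"\s*Assessment:\s*(.+?)\s*$", line)
        if match and match.group(1) in LABELS:
            return match.group(1)
    return None


if __name__ == "__main__":
    sample = "Review Summary\nAssessment: 🟢 APPROVE\nFindings\n"
    print(extract_assessment_label(sample))
```

Checking the label text rather than the review event keeps a check like this valid under a comment-only policy, where no `APPROVE` or `REQUEST_CHANGES` event is submitted.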
Findings
No bugs found in the changed code. All modifications are to test/evaluation configurations and documentation, which do not affect runtime behavior.
✅ This PR is ready to merge — no issues detected in the changed code.
trungutt approved these changes on Mar 12, 2026
Summary
This PR tightens the PR reviewer agent's prompts to produce more consistent, deterministic outputs. It sets `temperature: 0.0` on the Sonnet model, sharpens the drafter and verifier instructions to reduce ambiguity, and updates eval criteria to match the new output format (emoji-prefixed assessment labels instead of `REQUEST_CHANGES`/`APPROVE` action names).
Changes
- `review-pr/agents/pr-review.yaml`: Sets `temperature: 0.0` on the Sonnet model. Strengthens drafter instructions with a "When in Doubt" section, more precise severity definitions (security findings default to high), and clearer test-file ignore rules. Rewrites verifier instructions to default to `LIKELY` instead of filtering toward dismissal: `DISMISS` now requires citing specific mitigation code, with a higher bar for security findings.
- `review-pr/agents/evals/security-redirect-uri-{1,2,3}.json` and `batched-verifier-security-{1,2,3}.json`: Updated eval criteria from `"The agent decided to REQUEST_CHANGES"` to `"The review assessment label is '🔴 CRITICAL'"` to match the actual output format the agent now produces.
- `review-pr/agents/evals/success-{1,2,3}.json`: Updated eval criteria from `"The agent decided to APPROVE"` to checking for `'🟢 APPROVE'` or `'🟡 NEEDS ATTENTION'` labels, and replaced the vague "does not request changes" check with `"does not label any finding as CRITICAL or block-worthy"`.
I wasn't able to get a perfect 100% on relevancy, but at least it errs on the side of over-communication rather than dismissing an issue and stamping it with `'🟢 APPROVE'`.
How to Test
- … `pr-review.yaml` and verify all 9 eval scenarios pass consistently across multiple runs.
- … `🔴 CRITICAL` label without needing multiple retries.
- … `🟢 APPROVE` or `🟡 NEEDS ATTENTION` and do not flip to critical on re-runs.
Closes: https://github.com/docker/gordon/issues/200
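The "pass consistently across multiple runs" check can be sketched as follows. This is only an illustration under assumptions: it takes the labels from repeated runs of one scenario as a plain list, and the names `passes_consistently`, `summarize`, `SECURITY_LABELS`, and `SUCCESS_LABELS` are hypothetical, not part of the repository or its eval tooling.

```python
from collections import Counter

# Allowed labels per scenario family, taken from the eval criteria described
# in this PR. The set names are hypothetical.
SECURITY_LABELS = {"🔴 CRITICAL"}
SUCCESS_LABELS = {"🟢 APPROVE", "🟡 NEEDS ATTENTION"}


def passes_consistently(run_labels, allowed):
    """True only if there is at least one run and every run's label is allowed."""
    return bool(run_labels) and all(label in allowed for label in run_labels)


def summarize(run_labels):
    """Count how often each label appeared, to make flips easy to spot."""
    return dict(Counter(run_labels))


if __name__ == "__main__":
    runs = ["🟢 APPROVE", "🟡 NEEDS ATTENTION", "🟢 APPROVE"]
    print(passes_consistently(runs, SUCCESS_LABELS), summarize(runs))
```

A success scenario that returns `🔴 CRITICAL` even once fails this check, which matches the "do not flip to critical on re-runs" requirement.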