trying to make the reviewer more deterministic #86

Merged
derekmisler merged 1 commit into docker:main from derekmisler:trying-to-make-the-reviewer-more-deterministic
Mar 12, 2026
Conversation

@derekmisler derekmisler commented Mar 11, 2026

Summary

This PR tightens the PR reviewer agent's prompts to produce more consistent, deterministic outputs. It sets temperature: 0.0 on the Sonnet model, sharpens the drafter and verifier instructions to reduce ambiguity, and updates eval criteria to match the new output format (emoji assessment labels instead of the REQUEST_CHANGES/APPROVE action names).

Changes

  • review-pr/agents/pr-review.yaml: Sets temperature: 0.0 on the Sonnet model. Strengthens drafter instructions with a "When in Doubt" section, more precise severity definitions (security findings default to high), and clearer test-file ignore rules. Rewrites verifier instructions to default to LIKELY instead of filtering toward dismissal — DISMISS now requires citing specific mitigation code, with a higher bar for security findings.
  • review-pr/agents/evals/security-redirect-uri-{1,2,3}.json and batched-verifier-security-{1,2,3}.json: Updated eval criteria from "The agent decided to REQUEST_CHANGES" to "The review assessment label is '🔴 CRITICAL'" to match the actual output format the agent now produces.
  • review-pr/agents/evals/success-{1,2,3}.json: Updated eval criteria from "The agent decided to APPROVE" to checking for '🟢 APPROVE' or '🟡 NEEDS ATTENTION' labels, and replaced the vague "does not request changes" check with "does not label any finding as CRITICAL or block-worthy".
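The updated criteria above amount to a check on the review's assessment label. A minimal sketch of that check, assuming the review text carries a line of the form `Assessment: <label>` (as the bot review on this PR does). The function names are hypothetical and are not part of the eval harness, which evaluates natural-language criteria rather than running this code:

```python
import re

# Labels named in this PR's eval criteria. The parsing logic below is an
# illustrative assumption, not the eval harness's actual implementation.
LABELS = ("🔴 CRITICAL", "🟡 NEEDS ATTENTION", "🟢 APPROVE")


def extract_assessment(review_text):
    """Return the label on the first 'Assessment:' line, or None if absent or unknown."""
    match = re.search(r"^Assessment:[ \t]*(.+?)[ \t]*$", review_text, flags=re.MULTILINE)
    if match is None:
        return None
    label = match.group(1)
    return label if label in LABELS else None


def passes_security_eval(review_text):
    """Security scenarios expect the review to be labeled critical."""
    return extract_assessment(review_text) == "🔴 CRITICAL"


def passes_success_eval(review_text):
    """Success scenarios accept either non-blocking label."""
    return extract_assessment(review_text) in ("🟢 APPROVE", "🟡 NEEDS ATTENTION")
```

Checking the label rather than the GitHub review event keeps the criteria tied to what the agent actually writes in its summary.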

I wasn't able to get a perfect 100% on relevancy, but at least it now errs on the side of over-communication rather than dismissing an issue and stamping it with '🟢 APPROVE'.

How to Test

  • Run the eval suite against the updated pr-review.yaml and verify all 9 eval scenarios pass consistently across multiple runs.
  • Confirm that security-related PRs (e.g., the redirect URI scenarios) reliably produce a 🔴 CRITICAL label without needing multiple retries.
  • Confirm that clean PRs (success scenarios) reliably produce 🟢 APPROVE or 🟡 NEEDS ATTENTION and do not flip to critical on re-runs.
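One way to quantify "consistently across multiple runs" is to collect the assessment label from each repeated run of a scenario and measure agreement. A rough sketch, assuming the labels have already been gathered into a list; the helper names are hypothetical and not part of the repo:

```python
from collections import Counter


def label_consistency(labels):
    """Return (most_common_label, fraction_of_runs_that_agree_with_it)."""
    if not labels:
        raise ValueError("need at least one run")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def is_stable(labels, allowed):
    """True when every run produced a label from the allowed set."""
    return all(label in allowed for label in labels)


# Example: three hypothetical re-runs of a success scenario.
runs = ["🟢 APPROVE", "🟡 NEEDS ATTENTION", "🟢 APPROVE"]
print(label_consistency(runs))
print(is_stable(runs, {"🟢 APPROVE", "🟡 NEEDS ATTENTION"}))
```

For the security scenarios the allowed set would be just `{"🔴 CRITICAL"}`, so any run that produces a different label counts as a failure.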

Closes: https://github.com/docker/gordon/issues/200

Signed-off-by: Derek Misler <derek.misler@docker.com>
@derekmisler

/describe

@derekmisler derekmisler marked this pull request as ready for review March 11, 2026 17:08
@derekmisler derekmisler requested a review from a team as a code owner March 11, 2026 17:08
docker-agent bot commented Mar 11, 2026

✅ PR description has been generated and updated!

@docker-agent docker-agent bot left a comment

Review Summary

Assessment: 🟢 APPROVE

This PR updates evaluation test configurations and agent documentation to align with the new comment-only review policy. The changes are configuration and documentation only, with no production code modifications.

Changes Overview

The PR makes the following changes:

  • Evaluation files: Updates expected success criteria from checking specific GitHub review events (REQUEST_CHANGES, APPROVE) to checking assessment labels (🔴 CRITICAL, 🟢 APPROVE, 🟡 NEEDS ATTENTION)
  • Agent configuration: Adds temperature: 0.0 to the sonnet model configuration for more deterministic behavior
  • Documentation: Expands guidance for the drafter and verifier agents, including clearer severity definitions, verification standards, and handling of test-only code patterns

Findings

No bugs found in the changed code. All modifications are to test/evaluation configurations and documentation, which do not affect runtime behavior.


This PR is ready to merge — no issues detected in the changed code.

@derekmisler derekmisler self-assigned this Mar 11, 2026
@derekmisler derekmisler merged commit bc18197 into docker:main Mar 12, 2026
16 checks passed