What happened
On PR #2761, the review agent reported two medium-severity findings: (1) RETRO_COMMENT lost its default-to-empty semantics when the migration replaced the shell expansion ${RETRO_COMMENT:-} with the YAML reference ${RETRO_COMMENT} (behavior-change category), and (2) scaffold_test.go was not updated to verify keys in Env.Runner (test-integrity category). Both findings were submitted as COMMENTED reviews. The human reviewer approved without engaging with either finding, and the PR was merged with both issues unresolved. Qodo's review bot independently confirmed the RETRO_COMMENT finding.
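To illustrate the first finding, here is a minimal Python sketch of the difference between a shell-style default expansion and a plain substitution with no default handling. The helper names are hypothetical, and Python's string.Template only stands in for the YAML substitution layer, whose actual handling of unset variables is not described in this issue.

```python
from string import Template


def shell_default_expand(env: dict, name: str) -> str:
    """Emulate the shell form ${NAME:-}: yields "" when NAME is unset or empty."""
    return env.get(name) or ""


def plain_expand(env: dict, name: str) -> str:
    """Emulate a substituter with no default handling (assumed behavior).

    safe_substitute leaves unknown placeholders untouched, so an unset
    variable comes back as the literal text "${NAME}".
    """
    return Template("${" + name + "}").safe_substitute(env)


if __name__ == "__main__":
    unset_env = {}
    print(repr(shell_default_expand(unset_env, "RETRO_COMMENT")))
    print(repr(plain_expand(unset_env, "RETRO_COMMENT")))
```

When the variable is set, both helpers return the same value. They diverge only when it is unset, which is why this kind of change can pass a casual review.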
What could go better
When the review agent posts medium-severity findings as COMMENTED rather than CHANGES_REQUESTED, they blend into the PR timeline and are easily overlooked, especially when a human reviewer is scanning for blocking issues. On this PR, the agent's findings were more substantive than the human review, yet they had no influence on the merge decision. Escalating medium-severity findings in high-confidence categories (behavior-change, correctness, test-integrity) to CHANGES_REQUESTED would make them visible in the GitHub review UI's merge readiness indicator. Confidence in this recommendation is medium: it rests on a single PR, and the pattern may not generalize. The main risk is false-positive blocking if the medium-severity threshold is too sensitive.
Proposed change
Evaluate the review agent's verdict logic (likely in the review skill or post-script that translates findings into GitHub review status) to determine whether medium-severity findings in behavior-change, correctness, and test-integrity categories should produce a CHANGES_REQUESTED review instead of COMMENTED. Scope narrowly to these high-confidence categories to avoid false-positive blocking on style, documentation-drift, or redundancy findings. Consider a shadow-mode trial where the agent logs what it would have escalated without actually blocking.
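The proposed rule could be sketched as follows. This is only an illustration under stated assumptions: the Finding type, the function names, and the shadow_mode flag are hypothetical and do not describe the review agent's existing code. The sketch also covers only the medium-severity path; how other severities map to a verdict is left out.

```python
from dataclasses import dataclass

# Categories proposed for escalation in this issue.
ESCALATED_CATEGORIES = {"behavior-change", "correctness", "test-integrity"}


@dataclass(frozen=True)
class Finding:
    severity: str  # e.g. "medium"
    category: str  # e.g. "behavior-change", "style"


def would_escalate(finding: Finding) -> bool:
    """True for medium-severity findings in one of the escalated categories."""
    return finding.severity == "medium" and finding.category in ESCALATED_CATEGORIES


def review_verdict(findings, shadow_mode=True):
    """Return (review_state, escalations).

    In shadow mode the state stays COMMENTED, and the escalations are
    returned only so the caller can log what would have been blocked.
    """
    escalations = [f for f in findings if would_escalate(f)]
    if escalations and not shadow_mode:
        return "CHANGES_REQUESTED", escalations
    return "COMMENTED", escalations


if __name__ == "__main__":
    pr_findings = [
        Finding("medium", "behavior-change"),
        Finding("medium", "test-integrity"),
        Finding("medium", "style"),
    ]
    print(review_verdict(pr_findings, shadow_mode=True))
    print(review_verdict(pr_findings, shadow_mode=False))
```

Defaulting shadow_mode to True keeps the trial non-blocking unless escalation is switched on explicitly.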
Validation criteria
On the next 10 PRs where the review agent identifies a medium-severity behavior-change or test-integrity finding, track whether the finding was addressed before merge. Compare the resolution rate to the pre-change baseline (where this PR serves as one data point of 0% resolution). Target: 80%+ of medium-severity findings in these categories are explicitly acknowledged (fixed, dismissed with rationale, or overridden) before merge. Also monitor for false-positive blocks that waste author time: if more than 20% of escalations are dismissed as not-actionable, narrow the criteria.
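The two thresholds above could be checked with a small tally like the one below. It is a sketch: the Escalation record, the outcome labels, and evaluate_trial are hypothetical names, and the sample data in the usage section is made up purely to exercise the arithmetic.

```python
from dataclasses import dataclass

# Outcomes that count as "explicitly acknowledged" per the criteria above.
ACKNOWLEDGED = {"fixed", "dismissed", "overridden"}


@dataclass(frozen=True)
class Escalation:
    outcome: str  # "fixed", "dismissed", "overridden", or "ignored"
    not_actionable: bool = False  # dismissed as a false positive


def evaluate_trial(escalations, ack_target=0.80, noise_limit=0.20):
    """Compute the acknowledgement rate and the not-actionable rate."""
    total = len(escalations)
    if total == 0:
        raise ValueError("no escalations recorded yet")
    ack_rate = sum(e.outcome in ACKNOWLEDGED for e in escalations) / total
    noise_rate = sum(e.not_actionable for e in escalations) / total
    return {
        "ack_rate": ack_rate,
        "noise_rate": noise_rate,
        "target_met": ack_rate >= ack_target,
        "narrow_criteria": noise_rate > noise_limit,
    }


if __name__ == "__main__":
    # Illustrative data only, not observed results.
    sample = (
        [Escalation("fixed")] * 6
        + [Escalation("dismissed", not_actionable=True), Escalation("dismissed")]
        + [Escalation("overridden"), Escalation("ignored")]
    )
    print(evaluate_trial(sample))
```

Under this model, the baseline from PR #2761 is two findings with outcome "ignored", which gives an acknowledgement rate of 0.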
Generated by retro agent from #2761