What happened
On PR #531, the review agent flagged k8s.io version skew in e2e-tests/go.mod approximately 10 times across different review runs. The finding was substantively identical each time (same modules, same version mismatch), but the assigned severity varied from low to critical across runs. For example, the same k8s.io staging module skew pattern was rated low (api-contract) in one run, medium in another, and critical in a third, despite the diff being unchanged or near-identical after rebases.
What could go better
Severity inconsistency erodes trust in the review agent's judgment. If a human sees the same issue rated critical one day and low the next on the same code, they learn to discount all severity labels. Issue #2746 addresses non-deterministic verdicts (approve vs. request changes) on unchanged commits, but it does not address severity-level instability for individual findings. The likely root cause is that the LLM assigns severity independently on each run, without anchoring to prior assessments of the same pattern. Confidence: medium. The severity variation could partly reflect genuine differences in the surrounding diff context after rebases, but the core finding (k8s.io version skew) was identical.
Proposed change
In the review agent's finding-generation prompt or post-processing logic, add severity anchoring for recurring findings. When the agent detects a finding that matches a prior finding on the same PR (same file, same or overlapping diff region, same category tag), it should either (a) reuse the prior severity unless it can articulate a specific reason for the change, or (b) surface the prior severity as context in its prompt so the LLM can make a deliberate decision rather than an independent re-assessment. This could be implemented in the review agent definition or in a review post-processing skill. The simplest approach: when generating the review, include a summary of prior findings from the sticky comment as context, with their severities, and instruct the agent to maintain severity consistency unless the code has materially changed.
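The matching rule and both options above can be sketched in a few lines of Python. This is a minimal illustration, not the review agent's actual code: the `Finding` type, its field names, the line-range representation of a diff region, and the helpers `matches`, `anchor_severity`, and `prior_findings_context` are all hypothetical, and the line numbers in the usage example are made up.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Finding:
    # Hypothetical shape of a review finding; field names are assumptions.
    file: str
    category: str          # category tag, e.g. "api-contract"
    start_line: int        # diff region, modeled here as an inclusive line range
    end_line: int
    severity: str
    justification: Optional[str] = None  # reason given for a severity change


def matches(a: Finding, b: Finding) -> bool:
    """Same file, same category tag, same or overlapping region."""
    return (
        a.file == b.file
        and a.category == b.category
        and a.start_line <= b.end_line
        and b.start_line <= a.end_line
    )


def anchor_severity(new: Finding, prior: List[Finding]) -> Finding:
    """Option (a): reuse the most recent matching prior severity unless
    the new finding carries an explicit justification for the change."""
    for old in reversed(prior):  # prior is ordered oldest-first
        if matches(new, old):
            if new.severity != old.severity and not new.justification:
                return replace(new, severity=old.severity)
            return new
    return new  # no prior match: keep the fresh assessment


def prior_findings_context(prior: List[Finding]) -> str:
    """Option (b): summarize prior findings and severities for the prompt."""
    lines = ["Prior findings on this PR (keep severity unless the code materially changed):"]
    for f in prior:
        lines.append(f"- {f.file} [{f.category}] severity={f.severity}")
    return "\n".join(lines)


prior = [Finding("e2e-tests/go.mod", "api-contract", 10, 40, "low")]
rerun = Finding("e2e-tests/go.mod", "api-contract", 12, 38, "critical")
print(anchor_severity(rerun, prior).severity)  # low
```

In this sketch a changed severity survives only when the finding carries a non-empty `justification`, which keeps the "deliberate decision" requirement checkable in post-processing rather than relying on the prompt alone.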
Validation criteria
On the next PR where the review agent produces the same finding across 3+ review iterations, the severity rating should be consistent (same level) or, if changed, accompanied by an explicit justification referencing what changed. Verify by auditing severity labels on 5 PRs with multiple review iterations and confirming <10% unjustified severity drift for recurring findings.
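One way the audit could be computed is sketched below. The issue does not define the denominator for "severity drift", so this sketch assumes it is the number of consecutive re-occurrences of the same finding on the same PR; the function name, the tuple layout, and the sample data are hypothetical and purely illustrative.

```python
from typing import Dict, Iterable, Tuple

# (pr, finding_key, severity, justified) in review order; layout is an assumption.
Observation = Tuple[str, str, str, bool]


def unjustified_drift_rate(observations: Iterable[Observation]) -> float:
    """Fraction of consecutive re-occurrences of a finding where the
    severity changed without an accompanying justification."""
    last: Dict[Tuple[str, str], str] = {}
    transitions = 0
    unjustified = 0
    for pr, key, severity, justified in observations:
        k = (pr, key)
        if k in last:
            transitions += 1
            if severity != last[k] and not justified:
                unjustified += 1
        last[k] = severity
    return unjustified / transitions if transitions else 0.0


# Illustrative data only, not taken from any real PR.
sample = [
    ("pr-a", "k8s-skew", "low", False),
    ("pr-a", "k8s-skew", "medium", False),    # changed, no justification
    ("pr-a", "k8s-skew", "critical", True),   # changed, justified
    ("pr-a", "k8s-skew", "critical", False),  # unchanged
]
rate = unjustified_drift_rate(sample)
print(f"{rate:.2%}")  # 33.33%
```

Under this definition the validation target would be a rate below 0.10 across the 5 audited PRs.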
Generated by retro agent from konflux-ci/build-service#531