Stabilize severity ratings for recurring findings across re-review runs #2993

Description

@fullsend-ai-retro

What happened

On PR #531, the review agent flagged k8s.io version skew in e2e-tests/go.mod approximately 10 times across different review runs. The finding was substantively identical each time (same modules, same version mismatch), but the assigned severity varied from low to critical across runs. For example, the same k8s.io staging module skew pattern was rated low (api-contract) in one run, medium in another, and critical in a third, despite the diff being unchanged or near-identical after rebases.

What could go better

Severity inconsistency erodes trust in the review agent's judgment. If a human sees the same issue rated critical one day and low the next on the same code, they learn to discount all severity labels. Issue #2746 addresses non-deterministic verdicts (approve vs. request changes) on unchanged commits, but it does not address severity-level instability for individual findings. The likely root cause is that the LLM assigns severity independently in each run, without anchoring to prior assessments of the same pattern. Confidence: medium. The severity variation could partly reflect genuine differences in the surrounding diff context after rebases, but the core finding (k8s.io version skew) was identical.

Proposed change

In the review agent's finding-generation prompt or post-processing logic, add severity anchoring for recurring findings. When the agent detects a finding that matches a prior finding on the same PR (same file, same or overlapping diff region, same category tag), it should either (a) reuse the prior severity unless it can articulate a specific reason for the change, or (b) surface the prior severity as context in its prompt so the LLM can make a deliberate decision rather than an independent re-assessment. This could be implemented in the review agent definition or in a review post-processing skill. The simplest approach: when generating the review, include a summary of prior findings from the sticky comment as context, with their severities, and instruct the agent to maintain severity consistency unless the code has materially changed.
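As a rough illustration of option (a) and of the "prior findings as context" idea, here is a minimal Python sketch of a post-processing step. Everything in it is an assumption made for illustration: the `Finding` fields, the `change_reason` field, and the function names are hypothetical and do not reflect the review agent's actual data model or prompt format.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Finding:
    # Hypothetical shape of a review finding; field names are assumptions.
    file: str
    start_line: int
    end_line: int
    category: str
    severity: str
    change_reason: Optional[str] = None  # stated reason for a severity change


def same_finding(a: Finding, b: Finding) -> bool:
    """Match rule from the proposal: same file, overlapping region, same category."""
    overlaps = a.start_line <= b.end_line and b.start_line <= a.end_line
    return a.file == b.file and a.category == b.category and overlaps


def anchor_severities(prior: List[Finding], current: List[Finding]) -> List[Finding]:
    """Option (a): reuse the prior severity unless a reason for the change is given."""
    anchored = []
    for finding in current:
        match = next((p for p in prior if same_finding(p, finding)), None)
        if match and finding.severity != match.severity and not finding.change_reason:
            # No articulated reason for the change, so keep the earlier rating.
            finding = replace(finding, severity=match.severity)
        anchored.append(finding)
    return anchored


def prior_findings_context(prior: List[Finding]) -> str:
    """Option (b): render prior findings as text to include in the review prompt."""
    lines = ["Prior findings on this PR (keep severity unless the code materially changed):"]
    for p in prior:
        lines.append(f"- {p.file}:{p.start_line}-{p.end_line} [{p.category}] severity={p.severity}")
    return "\n".join(lines)
```

With this sketch, a recurring finding on `e2e-tests/go.mod` that was previously rated low would stay low in a later run unless the new finding carries a `change_reason`. A real implementation would need to parse prior findings from the sticky comment, which is not shown here.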

Validation criteria

On the next PR where the review agent produces the same finding across 3+ review iterations, the severity rating should be consistent (same level) or, if changed, accompanied by an explicit justification referencing what changed. Verify by auditing severity labels on 5 PRs with multiple review iterations and confirming <10% unjustified severity drift for recurring findings.
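One way the audit could be scripted is sketched below. It assumes each recurring finding already has a stable identifier and that a severity change counts as justified when the run records any justification text. The `Observation` type, its fields, and the definition of drift rate (unjustified severity changes divided by consecutive-run pairs) are assumptions for illustration, not an agreed metric.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Optional


class Observation(NamedTuple):
    key: str                             # hypothetical stable ID for a recurring finding
    run: int                             # review iteration number on the PR
    severity: str
    justification: Optional[str] = None  # explanation given if the severity changed


def unjustified_drift_rate(observations: Iterable[Observation]) -> float:
    """Share of consecutive-run pairs where severity changed without justification."""
    by_key = defaultdict(list)
    for obs in observations:
        by_key[obs.key].append(obs)

    transitions = 0
    unjustified = 0
    for history in by_key.values():
        history.sort(key=lambda o: o.run)
        # Compare each occurrence of a finding with its previous occurrence.
        for prev, cur in zip(history, history[1:]):
            transitions += 1
            if cur.severity != prev.severity and not cur.justification:
                unjustified += 1

    # Findings seen only once contribute no transitions.
    return unjustified / transitions if transitions else 0.0
```

Under this definition, the criterion above would pass when `unjustified_drift_rate(...)` over the audited PRs is below `0.10`.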


Generated by retro agent from konflux-ci/build-service#531
