feat(seer): Suppress re-triage of skipped issues in night shift#114915
Open
Persist SKIP verdicts to a Redis cache keyed by group id with a 3.5-day TTL, then exclude those ids from candidate selection on subsequent nightly runs. Stops the agent from repeatedly re-evaluating the same issues it already classified as not worth fixing, saving compute and quota. The TTL is padded past 3 days so nightly-run jitter cannot expire a key right at the boundary; this guarantees the next 3 runs suppress the issue. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
The old name suggested filtering out recently-skipped ids, but the function actually returns the subset that ARE recently skipped. Rename so the name matches the return value. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Mark the group via mark_skipped() before the run so the test exercises the real read path through Redis instead of stubbing recently_skipped. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Comment on lines +195 to +197

    for v in triage_response.verdicts:
        if v.group_id in groups_by_id and v.action == TriageAction.SKIP:
            mark_skipped(v.group_id)
Contributor
Bug: A Redis connection failure in mark_skipped after the main agent logic will cause an unhandled exception, discarding all previously computed triage results.
Severity: MEDIUM
Suggested Fix
Wrap the mark_skipped call in its own try/except block to catch potential Redis connection errors. Log the error for observability but do not re-raise it, allowing the function to return the successfully computed triage results. This ensures that failures in the caching optimization do not cause the loss of primary results.
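A minimal sketch of the suggested fix follows. It is an illustration and not the code in this PR: `record_skips`, the `Verdict` dataclass, the stand-in `TriageAction` enum, and the log message are all hypothetical, and `mark_skipped` is passed in as a callable so the example runs without Redis.

```python
import logging
from dataclasses import dataclass
from enum import Enum

logger = logging.getLogger(__name__)


class TriageAction(Enum):
    # Stand-in for the real enum; only the members needed here.
    SKIP = "skip"
    AUTOFIX = "autofix"


@dataclass
class Verdict:
    # Stand-in for one entry of triage_response.verdicts.
    group_id: int
    action: TriageAction


def record_skips(verdicts, groups_by_id, mark_skipped):
    """Best-effort persistence of SKIP verdicts.

    Each call is guarded separately, so one cache failure neither aborts
    the loop nor propagates to the caller. Returns the number of failures.
    """
    failures = 0
    for v in verdicts:
        if v.group_id in groups_by_id and v.action == TriageAction.SKIP:
            try:
                mark_skipped(v.group_id)
            except Exception:
                # Log for observability, but do not re-raise: the cache is
                # an optimization and must not discard the triage results.
                logger.exception(
                    "mark_skipped failed", extra={"group_id": v.group_id}
                )
                failures += 1
    return failures
```

Catching the broad `Exception` is a deliberate choice for this sketch; a real change might narrow it to the Redis client's connection error types.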
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: src/sentry/tasks/seer/night_shift/agentic_triage.py#L195-L197
Potential issue: The `mark_skipped` function is called outside the `try/except` block
that wraps the expensive Seer agent interactions. If a Redis connection error occurs
during this call, the exception is not handled locally. It propagates up to the
`run_night_shift_execution` function, which then marks the entire run as failed and
discards all the triage results (e.g., `AUTOFIX`, `ROOT_CAUSE_ONLY`) that were
successfully generated by the agent. This wastes significant LLM computation due to a
failure in a non-critical optimization step.
Persist SKIP verdicts from night-shift triage to Redis with a 3.5-day TTL, then exclude those group ids from candidate selection on subsequent nightly runs. This stops the agent from repeatedly re-evaluating issues it already classified as not worth fixing. The keys expire rather than persisting indefinitely because new information may arrive within a few days (better tag distribution, a new recommended event, etc.), so we do eventually want to re-run triage on the issue.
The TTL is padded past 3 days so nightly-run jitter cannot expire a key right at the boundary, guaranteeing suppression for the next 3 runs.
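The TTL arithmetic can be sketched as below. This is an illustration under stated assumptions, not the PR's implementation: `FakeTTLStore` is an in-memory stand-in for Redis with an explicit clock, the key format and `select_candidates` are hypothetical, and the signatures of `mark_skipped` and `recently_skipped` here take extra `store` and `now` arguments only so the example is self-contained.

```python
from datetime import timedelta

# 3.5 days: padded past 3 days so a late nightly run is still suppressed.
SKIP_TTL = timedelta(days=3, hours=12)


class FakeTTLStore:
    """In-memory stand-in for a Redis key with an expiry (assumption)."""

    def __init__(self):
        self._expires_at = {}

    def set(self, key, ttl, now):
        self._expires_at[key] = now + ttl.total_seconds()

    def exists(self, key, now):
        expiry = self._expires_at.get(key)
        return expiry is not None and now < expiry


def _key(group_id):
    # Hypothetical key format; the real one is not shown in this PR text.
    return f"night_shift:skipped:{group_id}"


def mark_skipped(store, group_id, now):
    store.set(_key(group_id), SKIP_TTL, now)


def recently_skipped(store, group_ids, now):
    # Returns the subset of ids that ARE recently skipped.
    return {g for g in group_ids if store.exists(_key(g), now)}


def select_candidates(store, group_ids, now):
    skipped = recently_skipped(store, group_ids, now)
    return [g for g in group_ids if g not in skipped]


DAY = 24 * 60 * 60
store = FakeTTLStore()
mark_skipped(store, 101, now=0)

# Third nightly run, arriving two hours late: still suppressed.
print(select_candidates(store, [101, 102], now=3 * DAY + 2 * 60 * 60))
# Fourth nightly run: the key has expired, so the issue is triaged again.
print(select_candidates(store, [101, 102], now=4 * DAY))
```

With an exact 3-day TTL, the late third run in this example would land after the key had expired; the extra half day is what keeps it suppressed.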