Slow podlatency #111
Draft
afcollins wants to merge 8 commits into
Adds a `node-rca <prow_url>` Slack command that deterministically identifies the slowest pod from a run's podLatencyMeasurement JSON, downloads the outlier node's gzipped systemd journal, and returns a structured markdown RCA report.
- node_log_analyzer.py: parses journal for PLEG detection lag, housekeeping overruns, PLEG silence gaps, SLO distribution, and peak concurrency; also usable as a standalone CLI (--log, --pod, --json)
- log_summarizer.py: adds GCS discovery helpers (find_pod_latency_file, download_pod_latency_file, parse_slowest_pod, download_node_journal) that walk variable artifact paths and decompress gzip journals without .gz suffix
- log_analyzer.py: adds run_node_rca_analysis() orchestrating the full pipeline
- slack_fetcher.py: detects "node-rca" keyword and dispatches to RCA pipeline
Co-Authored-By: Claude Sonnet 4.6
Signed-off-by: Andrew Collins <ancollin@redhat.com>
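The two less obvious steps in this commit, picking the slowest pod deterministically and reading a gzip journal that lacks a `.gz` suffix, could be sketched roughly as follows. This is an illustration, not the PR's code: `slowest_pod` and `decode_journal` are hypothetical names, and the record fields (`podName`, `nodeName`, `podReadyLatency`) are assumed to match the podLatencyMeasurement documents.

```python
import gzip
import json


def slowest_pod(latency_json: str) -> dict:
    """Return the record with the highest podReadyLatency.

    Assumes the file is a JSON array of per-pod records. Ties are broken
    by podName so repeated runs always select the same pod.
    """
    records = json.loads(latency_json)
    if not records:
        raise ValueError("no pod latency records found")
    return max(records, key=lambda r: (r["podReadyLatency"], r["podName"]))


def decode_journal(raw: bytes) -> str:
    """Decode a journal, decompressing when the gzip magic bytes are present.

    Sniffing the first two bytes avoids relying on a .gz file suffix.
    """
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8", errors="replace")
```

Checking the magic bytes rather than the file name means the same function handles both compressed and plain journals.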
Replace manual `node-rca` Slack keyword with automatic detection: when orion changepoint analysis reports a podReadyLatency_P99 regression in errors_list or full_errors_for_file, the node journal RCA runs automatically after job-history is posted in the thread.
Co-Authored-By: Claude Sonnet 4.6
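The automatic trigger described in this commit amounts to a substring check over the orion error entries. A minimal sketch, assuming both `errors_list` and `full_errors_for_file` are iterables of strings (their real shapes are not shown in this PR) and using a hypothetical function name:

```python
from typing import Iterable

# Metric name taken from the commit message above.
REGRESSION_METRIC = "podReadyLatency_P99"


def should_run_node_rca(
    errors_list: Iterable[str] = (),
    full_errors_for_file: Iterable[str] = (),
) -> bool:
    """Return True when any error entry mentions the P99 podReady metric."""
    entries = (*errors_list, *full_errors_for_file)
    return any(REGRESSION_METRIC in str(entry) for entry in entries)
```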
15 tests covering:
- pod UID extraction, timeline ordering, PLEG lag calculation (pause + app), housekeeping overrun count/peak, PLEG silence gaps, SLO stats
- format_result_markdown section presence, lag/overrun values in output, root cause and KEP-3386 in summary
- test_slack_block_structure: verifies the exact two-block Slack payload (mrkdwn header + markdown content) produced when posting an RCA report, including key content visible to users (lag, overrun count, root cause)
Co-Authored-By: Claude Sonnet 4.6
…log_analyzer
Resolves ruff F401 (unused imports) and E741 (ambiguous variable names) introduced by the test file. All pre-existing lint errors are unrelated.
Co-Authored-By: Claude Sonnet 4.6
… lookup
Three bugs fixed:
- find_pod_latency_file: `e.rstrip("/").endswith("/")` was always False;
fixed to `e.endswith("/")` for both qe_dirs and metrics_dirs filters
- find_pod_latency_file: add step_hint param to prefer the workload dir
matching the orion regression (e.g. "node-density-cni") over unrelated ones
- run_node_rca_analysis: accept and forward step_hint
- slack_fetcher: extract workload name from "[workload]" bracket in errors_list
and pass as step_hint so the right metrics file is found
Co-Authored-By: Claude Sonnet 4.6
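The fixes in this commit can be illustrated with a small sketch. The helper names (`workload_from_error`, `pick_dir`) and the listing format are assumptions for illustration, not the PR's actual code. The first comment shows why the original filter never matched: `rstrip("/")` removes every trailing slash, so the result can never end with one.

```python
import re
from typing import Iterable, Optional

# The original bug: this expression is False for every string.
#   "artifacts/metrics/".rstrip("/").endswith("/")


def workload_from_error(entry: str) -> Optional[str]:
    """Extract the name inside the first "[workload]" bracket, if any."""
    match = re.search(r"\[([^\]]+)\]", entry)
    return match.group(1) if match else None


def pick_dir(entries: Iterable[str], step_hint: Optional[str] = None) -> Optional[str]:
    """Pick a directory entry, preferring one whose path contains step_hint.

    Directory entries are assumed to end with "/" in the listing.
    """
    dirs = [e for e in entries if e.endswith("/")]
    if step_hint:
        for d in dirs:
            if step_hint in d:
                return d
    # Fall back to the first directory when no hint is given or none matches.
    return dirs[0] if dirs else None
```

Falling back to the first directory keeps the lookup working when no hint is available.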
…message
Avoids Slack's 3000-char block limit. Now posts a brief header message then uploads the full report as node-rca.md via files_upload_v2. Test updated to assert file content rather than block structure.
Co-Authored-By: Claude Sonnet 4.6
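The header-then-upload flow could look roughly like the sketch below. It assumes `client` is a `slack_sdk` `WebClient` (or any object exposing the same `chat_postMessage` and `files_upload_v2` methods); the function name, header text, and title are hypothetical.

```python
def post_rca_report(client, channel: str, thread_ts: str, report_md: str) -> None:
    """Post a short header in the thread, then attach the full report as a file.

    Uploading the report as a file sidesteps the per-block text limit.
    """
    client.chat_postMessage(
        channel=channel,
        thread_ts=thread_ts,
        text="Node RCA complete. Full report attached as node-rca.md.",
    )
    client.files_upload_v2(
        channel=channel,
        thread_ts=thread_ts,
        filename="node-rca.md",
        title="Node RCA report",
        content=report_md,
    )
```

Passing the client in as a parameter lets a test substitute a stub and assert on the uploaded file content, in line with the test change described in the commit.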
Co-Authored-By: Claude Opus 4.6
When a workload shows unusually high podReady latency, this bot command fetches the slowest pod, analyzes the journal of the node it ran on, and produces a breakdown of what was happening on that node over the slowest pod's lifetime.