Slow podlatency by afcollins · Pull Request #111 · redhat-performance/BugZooka

afcollins · 2026-05-29T22:37:53Z

When a workload shows unusually high podReady latency, this bot command fetches the slowest pod, analyzes the log of the node it ran on, and produces a breakdown of what was happening on that node over the slowest pod's startup interval.
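As a rough illustration of the "fetch the slowest pod" step, the sketch below picks the record with the largest latency from a list of pod latency measurements. The field names (`podName`, `nodeName`, `podReadyLatency`) and the sample values are assumptions made for the example; this is not the PR's `parse_slowest_pod` implementation.

```python
import json


def slowest_pod(records):
    """Return the record with the highest podReadyLatency, or None if there is none."""
    # Skip records that lack a numeric latency (field name is an assumption).
    candidates = [
        r for r in records
        if isinstance(r.get("podReadyLatency"), (int, float))
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["podReadyLatency"])


# Hypothetical sample data, invented for illustration only.
sample = json.loads("""
[
  {"podName": "pod-a", "nodeName": "worker-1", "podReadyLatency": 2100},
  {"podName": "pod-b", "nodeName": "worker-2", "podReadyLatency": 48000},
  {"podName": "pod-c", "nodeName": "worker-1", "podReadyLatency": 1900}
]
""")

worst = slowest_pod(sample)
print(worst["podName"], worst["nodeName"], worst["podReadyLatency"])
```

The node named in the selected record is the one whose journal would be analyzed next.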

Adds a `node-rca <prow_url>` Slack command that deterministically identifies the slowest pod from a run's podLatencyMeasurement JSON, downloads the outlier node's gzipped systemd journal, and returns a structured markdown RCA report.

- node_log_analyzer.py: parses journal for PLEG detection lag, housekeeping overruns, PLEG silence gaps, SLO distribution, and peak concurrency; also usable as a standalone CLI (--log, --pod, --json)
- log_summarizer.py: adds GCS discovery helpers (find_pod_latency_file, download_pod_latency_file, parse_slowest_pod, download_node_journal) that walk variable artifact paths and decompress gzip journals without .gz suffix
- log_analyzer.py: adds run_node_rca_analysis() orchestrating the full pipeline
- slack_fetcher.py: detects "node-rca" keyword and dispatches to RCA pipeline

Co-Authored-By: Claude Sonnet 4.6
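Since the journals are gzip-compressed but carry no `.gz` suffix, one way to handle them is to check the gzip magic bytes instead of the file name. The following is a minimal sketch of that idea; the helper name `read_journal_text` is hypothetical and is not taken from the PR.

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def read_journal_text(raw: bytes) -> str:
    """Decode journal bytes, decompressing first if they look like gzip."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # Journals can contain stray bytes; replace them rather than fail.
    return raw.decode("utf-8", errors="replace")


compressed = gzip.compress(b"kubelet: pod sandbox created\n")
print(read_journal_text(compressed), end="")
print(read_journal_text(b"plain text journal line\n"), end="")
```

Checking content rather than the suffix means the same helper works for both compressed and uncompressed artifacts.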

Signed-off-by: Andrew Collins <ancollin@redhat.com>

Replace manual `node-rca` Slack keyword with automatic detection: when orion changepoint analysis reports a podReadyLatency_P99 regression in errors_list or full_errors_for_file, the node journal RCA runs automatically after job-history is posted in the thread.

Co-Authored-By: Claude Sonnet 4.6
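The automatic trigger could be sketched as a simple scan of both error collections for the metric name. This assumes `errors_list` and `full_errors_for_file` are iterables whose entries can be stringified; the actual shapes in BugZooka may differ, and the function name here is hypothetical.

```python
def has_pod_ready_regression(errors_list, full_errors_for_file,
                             metric="podReadyLatency_P99"):
    """Return True if any error entry mentions the given metric name."""
    # Treat None as empty so callers can pass through missing fields.
    entries = list(errors_list or []) + list(full_errors_for_file or [])
    return any(metric in str(entry) for entry in entries)


# Hypothetical error strings, invented for illustration.
errors = ["[node-density-cni] podReadyLatency_P99 regression detected"]
print(has_pod_ready_regression(errors, None))
print(has_pod_ready_regression(["cpu usage regression"], []))
```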

15 tests covering:

- pod UID extraction, timeline ordering, PLEG lag calculation (pause + app), housekeeping overrun count/peak, PLEG silence gaps, SLO stats
- format_result_markdown section presence, lag/overrun values in output, root cause and KEP-3386 in summary
- test_slack_block_structure: verifies the exact two-block Slack payload (mrkdwn header + markdown content) produced when posting an RCA report, including key content visible to users (lag, overrun count, root cause)

Co-Authored-By: Claude Sonnet 4.6

…log_analyzer

Resolves ruff F401 (unused imports) and E741 (ambiguous variable names) introduced by the test file. All pre-existing lint errors are unrelated.

Co-Authored-By: Claude Sonnet 4.6

… lookup

Three bugs fixed:

- find_pod_latency_file: `e.rstrip("/").endswith("/")` was always False; fixed to `e.endswith("/")` for both qe_dirs and metrics_dirs filters
- find_pod_latency_file: add step_hint param to prefer the workload dir matching the orion regression (e.g. "node-density-cni") over unrelated ones
- run_node_rca_analysis: accept and forward step_hint
- slack_fetcher: extract workload name from "[workload]" bracket in errors_list and pass as step_hint so the right metrics file is found

Co-Authored-By: Claude Sonnet 4.6
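To see why the original filter never matched, note that `rstrip("/")` removes every trailing slash, so the result can never end with one. The sketch below demonstrates this and shows one plausible way to pull a workload name out of a "[workload]" bracket. The listing entries, the regex, and the name `extract_step_hint` are assumptions for illustration, not code from the PR.

```python
import re

# Hypothetical directory listing entries.
entries = ["qe/", "metrics/", "build-log.txt"]

# Buggy filter: after rstrip("/") no string can end with "/".
old_dirs = [e for e in entries if e.rstrip("/").endswith("/")]

# Fixed filter: keep entries that are directory prefixes.
new_dirs = [e for e in entries if e.endswith("/")]


def extract_step_hint(error_line):
    """Return the text inside the first [...] bracket, or None if absent."""
    match = re.search(r"\[([^\]]+)\]", error_line)
    return match.group(1) if match else None


print(old_dirs)
print(new_dirs)
print(extract_step_hint("[node-density-cni] podReadyLatency_P99 regression"))
```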

…message

Avoids Slack's 3000-char block limit. Now posts a brief header message then uploads the full report as node-rca.md via files_upload_v2. Test updated to assert file content rather than block structure.

Co-Authored-By: Claude Sonnet 4.6
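The header-plus-upload approach might look roughly like the sketch below. It assumes `client` behaves like a `slack_sdk` `WebClient` (which provides `chat_postMessage` and `files_upload_v2`); the function name, header text, and title are hypothetical, and a small stand-in client is used so the example runs without Slack credentials.

```python
def post_rca_report(client, channel, thread_ts, report_md):
    """Post a short header in the thread, then attach the full report as a file."""
    client.chat_postMessage(
        channel=channel,
        thread_ts=thread_ts,
        text="Node RCA complete; full report attached as node-rca.md",
    )
    # Uploading as a file sidesteps the per-block text length limit.
    client.files_upload_v2(
        channel=channel,
        thread_ts=thread_ts,
        content=report_md,
        filename="node-rca.md",
        title="Node RCA report",
    )


class RecordingClient:
    """Stand-in for a Slack client that records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def chat_postMessage(self, **kwargs):
        self.calls.append(("chat_postMessage", kwargs))

    def files_upload_v2(self, **kwargs):
        self.calls.append(("files_upload_v2", kwargs))


fake = RecordingClient()
post_rca_report(fake, "C123", "1700000000.000100", "# Node RCA\n" + "x" * 5000)
print([name for name, _ in fake.calls])
```

A test in this style can assert on the uploaded `content` directly, which matches the PR's note about asserting file content rather than block structure.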

Co-Authored-By: Claude Opus 4.6

afcollins added 8 commits May 29, 2026 15:24

satisfying linter

6b7132b

Signed-off-by: Andrew Collins <ancollin@redhat.com>

fix: remove unused imports and ambiguous variable names in test_node_…

0795b91


fix: drop gsutil -m and -r flags from single-file GCS download

c249dc6

Co-Authored-By: Claude Opus 4.6
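For the gsutil change above, a single-object download needs neither the top-level `-m` (parallel) option nor the `-r` (recursive) option of `cp`. A minimal sketch of such a call follows; the helper names and the bucket path are hypothetical, and the download function is defined but not invoked here because it requires `gsutil` on the PATH.

```python
import subprocess


def gsutil_cp_command(gcs_uri, dest_path):
    """Build the argv for copying one GCS object to a local path."""
    # No -m and no -r: a single file needs neither parallelism nor recursion.
    return ["gsutil", "cp", gcs_uri, dest_path]


def download_single_file(gcs_uri, dest_path):
    """Run the copy and raise CalledProcessError if gsutil exits non-zero."""
    subprocess.run(gsutil_cp_command(gcs_uri, dest_path), check=True)


# Hypothetical bucket path, for illustration only.
print(gsutil_cp_command("gs://example-bucket/run/podLatency.json", "/tmp/podLatency.json"))
```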

afcollins wants to merge 8 commits into redhat-performance:main from afcollins:slow-podlatency
