Slow podlatency #111

Draft
afcollins wants to merge 8 commits into redhat-performance:main from afcollins:slow-podlatency

@afcollins

When a workload has an unusually high podReady latency, this bot command fetches the slowest pod, analyzes the journal of the node it ran on, and produces a breakdown of what was happening on that node while the pod was starting.

afcollins added 8 commits May 29, 2026 15:24
Adds a `node-rca <prow_url>` Slack command that deterministically identifies
the slowest pod from a run's podLatencyMeasurement JSON, downloads the outlier
node's gzipped systemd journal, and returns a structured markdown RCA report.
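
As a rough illustration of the "deterministically identifies the slowest pod" step, here is a minimal sketch. It assumes the podLatencyMeasurement JSON is a flat list of records carrying `podName`, `nodeName`, and a numeric `podReadyLatency`; those field names and the helper name `slowest_pod` are assumptions for illustration, not taken from this PR's `parse_slowest_pod`.

```python
import json
from typing import Any


def slowest_pod(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Return the record with the largest podReadyLatency (hypothetical helper)."""
    # Ignore records that lack a numeric latency instead of failing on them.
    candidates = [
        r for r in records if isinstance(r.get("podReadyLatency"), (int, float))
    ]
    if not candidates:
        raise ValueError("no records with a numeric podReadyLatency")
    # The pod name is part of the key so that ties resolve the same way on every run.
    return max(candidates, key=lambda r: (r["podReadyLatency"], r.get("podName", "")))


# Made-up sample data; real files would come from the run's artifacts.
sample = json.loads("""
[
  {"podName": "pod-a", "nodeName": "worker-1", "podReadyLatency": 2100},
  {"podName": "pod-b", "nodeName": "worker-2", "podReadyLatency": 48700},
  {"podName": "pod-c", "nodeName": "worker-1", "podReadyLatency": 3050}
]
""")

worst = slowest_pod(sample)
print(worst["podName"], worst["nodeName"], worst["podReadyLatency"])
```

The `nodeName` of the returned record is what would tell the later stages which node journal to download.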

- node_log_analyzer.py: parses journal for PLEG detection lag, housekeeping
  overruns, PLEG silence gaps, SLO distribution, and peak concurrency; also
  usable as a standalone CLI (--log, --pod, --json)
- log_summarizer.py: adds GCS discovery helpers (find_pod_latency_file,
  download_pod_latency_file, parse_slowest_pod, download_node_journal) that
  walk variable artifact paths and decompress gzip journals without .gz suffix
- log_analyzer.py: adds run_node_rca_analysis() orchestrating the full pipeline
- slack_fetcher.py: detects "node-rca" keyword and dispatches to RCA pipeline

Co-Authored-By: Claude Sonnet 4.6
Signed-off-by: Andrew Collins <ancollin@redhat.com>
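
The log_summarizer.py bullet above mentions journals that are gzip-compressed but stored without a `.gz` suffix. One common way to handle that, sketched here under the assumption that the file has already been downloaded into memory, is to check the two-byte gzip magic number rather than trust the file name. The function name `decode_journal` is hypothetical and is not the PR's `download_node_journal`.

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def decode_journal(raw: bytes) -> str:
    """Return journal text, decompressing first if the payload is gzip."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # Journals can contain stray bytes; replace them rather than fail.
    return raw.decode("utf-8", errors="replace")


compressed = gzip.compress(b"kubelet: example line\n")
print(decode_journal(compressed) == decode_journal(b"kubelet: example line\n"))
```

Checking the content instead of the suffix means the same code path works for both compressed and plain-text artifacts.
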
Replace manual `node-rca` Slack keyword with automatic detection: when
orion changepoint analysis reports a podReadyLatency_P99 regression in
errors_list or full_errors_for_file, the node journal RCA runs automatically
after job-history is posted in the thread.

Co-Authored-By: Claude Sonnet 4.6
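
The automatic trigger described in this commit could look roughly like the sketch below. It assumes `errors_list` and `full_errors_for_file` are iterables of plain strings and that a regression is recognizable by the metric name appearing in an entry; the real data shapes and the function name `should_run_node_rca` are assumptions.

```python
from typing import Iterable

TRIGGER_METRIC = "podReadyLatency_P99"


def should_run_node_rca(
    errors_list: Iterable[str],
    full_errors_for_file: Iterable[str] = (),
) -> bool:
    """Return True if either source mentions the trigger metric (hypothetical check)."""
    return any(
        TRIGGER_METRIC in entry
        for source in (errors_list, full_errors_for_file)
        for entry in source
    )


# Made-up entries for illustration only.
print(should_run_node_rca(["[node-density-cni] podReadyLatency_P99 regression"]))
print(should_run_node_rca(["[node-density-cni] cpu regression"]))
```
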
15 tests covering:
- pod UID extraction, timeline ordering, PLEG lag calculation (pause + app),
  housekeeping overrun count/peak, PLEG silence gaps, SLO stats
- format_result_markdown section presence, lag/overrun values in output,
  root cause and KEP-3386 in summary
- test_slack_block_structure: verifies the exact two-block Slack payload
  (mrkdwn header + markdown content) produced when posting an RCA report,
  including key content visible to users (lag, overrun count, root cause)

Co-Authored-By: Claude Sonnet 4.6
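
To give a flavor of what one of the checks listed above might exercise, here is a minimal sketch of silence-gap detection. It assumes a "silence gap" means two consecutive event timestamps that are further apart than some threshold; that definition, the 5-second threshold, and the name `silence_gaps` are assumptions, not the logic in node_log_analyzer.py.

```python
from datetime import datetime, timedelta


def silence_gaps(
    timestamps: list[datetime], threshold: timedelta
) -> list[tuple[datetime, datetime]]:
    """Return (start, end) pairs where consecutive events are more than threshold apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > threshold]


base = datetime(2026, 1, 1, 12, 0, 0)
events = [base + timedelta(seconds=s) for s in (0, 1, 2, 9, 10)]
gaps = silence_gaps(events, timedelta(seconds=5))

# A test in the same spirit would assert on the single 7-second gap.
assert gaps == [(base + timedelta(seconds=2), base + timedelta(seconds=9))]
```
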
…log_analyzer

Resolves ruff F401 (unused imports) and E741 (ambiguous variable names)
introduced by the test file. All pre-existing lint errors are unrelated.

Co-Authored-By: Claude Sonnet 4.6
… lookup

Three bugs fixed:
- find_pod_latency_file: `e.rstrip("/").endswith("/")` was always False;
  fixed to `e.endswith("/")` for both qe_dirs and metrics_dirs filters
- find_pod_latency_file: add step_hint param to prefer the workload dir
  matching the orion regression (e.g. "node-density-cni") over unrelated ones
- run_node_rca_analysis: accept and forward step_hint
- slack_fetcher: extract workload name from "[workload]" bracket in errors_list
  and pass as step_hint so the right metrics file is found

Co-Authored-By: Claude Sonnet 4.6
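
The first fix is easy to reproduce in isolation: `rstrip("/")` removes every trailing slash, so the result can never end with one. The sketch below shows that, plus one possible way to pull a workload name out of a bracketed prefix. The sample paths, the error-line format, and the helper name `workload_from_error` are made up for illustration.

```python
import re
from typing import Optional

# Made-up listing entries; directories end with "/".
entries = ["artifacts/qe/", "artifacts/build-log.txt", "artifacts/metrics/"]

# Buggy filter: stripping the slash first means endswith("/") is never True.
buggy = [e for e in entries if e.rstrip("/").endswith("/")]
# Fixed filter: test the entry as listed.
fixed = [e for e in entries if e.endswith("/")]


def workload_from_error(line: str) -> Optional[str]:
    """Return the text inside the first [...] pair, or None if there is none."""
    match = re.search(r"\[([^\[\]]+)\]", line)
    return match.group(1) if match else None


print(buggy, fixed)
print(workload_from_error("[node-density-cni] podReadyLatency_P99 regression"))
```
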
…message

Avoids Slack's 3000-char block limit. Now posts a brief header message then
uploads the full report as node-rca.md via files_upload_v2. Test updated to
assert file content rather than block structure.

Co-Authored-By: Claude Sonnet 4.6
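
A minimal sketch of the "brief header, then file upload" flow follows. It assumes a slack_sdk-style client exposing `chat_postMessage` and `files_upload_v2`; the header wording, the title, and the `RecordingClient` stand-in are assumptions so the example runs without network access or a Slack token.

```python
from typing import Any

SECTION_TEXT_LIMIT = 3000  # the per-block text limit cited in the commit message


def post_rca_report(client: Any, channel: str, thread_ts: str, report_md: str) -> None:
    """Post a short header in the thread, then attach the full report as a file."""
    header = "Node RCA report attached as node-rca.md"  # assumed wording
    client.chat_postMessage(channel=channel, thread_ts=thread_ts, text=header)
    # The report travels as file content, so its length is not bound by block limits.
    client.files_upload_v2(
        channel=channel,
        thread_ts=thread_ts,
        content=report_md,
        filename="node-rca.md",
        title="Node RCA report",
    )


class RecordingClient:
    """Stand-in that records calls instead of talking to Slack."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, dict[str, Any]]] = []

    def chat_postMessage(self, **kwargs: Any) -> None:
        self.calls.append(("chat_postMessage", kwargs))

    def files_upload_v2(self, **kwargs: Any) -> None:
        self.calls.append(("files_upload_v2", kwargs))


client = RecordingClient()
post_rca_report(client, "C0000000", "1700000000.000100", "# RCA\n" + "x" * 5000)
print([name for name, _ in client.calls])
```

Asserting on the uploaded `content` rather than on block structure matches the direction the updated test is described as taking.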