Add experimental flake-check claude skill to Gen AI package #7350

Open
jharan1 wants to merge 13 commits into opendatahub-io:main from jharan1:flake-check

Conversation

Contributor

@jharan1 jharan1 commented Apr 22, 2026

https://redhat.atlassian.net/browse/RHOAIENG-58234

Description

This PR adds an experimental Claude skill to help with scanning for flaky tests in our GitHub PRs, co-authored with Claude.

Regarding the location of this skill: the Gen AI team is using /packages/gen-ai/.claude/skills for initial testing of skills; later they may be moved to the root .claude folder if they prove useful to the wider team.

How Has This Been Tested?

  • Some initial manual runs of the skill (with Claude Sonnet 4.6); more testing and feedback from others can follow

Test Impact

N/A

Request review criteria:

Self checklist (all need to be checked):

  • The developer has manually tested the changes and verified that the changes work
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has added tests or explained why testing cannot be added (unit or cypress tests for related changes)
  • The code follows our Best Practices (React coding standards, PatternFly usage, performance considerations)

If you have UI changes:

  • Included any necessary screenshots or gifs if it was a UI change.
  • Included tags to the UX team if it was a UI/UX change.

After the PR is posted & before it merges:

  • The developer has tested their solution on a cluster by using the image produced by the PR to main

Summary by CodeRabbit

  • New Features

    • Added a /flake-check command to investigate PR CI failures: fetch failing/pending checks, retrieve job logs, extract failing tests, detect reruns on the same commit, and aggregate recurring check/test patterns. Supports deep analysis and file-scoped scans.
    • Outputs structured JSON reports with confidence labels and actionable follow-ups (rerun suggestions, scan/file cross-checks).
  • Documentation

    • Added user-facing documentation describing commands, flags (deep, scan, file, time windows), report format, and recommended next steps.

Contributor

openshift-ci Bot commented Apr 22, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress This PR is in WIP state label Apr 22, 2026
Contributor

coderabbitai Bot commented Apr 22, 2026

📝 Walkthrough

Walkthrough

Adds a new Claude skill flake-check with documentation and three CLI Python scripts under packages/gen-ai/.claude/skills/flake-check/:

  • fetch_pr_failures.py: collects PR metadata and check-run summaries via gh, classifies checks (failing/pending/passing), resolves missing job IDs, and optionally detects rerun-on-same-SHA patterns.
  • fetch_test_failures.py: downloads and cleans GH Actions logs, detects test framework (cypress, jest, go), extracts failing tests and summary, and reports parse warnings and queue times.
  • scan_prs.py: scans PRs over a date/window, aggregates recurring failing check/test patterns, supports --deep and --file modes (which fetch logs and test-level data), and emits structured JSON reports.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Security concerns

  • CWE-78 (OS Command Injection): many run_gh() subprocess invocations. Ensure subprocess calls use argument lists (no shell=True), validate/whitelist PR/run/job IDs and repo names before passing to subprocess.
  • CWE-20 (Improper Input Validation): permissive parsing of detailsUrl, date windows, and CLI args. Add strict validators and canonicalizers for PR numbers, ISO/relative dates, run/job IDs, and filenames.
  • CWE-200 / CWE-532 (Sensitive Data Exposure): logs and GH outputs may include secrets or internal URLs. Redact tokens, workspace paths, and secrets from outputs and errors.
  • CWE-502 (Deserialization of Untrusted Data): JSON from gh/API is consumed without schema validation. Validate shapes (required keys/types) and fail-fast on mismatches.
  • CWE-400 (Uncontrolled Resource Consumption): scan_prs.py performs parallel per-PR API/log fetches. Add concurrency limits, rate-limit handling, exponential backoff, and timeouts.
  • CWE-732 (Incorrect Permission Assignment for Critical Resource): exposing full logs and internal job metadata may leak sensitive test data. Limit fields returned, consider opt-in for deep fetches, and document retention/access controls.

Actionable fixes: replace shell invocations with subprocess.run([...]) or GH API client, add JSON schema validation, implement input whitelists/regex checks, cap concurrency and add retries/backoff, and sanitize/redact log outputs.
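As a hedged sketch of the first two fixes (argument-list subprocess calls plus input whitelisting), assuming helper names like validate_pr_number and run_gh that may not match the scripts exactly:

```python
import json
import re
import subprocess

def validate_pr_number(value: str) -> str:
    # Whitelist: PR numbers are plain digits; reject anything else up front.
    if not re.fullmatch(r"\d{1,10}", value):
        raise ValueError(f"invalid PR number: {value!r}")
    return value

def run_gh(args: list[str], timeout: int = 30) -> dict:
    # Argument list (never shell=True), so IDs can never be shell-interpreted.
    result = subprocess.run(
        ["gh", *args], capture_output=True, text=True, timeout=timeout
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return json.loads(result.stdout)
```

The validator runs before any ID reaches subprocess, so an injection attempt fails fast with a ValueError instead of ever touching the gh invocation.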

Code quality issues

  • Mixed error semantics: scripts sometimes exit with JSON errors and sometimes return empty results. Standardize to structured error objects plus non-zero exit codes and document CLI exit behavior.
  • Brittle URL/ID extraction: regex-based extract_ids and job resolution will break on URL format changes. Prefer GitHub API fields (GraphQL/REST) or stricter parsers; add unit tests for edge cases.
  • Fragile log-cleaning heuristics: ANSI stripping and GH-prefix removal assume fixed prefixes and may drop context. Centralize cleaning logic, parameterize prefix patterns, and add tests.
  • Heuristic-heavy framework detection: detection and parsing are intermingled. Separate detection from parsing, return parser confidence scores, and surface parse_warnings with standardized codes.
  • Loose typing and ad-hoc dicts: many functions return untyped dicts/lists. Introduce TypedDict/dataclasses for PR summaries, check entries, and test-failure records and add type annotations.
  • Unbounded/unnamed parallelism: use a controlled ThreadPool/ProcessPool with configurable concurrency and backoff on GH API failures; add metrics for failures/timeouts.
  • No structured logging/observability: add structured logs with levels, durations, and parse-warning categories; emit metrics for scanned PRs, fetched logs, and parse success rate.
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Title check (✅ Passed): Title clearly summarizes the main change: adding an experimental Claude skill for flake-check functionality to the Gen AI package.
  • Description check (✅ Passed): PR description covers the objective (experimental flake-check Claude skill), rationale (location/scope/future migration), and testing approach (manual runs with Claude Sonnet 4.6), but leaves all self-checklist items unchecked despite providing some test justification.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

openshift-ci Bot commented Apr 22, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ederign for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jharan1 jharan1 marked this pull request as ready for review April 22, 2026 15:29
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress This PR is in WIP state label Apr 22, 2026
@openshift-ci openshift-ci Bot requested review from iamemilio and rhdedgar April 22, 2026 15:29
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_pr_failures.py`:
- Around line 91-116: The check-run fetch currently calls run_gh once with
"per_page=100" and only gets the default filter (latest), so modify the logic in
fetch_pr_failures.py where run_gh is called to request
"check-runs?per_page=100&filter=all" and implement pagination: loop calling
run_gh with page increments (or follow GitHub API Link headers) to accumulate
all check_runs into data_list, merge all pages' "check_runs" before building
by_name, and keep the existing JSON decode and error handling; ensure you still
sort attempts and detect failures->success transitions for each check name as
before (references: run_gh call, data/get("check_runs"), by_name population, and
rerun_checks computation).
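A minimal sketch of that pagination loop, assuming a gh CLI on PATH and illustrative function names:

```python
import json
import subprocess

def check_runs_endpoint(repo: str, sha: str, page: int) -> str:
    # filter=all returns every attempt, not just the latest run per check name.
    return (
        f"repos/{repo}/commits/{sha}/check-runs"
        f"?per_page=100&filter=all&page={page}"
    )

def fetch_all_check_runs(repo: str, sha: str) -> list[dict]:
    runs: list[dict] = []
    page = 1
    while True:
        result = subprocess.run(
            ["gh", "api", check_runs_endpoint(repo, sha, page)],
            capture_output=True, text=True, timeout=30,
        )
        batch = json.loads(result.stdout).get("check_runs", [])
        runs.extend(batch)
        if len(batch) < 100:  # a short page means we have reached the end
            return runs
        page += 1
```

Following the API's Link headers would be more robust than counting page sizes, but the short-page check keeps the sketch dependency-free.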

In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py`:
- Around line 218-245: The regex used to find failing-test headers only matches
bullets at column 0; update the pattern (the variable named pattern used in
pattern.finditer(text)) to allow optional leading whitespace (e.g. use
^\s*●\s+(.+?)$ with re.MULTILINE) so indented Jest headers are captured, then
leave the rest of the logic (positions list, the for loop over positions,
extracting block, guessed_file detection, and building failures entries)
unchanged so failing_tests is populated for indented output.
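The loosened pattern behaves like this (the sample log is invented for illustration):

```python
import re

# Optional leading whitespace before the bullet so indented Jest headers
# match; the original pattern anchored the bullet at column 0.
pattern = re.compile(r"^\s*●\s+(.+?)$", re.MULTILINE)

sample = (
    "PASS src/ok.spec.ts\n"
    "  ● renders the widget\n"
    "\n"
    "    expect(received).toBe(true)\n"
)
names = [m.group(1) for m in pattern.finditer(sample)]
```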

In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py`:
- Around line 415-430: After applying the args.file filter that narrows
pr_results (the block that builds matched_tests from all_pr_tests), you must
also rebuild the rerun_index/rerun_patterns derived from the original pr_results
so they no longer reference PRs that were filtered out; currently only
failure_index is rebuilt causing stale rerun entries to appear. Modify the same
filtered branch to recompute rerun_index (and any rerun_patterns or rerun_index
uses) from the filtered pr_results (same scope as where failure_index is
rebuilt), and apply the same change to the second identical filtering block
later in the file that also rebuilds failure_index.
- Around line 476-502: The current grouping uses only test name (test_index key
= name), which merges distinct tests from different files; change grouping to
use a composite key of file+test name (e.g., tuple (file, name) or a unique
string like f"{file}::{name}") when populating test_index so each (file,
test_name) pair is tracked separately; update the entry construction in the loop
that builds test_index (references: test_index, t.get("file"), name,
entry["pr_numbers"], entry["errors"]) to store both file and name in the entry,
and then update the output builder (references: output["test_patterns"],
"test_name", "file", failure_count, pr_numbers, errors) to iterate over these
composite keys and emit one record per (file, test_name) pair, keeping the
existing sorting and the >=2 occurrence filter.
- Around line 220-268: The call to the checks API must request filter=all and
paginate through pages (not just the first 100) and the per-check logic must
include pending/in-progress attempts when choosing the latest attempt; update
the run_gh call to iterate pages for
"repos/{repo}/commits/{sha}/check-runs?per_page=100&filter=all" (or equivalent)
until no more results, keep appending all check_runs to data, and then change
the grouping logic (by_name) to also add attempts for pending statuses (use
status as the attempt state when conclusion is empty), remove relying on
pending_names for latest-state detection, and select the latest attempt by
started_at for each name before deciding failure/success/pending; keep
references to the existing symbols _extract_run_job_ids, detect_reruns, by_name,
failing, and rerun_checks when implementing these changes.
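A sketch of the composite-key grouping suggested above, with invented input data:

```python
from collections import defaultdict

def build_test_index(tests: list[dict]) -> dict:
    # Key on (file, name) so identically named tests in different spec
    # files are counted separately instead of being merged.
    index: dict[tuple[str, str], dict] = defaultdict(
        lambda: {"pr_numbers": set(), "errors": []}
    )
    for t in tests:
        key = (t.get("file") or "unknown", t["name"])
        index[key]["pr_numbers"].add(t["pr"])
        index[key]["errors"].append(t.get("error", ""))
    return dict(index)

tests = [
    {"file": "home.spec.ts", "name": "loads", "pr": 7301, "error": "timeout"},
    {"file": "chat.spec.ts", "name": "loads", "pr": 7310, "error": "timeout"},
]
index = build_test_index(tests)
```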

In `@packages/gen-ai/.claude/skills/flake-check/SKILL.md`:
- Around line 23-34: The fenced code blocks in SKILL.md (examples like the
/flake-check command blocks shown) lack language tags causing markdownlint
MD040; update each command/example block (e.g., the blocks containing
"/flake-check 7301", "/flake-check scan --deep", "/flake-check scan 7d", etc.)
to use bash as the language and change report/template/placeholder blocks to use
text; apply the same fixes in the other affected sections referenced (around
lines showing examples and report templates) so all fenced code blocks
consistently include either "bash" for commands or "text" for
templates/placeholders.
- Line 83: The doc line claiming "All scripts output JSON to stdout; errors go
to stderr" is incorrect; update the wording in SKILL.md (the sentence
referencing `${CLAUDE_SKILL_DIR}/scripts/`) to state that both normal results
and JSON error objects are emitted to stdout (not stderr) so the flake-check
skill reads error JSON from stdout; ensure the sentence explicitly says "errors
are emitted as JSON to stdout" and remove the reference to stderr.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: a2be4a23-f7da-4427-8f96-34465a09fab3

📥 Commits

Reviewing files that changed from the base of the PR and between cc83365 and 41ebf28.

📒 Files selected for processing (4)
  • packages/gen-ai/.claude/skills/flake-check/SKILL.md
  • packages/gen-ai/.claude/skills/flake-check/scripts/fetch_pr_failures.py
  • packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py
  • packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py

Comment thread packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py Outdated
Comment thread packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py Outdated
Comment thread packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py
Comment thread packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py Outdated
Comment thread packages/gen-ai/.claude/skills/flake-check/SKILL.md Outdated
Comment thread packages/gen-ai/.claude/skills/flake-check/SKILL.md Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 65.04%. Comparing base (543bc78) to head (b3d170d).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #7350   +/-   ##
=======================================
  Coverage   65.04%   65.04%           
=======================================
  Files        2458     2458           
  Lines       76354    76402   +48     
  Branches    19257    19278   +21     
=======================================
+ Hits        49662    49695   +33     
- Misses      26692    26707   +15     

see 17 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc83365...b3d170d. Read the comment docs.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

♻️ Duplicate comments (1)
packages/gen-ai/.claude/skills/flake-check/SKILL.md (1)

221-221: ⚠️ Potential issue | 🟡 Minor

MD040: fenced blocks still missing language tags.

Lines 221, 299, 394, and 445 still open bare ``` fences. Tag them text (report templates) per markdownlint-cli2.

Also applies to: 299-299, 394-394, 445-445

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/gen-ai/.claude/skills/flake-check/SKILL.md` at line 221, There are
bare fenced code blocks in SKILL.md used for report templates that are missing
language tags; update each of those triple-backtick fences (the report template
blocks) to use the text language tag by changing ``` to ```text for all
occurrences so markdownlint MD040 is satisfied.
🧹 Nitpick comments (3)
packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py (2)

198-203: Chain ArgumentTypeError (Ruff B904).

Preserve the original ValueError context so traceback noise stays useful.

-    try:
-        return date.fromisoformat(value.strip())
-    except ValueError:
-        raise argparse.ArgumentTypeError(
-            f"Invalid date '{value}'. Use ISO format (2026-04-01) or relative (7d, 2w)."
-        )
+    try:
+        return date.fromisoformat(value.strip())
+    except ValueError as err:
+        raise argparse.ArgumentTypeError(
+            f"Invalid date '{value}'. Use ISO format (2026-04-01) or relative (7d, 2w)."
+        ) from err
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py` around lines
198 - 203, The except block currently swallows the original ValueError when
date.fromisoformat fails; update the handler to re-raise
argparse.ArgumentTypeError while preserving the original exception context by
using the "raise ... from e" pattern (capture the ValueError as e and raise
argparse.ArgumentTypeError(...) from e), referencing date.fromisoformat and
argparse.ArgumentTypeError to locate the code.

123-133: Use sys.executable and surface script failures.

Two small things:

  • Hard-coded python3 (Ruff S607) is PATH-dependent and can silently pick a different interpreter than the one running the parent script; sys.executable is more reliable when invoking sibling scripts.
  • fetch_test_failures.py emits {"error": ...} to stdout on failure; json.loads(...).get("failing_tests", []) swallows those errors into empty lists with no warning. Consider logging a stderr diagnostic when r.returncode != 0 so scan reports don't silently drop entire checks.
Proposed fix
-    cmd = ["python3", script, run_id]
+    cmd = [sys.executable, script, run_id]
     if job_id:
         cmd += ["--job", job_id]
     try:
         r = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
-        return json.loads(r.stdout).get("failing_tests", [])
-    except Exception:
-        return []
+        data = json.loads(r.stdout) if r.stdout else {}
+        if r.returncode != 0 or "error" in data:
+            print(f"fetch_test_failures.py failed for run {run_id}: {data.get('error') or r.stderr.strip()}", file=sys.stderr)
+            return []
+        return data.get("failing_tests", [])
+    except (subprocess.TimeoutExpired, json.JSONDecodeError, OSError) as e:
+        print(f"fetch_test_failures.py invocation error for run {run_id}: {e}", file=sys.stderr)
+        return []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py` around lines
123 - 133, In _fetch_tests_for_check, replace the hard-coded "python3" in cmd
with sys.executable to ensure the same interpreter is used (update the cmd
construction that references script and run_id), and after running
subprocess.run (r), check r.returncode; if non-zero, parse r.stdout for an
{"error":...} field or include r.stderr and log or raise a diagnostic instead of
silently returning []; ensure the function surfaces failures (via logging or
raising) when r.returncode != 0 so failing_tests aren't silently dropped.
packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py (1)

347-358: Add strict=True to zip per Ruff B905.

raw_lines and stripped_lines are constructed to be the same length, but making the invariant explicit catches any future refactor that breaks it.

-    prefixed_pairs = [
-        (r, s) for r, s in zip(raw_lines, stripped_lines)
+    prefixed_pairs = [
+        (r, s) for r, s in zip(raw_lines, stripped_lines, strict=True)
         if "\t" in r and _timestamp_re.search(r)
     ][:20]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py`
around lines 347 - 358, The zip over raw_lines and stripped_lines should be
strict to enforce the invariant that they are the same length; in the block that
builds prefixed_pairs (using _timestamp_re and variables raw_lines,
stripped_lines, prefixed_pairs and still_prefixed), change the zip(...) call to
zip(..., strict=True) so any future mismatch raises immediately rather than
silently producing truncated pairs; keep the same timestamp regex check and
slice [:20] logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py`:
- Around line 241-245: The Jest parser is appending failure entries even when
error_parts is empty; update the code around the failures.append(...) in the
Jest parsing function (the block that uses test_name, guessed_file, and
error_parts) to only append when error_parts contains data (e.g., if
error_parts: failures.append({...})). Mirror the guard used in
parse_cypress_failures so stray matches that produce no error context are
skipped.
- Around line 93-111: detect_framework currently treats any check_name
containing "go" as Go which misclassifies names like "argocd" or "google";
update the check in detect_framework to require word-boundary or prefix matches
instead of substring: replace the `if "go" in cn or "bff" in cn:` logic with a
regex word-boundary test (e.g. match r'\bgo\b') and/or explicit prefix checks
(e.g. startswith "go-", "go_", or exact "go"), or better yet consult the
canonical test-runner taxonomy used in scan_prs.py/_DETERMINISTIC_PREFIXES and
match against that set for Go; keep existing lowercasing and preserve the
fallback scanning behavior.
- Around line 256-267: The parser currently skips any line starting with "---",
which discards legitimate error context; update the conditional so it only skips
explicit Go test separators. Replace the check "not line.startswith('---')" with
a regex that detects test separators only, e.g. "not re.match(r'^\s*---
(PASS|FAIL|SKIP):', line)", so indented separators are recognized and other
"---" lines (diffs, user output) are preserved; keep using fail_match =
re.match(r"--- FAIL: (\S+)\s+\(", line) and the same
current/error_parts/failures flow and ERROR_CONTEXT_LINES trimming.
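The narrower separator check behaves like this on a toy log:

```python
import re

# Only skip genuine Go test separators; keep other "---" lines (e.g. diffs
# or user output) as error context.
_separator = re.compile(r"^\s*--- (PASS|FAIL|SKIP):")

lines = [
    "--- FAIL: TestFetch (0.01s)",
    "--- want: 3 lines of diff context",
    "    --- SKIP: TestSlow (0.00s)",
]
kept = [ln for ln in lines if not _separator.match(ln)]
```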

In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py`:
- Around line 463-480: The filters.deep field in the final output currently uses
args.deep (which can be false even when --file forces deep mode); change it to
report the effective deep mode (use the existing run_deep variable or the
equivalent args.deep or bool(args.file) expression) so "filters": { "deep":
run_deep, ... } correctly reflects when deep-mode sections like
rerun_patterns/test_patterns are emitted.
- Around line 289-309: The --limit help text claims a different default
depending on whether --since is provided, but the code always sets
effective_limit = args.limit or 50; update the logic to honor the help text by
computing effective_limit based on whether args.limit is provided and whether
args.since is None (use 20 when no --since, 50 when --since set), or
alternatively change the parser.add_argument("--limit"...) help string to state
an unconditional default of 50; locate the argument definition for "--limit" and
the computation effective_limit = args.limit or 50 and make the chosen single
consistent fix so help and behavior match.
- Around line 259-269: The code collects attempts for each check name then sorts
by started_at but queued runs have started_at == "" which sorts first and breaks
chronology; update the collection loop that sets started_at (variable started_at
from run.get("started_at")) to either (a) skip attempts where started_at is
falsy/None before appending into by_name[name], or (b) fallback to
run.get("created_at") when started_at is missing so attempts.append(...)
contains a reliable timestamp; ensure the sorting (attempts.sort(key=lambda x:
x[0])) and subsequent detect_reruns logic (variable detect_reruns loop over
attempts and rerun_checks.add(name)) and the later latest-state selection
operate only on entries with valid timestamps.
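The started_at fallback could look like this (field names follow the GitHub checks API; the run data here is invented):

```python
runs = [
    {"started_at": "", "created_at": "2026-04-22T10:00:00Z"},  # still queued
    {"started_at": "2026-04-22T09:00:00Z", "created_at": "2026-04-22T08:59:00Z"},
]

def attempt_timestamp(run: dict) -> str:
    # An empty started_at (queued run) falls back to created_at, so a blank
    # string never sorts an attempt to the front of the chronology.
    return run.get("started_at") or run.get("created_at") or ""

attempts = sorted((r for r in runs if attempt_timestamp(r)), key=attempt_timestamp)
```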

In `@packages/gen-ai/.claude/skills/flake-check/SKILL.md`:
- Around line 362-370: The documented `filters` shape in SKILL.md is incomplete;
update the bullet under `scan_prs.py` so it lists the actual keys emitted by
scan_prs.py (including `limit_hit`, `bots_excluded`, and `file_filter` in
addition to `since`/`until`/`limit`/`deep`) so the docs match the script’s
output and the report template (which references `filters.limit_hit`). Locate
the `scan_prs.py` description paragraph and the filters bullet (the section that
currently lists `since`/`until`/`limit`/`deep`) and add the additional keys and
brief descriptions to reflect what the script emits.
- Around line 293-302: The Jira template's fenced code block after the
"**\"<test name>\"**" header is broken: the triple-backtick fence that begins
before "Type: Bug ..." only encloses through "Summary:" leaving "Description:"
and "AC:" outside as plain text; fix by moving the closing ``` so the
"Description:" and "AC:" lines are inside the same fenced block (or
alternatively split into two clearly labeled blocks), i.e., adjust the closing
triple-backtick for the fenced block that contains the "Type: Bug..." and
"Summary: ..." lines so it also contains "Description: ..." and "AC: ...".

---

Duplicate comments:
In `@packages/gen-ai/.claude/skills/flake-check/SKILL.md`:
- Line 221: There are bare fenced code blocks in SKILL.md used for report
templates that are missing language tags; update each of those triple-backtick
fences (the report template blocks) to use the text language tag by changing ```
to ```text for all occurrences so markdownlint MD040 is satisfied.

---

Nitpick comments:
In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py`:
- Around line 347-358: The zip over raw_lines and stripped_lines should be
strict to enforce the invariant that they are the same length; in the block that
builds prefixed_pairs (using _timestamp_re and variables raw_lines,
stripped_lines, prefixed_pairs and still_prefixed), change the zip(...) call to
zip(..., strict=True) so any future mismatch raises immediately rather than
silently producing truncated pairs; keep the same timestamp regex check and
slice [:20] logic unchanged.

In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py`:
- Around line 198-203: The except block currently swallows the original
ValueError when date.fromisoformat fails; update the handler to re-raise
argparse.ArgumentTypeError while preserving the original exception context by
using the "raise ... from e" pattern (capture the ValueError as e and raise
argparse.ArgumentTypeError(...) from e), referencing date.fromisoformat and
argparse.ArgumentTypeError to locate the code.
- Around line 123-133: In _fetch_tests_for_check, replace the hard-coded
"python3" in cmd with sys.executable to ensure the same interpreter is used
(update the cmd construction that references script and run_id), and after
running subprocess.run (r), check r.returncode; if non-zero, parse r.stdout for
an {"error":...} field or include r.stderr and log or raise a diagnostic instead
of silently returning []; ensure the function surfaces failures (via logging or
raising) when r.returncode != 0 so failing_tests aren't silently dropped.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: f8fa859f-3dd6-42ad-b54d-c4b43b355b72

📥 Commits

Reviewing files that changed from the base of the PR and between b3d170d and cb079a0.

📒 Files selected for processing (3)
  • packages/gen-ai/.claude/skills/flake-check/SKILL.md
  • packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py
  • packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py

Comment on lines +93 to +111
def detect_framework(lines: list[str], check_name: str | None = None) -> str:
    # Fast path: infer from the job name (reliable and avoids scanning the log)
    if check_name:
        cn = check_name.lower()
        if "cypress" in cn:
            return "cypress"
        if "unit" in cn or "jest" in cn:
            return "jest"
        if "go" in cn or "bff" in cn:
            return "go"
    # Fall back: scan full cleaned log (not just first 200 — failures can appear anywhere)
    sample = "\n".join(lines).lower()
    if "cypress" in sample or "cy." in sample or "timed out retrying" in sample:
        return "cypress"
    if "jest" in sample or "● " in sample or ".spec." in sample:
        return "jest"
    if "--- fail:" in sample or "go test" in sample:
        return "go"
    return "unknown"

⚠️ Potential issue | 🟠 Major

Substring match for "go" in check_name is way too permissive.

"go" in cn will true-match common check names like argocd, google, dialog, gomega, logo, etc., and misclassify them as Go tests. Anchor the match to word boundaries (or check for known prefixes consistent with _DETERMINISTIC_PREFIXES/scan_prs.py's test-runner taxonomy).

Proposed fix
-    if check_name:
-        cn = check_name.lower()
-        if "cypress" in cn:
-            return "cypress"
-        if "unit" in cn or "jest" in cn:
-            return "jest"
-        if "go" in cn or "bff" in cn:
-            return "go"
+    if check_name:
+        cn = check_name.lower()
+        if "cypress" in cn:
+            return "cypress"
+        if "unit" in cn or "jest" in cn:
+            return "jest"
+        if re.search(r"\bgo\b", cn) or "bff" in cn or "go-test" in cn:
+            return "go"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py`
around lines 93 - 111, detect_framework currently treats any check_name
containing "go" as Go which misclassifies names like "argocd" or "google";
update the check in detect_framework to require word-boundary or prefix matches
instead of substring: replace the `if "go" in cn or "bff" in cn:` logic with a
regex word-boundary test (e.g. match r'\bgo\b') and/or explicit prefix checks
(e.g. startswith "go-", "go_", or exact "go"), or better yet consult the
canonical test-runner taxonomy used in scan_prs.py/_DETERMINISTIC_PREFIXES and
match against that set for Go; keep existing lowercasing and preserve the
fallback scanning behavior.

Comment on lines +241 to +245
failures.append({
"name": test_name.strip(),
"file": guessed_file,
"error": "\n".join(error_parts),
})
⚠️ Potential issue | 🟡 Minor

Jest parser emits empty failure entries.

Unlike parse_cypress_failures (which requires error_parts before appending at line 197), this always appends regardless of whether any error context was captured. Any stray line matching ^\s*●\s+... (e.g., in console output, snapshots, or unrelated log noise) becomes a spurious "failing test" with an empty error. Mirror the Cypress gate.

Proposed fix
-        failures.append({
-            "name": test_name.strip(),
-            "file": guessed_file,
-            "error": "\n".join(error_parts),
-        })
+        if error_parts:
+            failures.append({
+                "name": test_name.strip(),
+                "file": guessed_file,
+                "error": "\n".join(error_parts),
+            })
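To illustrate the gate, a reduced sketch of the parser — `parse_jest_failures` here is illustrative only, with the `●` regex and record shape simplified from the script:

```python
import re

def parse_jest_failures(log: str) -> list:
    """Reduced Jest parser: a failure is emitted only once error context exists."""
    failures = []
    test_name = None
    error_parts = []

    def flush():
        # Mirror the Cypress gate: stray "●" lines with no context are dropped.
        if test_name and error_parts:
            failures.append({"name": test_name.strip(), "error": "\n".join(error_parts)})

    for line in log.splitlines():
        m = re.match(r"^\s*●\s+(.+)", line)
        if m:
            flush()
            test_name = m.group(1)
            error_parts = []
        elif test_name and line.strip():
            error_parts.append(line.strip())
    flush()
    return failures

log = "● renders widget\n  Expected true, received false\n● stray bullet, no context\n"
assert [f["name"] for f in parse_jest_failures(log)] == ["renders widget"]
```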
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py`
around lines 241 - 245, The Jest parser is appending failure entries even when
error_parts is empty; update the code around the failures.append(...) in the
Jest parsing function (the block that uses test_name, guessed_file, and
error_parts) to only append when error_parts contains data (e.g., if
error_parts: failures.append({...})). Mirror the guard used in
parse_cypress_failures so stray matches that produce no error context are
skipped.

Comment on lines +256 to +267
for line in lines:
fail_match = re.match(r"--- FAIL: (\S+)\s+\(", line)
if fail_match:
if current:
failures.append({"name": current, "file": None, "error": "\n".join(error_parts[:ERROR_CONTEXT_LINES])})
current = fail_match.group(1)
error_parts = []
elif current and line.strip() and not line.startswith("---"):
error_parts.append(line.strip())

if current:
failures.append({"name": current, "file": None, "error": "\n".join(error_parts[:ERROR_CONTEXT_LINES])})
⚠️ Potential issue | 🟡 Minor

Go parser eats legitimate error context starting with ---.

not line.startswith("---") is meant to skip --- PASS: / --- FAIL: separators, but Go test output frequently contains user-printed lines or diff markers that begin with --- (e.g., cmp.Diff output). The filter also runs after stripping no whitespace, so indented --- FAIL: sub-test markers are captured as error lines. Consider skipping only --- (PASS|FAIL|SKIP): explicitly.

Proposed fix
-        elif current and line.strip() and not line.startswith("---"):
-            error_parts.append(line.strip())
+        elif current and line.strip():
+            if re.match(r"^\s*---\s+(PASS|FAIL|SKIP):", line):
+                continue
+            error_parts.append(line.strip())
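A quick check of the proposed separator regex against representative lines (the sample log lines are invented for illustration): the old blanket `startswith("---")` filter would keep the indented sub-test separator and drop unindented diff-style output, while the anchored regex handles both.

```python
import re

SEPARATOR = re.compile(r"^\s*---\s+(PASS|FAIL|SKIP):")

context_lines = [
    "    thing_test.go:42: mismatch (-want +got):",
    "    --- want",                                 # diff-style user output: keep
    "        --- FAIL: TestThing/subcase (0.00s)",  # indented separator: drop
]
kept = [l.strip() for l in context_lines if not SEPARATOR.match(l)]
assert kept == ["thing_test.go:42: mismatch (-want +got):", "--- want"]
```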
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py`
around lines 256 - 267, The parser currently skips any line starting with "---",
which discards legitimate error context; update the conditional so it only skips
explicit Go test separators. Replace the check "not line.startswith('---')" with
a regex that detects test separators only, e.g. "not re.match(r'^\s*---
(PASS|FAIL|SKIP):', line)", so indented separators are recognized and other
"---" lines (diffs, user output) are preserved; keep using fail_match =
re.match(r"--- FAIL: (\S+)\s+\(", line) and the same
current/error_parts/failures flow and ERROR_CONTEXT_LINES trimming.

Comment on lines +259 to +269
for name, attempts in by_name.items():
attempts.sort(key=lambda x: x[0]) # chronological order

if detect_reruns:
seen_failure = False
for _, state, _ in attempts:
if state in ("failure", "timed_out"):
seen_failure = True
elif state == "success" and seen_failure:
rerun_checks.add(name)
break
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

GitHub REST API check-runs response when status is queued does started_at field populated

💡 Result:

No, the started_at field is not populated when the status is queued. It is defined as a required string or null in the GitHub REST API check-runs response schema. For queued check runs, which indicate the check has not yet started, started_at is null. Examples in the docs show started_at provided only when creating/updating to in_progress or completed statuses. When a check suite is rerequested, it resets to queued without updating the check run, implying timestamps like started_at remain as-is (likely null if previously unset). Check run actions create queued runs without specifying started_at.

🏁 Script executed:

cat -n packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py | head -280 | tail -50

Repository: opendatahub-io/odh-dashboard

Length of output: 2295


Rerun detection and latest-state selection fail with queued attempts.

On line 244, started_at = run.get("started_at") or "" converts null timestamps (which GitHub's API returns for queued check runs) to empty strings. When sorted on line 260, these empty strings move to the front, breaking the chronological order assumption and causing both rerun detection (lines 264–269) and the latest-state lookup on line 271 to operate on misaligned data.

Filter out attempts with missing started_at before sorting, or fall back to an alternate timestamp field that's always populated (e.g., created_at).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py` around lines
259 - 269, The code collects attempts for each check name then sorts by
started_at but queued runs have started_at == "" which sorts first and breaks
chronology; update the collection loop that sets started_at (variable started_at
from run.get("started_at")) to either (a) skip attempts where started_at is
falsy/None before appending into by_name[name], or (b) fallback to
run.get("created_at") when started_at is missing so attempts.append(...)
contains a reliable timestamp; ensure the sorting (attempts.sort(key=lambda x:
x[0])) and subsequent detect_reruns logic (variable detect_reruns loop over
attempts and rerun_checks.add(name)) and the later latest-state selection
operate only on entries with valid timestamps.
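A minimal sketch of the suggested `created_at` fallback, assuming check-run payloads always carry `created_at` (field names follow the REST response; the sample data is invented):

```python
def attempt_key(run: dict):
    # Queued runs report started_at as null; fall back to created_at.
    return run.get("started_at") or run.get("created_at")

runs = [
    {"status": "queued", "started_at": None,
     "created_at": "2026-04-22T10:05:00Z", "conclusion": None},
    {"status": "completed", "started_at": "2026-04-22T09:00:00Z",
     "created_at": "2026-04-22T08:59:00Z", "conclusion": "failure"},
]
attempts = sorted(((attempt_key(r), r["conclusion"]) for r in runs), key=lambda x: x[0])

# Chronology is preserved: the completed failure sorts before the queued attempt,
# instead of the queued run's empty timestamp jumping to the front.
assert attempts[0] == ("2026-04-22T09:00:00Z", "failure")
```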

Comment on lines +289 to +309
parser.add_argument("--limit", type=int, default=None,
help="Hard cap on number of PRs after filtering (default: 50 with --since, 20 otherwise)")
parser.add_argument("--deep", action="store_true",
help="Detect rerun patterns: checks that failed then passed on the same SHA")
parser.add_argument("--include-bots", action="store_true",
help="Include bot/automation PRs (excluded by default)")
parser.add_argument("--repo", default=None, help="GitHub repo in owner/name format")
parser.add_argument("--file", default=None, metavar="FILENAME",
help="Filter to PRs where this filename appeared in failing test logs (fetches CI logs — slower)")
args = parser.parse_args()
run_deep = args.deep or bool(args.file)

repo = args.repo or detect_repo()
if not repo:
print(json.dumps({"error": "Could not detect GitHub repo. Pass --repo owner/name explicitly."}))
sys.exit(1)

if args.since is None and args.until is None:
args.since = parse_date_arg("7d")

effective_limit = args.limit or 50
⚠️ Potential issue | 🟡 Minor

--limit default contradicts its own help text.

Help on line 290 says "default: 50 with --since, 20 otherwise", but line 309 unconditionally falls back to 50. Either fix the help text or branch on args.since.

Proposed fix
-    effective_limit = args.limit or 50
+    effective_limit = args.limit or (50 if args.since else 20)
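The branch is small enough to sanity-check inline — a standalone sketch of the corrected default (function name is illustrative):

```python
def resolve_limit(limit, since):
    # Honor the help text: default 50 when --since was given, 20 otherwise.
    return limit or (50 if since else 20)

assert resolve_limit(None, None) == 20
assert resolve_limit(None, "2026-04-15T00:00:00") == 50
assert resolve_limit(10, None) == 10  # explicit --limit always wins
```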
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py` around lines
289 - 309, The --limit help text claims a different default depending on whether
--since is provided, but the code always sets effective_limit = args.limit or
50; update the logic to honor the help text by computing effective_limit based
on whether args.limit is provided and whether args.since is None (use 20 when no
--since, 50 when --since set), or alternatively change the
parser.add_argument("--limit"...) help string to state an unconditional default
of 50; locate the argument definition for "--limit" and the computation
effective_limit = args.limit or 50 and make the chosen single consistent fix so
help and behavior match.

Comment on lines +463 to +480
output: dict = {
"repo": repo,
"scanned": scanned_count,
"bots_excluded": bots_excluded,
"with_failures": with_failures,
"all_passing_count": all_passing_count,
"filters": {
"since": args.since.isoformat() if args.since else None,
"until": args.until.isoformat() if args.until else None,
"limit": effective_limit,
"limit_hit": limit_hit,
"bots_excluded": not args.include_bots,
"deep": args.deep,
"file_filter": args.file,
},
"prs": pr_results,
"patterns": patterns,
}
⚠️ Potential issue | 🟡 Minor

filters.deep understates effective mode when --file is used.

--file implies deep (line 299, run_deep = args.deep or bool(args.file)) and is the trigger for emitting rerun_patterns/test_patterns. Reporting "deep": args.deep makes the filter block say deep: false even though the output contains the deep-mode sections — misleading for any consumer keying off the metadata.

Proposed fix
-            "deep": args.deep,
+            "deep": run_deep,
             "file_filter": args.file,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py` around lines
463 - 480, The filters.deep field in the final output currently uses args.deep
(which can be false even when --file forces deep mode); change it to report the
effective deep mode (use the existing run_deep variable or the equivalent
args.deep or bool(args.file) expression) so "filters": { "deep": run_deep, ... }
correctly reflects when deep-mode sections like rerun_patterns/test_patterns are
emitted.

Comment on lines +293 to +302
<For each ⚠️ Suspected Flaky test in the report, generate one block. Substitute all `<...>` placeholders except `<team label if known ...>` and `<team component if known>` — render those two verbatim so the user knows they must supply those values themselves.>

**"<test name>"**
```
Type: Bug | Priority: Normal | Labels: flaky-test, <team label if known e.g. dashboard-crimson-scrum for Gen AI Studio area> | Component: <team component if known>
Summary: test(mock): "<test name>" is intermittently failing in CI
```
Description: test `<file>`, area `<area>`, symptom `<error pattern>`, PRs: `<pr list>`
AC: Investigate root cause (timing race, missing guard, selector instability); fix; verify stable on RHOAI and ODH clusters; remove `@Maintain` tag.

⚠️ Potential issue | 🟡 Minor

Jira template is split across a fence boundary.

Line 296 opens ``` and line 299 closes it after Summary:, but `Description:` (line 300) and `AC:` (line 301) sit outside the fence as prose. Either close after `AC:` or split into two clearly separated blocks — right now the rendered output mixes a code block with plain text that looks like it should be part of the same ticket template.

Proposed fix
 **"<test name>"**
-```
+```text
 Type: Bug | Priority: Normal | Labels: flaky-test, <team label if known e.g. dashboard-crimson-scrum for Gen AI Studio area> | Component: <team component if known>
 Summary: test(mock): "<test name>" is intermittently failing in CI
+Description: test `<file>`, area `<area>`, symptom `<error pattern>`, PRs: `<pr list>`
+AC: Investigate root cause (timing race, missing guard, selector instability); fix; verify stable on RHOAI and ODH clusters; remove `@Maintain` tag.

-Description: test <file>, area <area>, symptom <error pattern>, PRs: <pr list>
-AC: Investigate root cause (timing race, missing guard, selector instability); fix; verify stable on RHOAI and ODH clusters; remove @Maintain tag.


</details>

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.22.0)</summary>

[warning] 299-299: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @packages/gen-ai/.claude/skills/flake-check/SKILL.md around lines 293 - 302,
The Jira template's fenced code block after the **"<test name>"** header is
broken: the triple-backtick fence that begins before "Type: Bug ..." only
encloses through "Summary:" leaving "Description:" and "AC:" outside as plain
text; fix by moving the closing ``` so the "Description:" and "AC:" lines are
inside the same fenced block (or alternatively split into two clearly labeled
blocks), i.e., adjust the closing triple-backtick for the fenced block that
contains the "Type: Bug..." and "Summary: ..." lines so it also contains
"Description: ..." and "AC: ...".


</details>


Comment on lines +362 to +370
`scan_prs.py` returns:
- `prs` — list of PRs with their failing check names
- `patterns` — check names that **visibly failed** on more than one PR, each with `failure_count` and the list of PR numbers it failed on
- `rerun_patterns` — *(with `--deep` or `--file`)* check names that **failed then passed on the same commit SHA** on two or more PRs, with `rerun_count` and PR numbers. This surfaces flaky tests that devs routinely re-run — they won't appear in `patterns` because `statusCheckRollup` only shows the final passing state.
- `test_patterns` — *(with `--deep` or `--file`)* **individual test names** that failed on two or more PRs, each with `failure_count`, `pr_numbers`, and distinct `errors` seen. A test appearing here means the *same specific `it()` block* recurred across PRs — much stronger evidence than the same job recurring. With `--deep`: fetches logs for all failing checks not excluded by `_DETERMINISTIC_PREFIXES` across all scanned PRs (the same test can recur across different check matrix variants — this catches it). With `--file`: same log fetching but filtered to a specific file.
- `bots_excluded` — count of bot PRs filtered out
- `all_passing_count` — PRs where everything passed
- `filters` — the resolved `since`/`until`/`limit`/`deep` values actually used

⚠️ Potential issue | 🟡 Minor

Drift between scan_prs.py output and documented filters shape.

Line 369 lists filters as containing only since/until/limit/deep, but the script (scan_prs.py lines 469–477) also emits limit_hit, bots_excluded, and file_filter. Since the report template on line 398 already relies on filters.limit_hit, update this bullet so consumers don't think those fields are undocumented/unsupported.

-- `filters` — the resolved `since`/`until`/`limit`/`deep` values actually used
+- `filters` — the resolved `since`/`until`/`limit`/`limit_hit`/`deep`/`bots_excluded`/`file_filter` values actually used
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/gen-ai/.claude/skills/flake-check/SKILL.md` around lines 362 - 370,
The documented `filters` shape in SKILL.md is incomplete; update the bullet
under `scan_prs.py` so it lists the actual keys emitted by scan_prs.py
(including `limit_hit`, `bots_excluded`, and `file_filter` in addition to
`since`/`until`/`limit`/`deep`) so the docs match the script’s output and the
report template (which references `filters.limit_hit`). Locate the `scan_prs.py`
description paragraph and the filters bullet (the section that currently lists
`since`/`until`/`limit`/`deep`) and add the additional keys and brief
descriptions to reflect what the script emits.
