Add experimental flake-check claude skill to Gen AI package #7350
jharan1 wants to merge 13 commits into opendatahub-io:main
Conversation
Skipping CI for Draft Pull Request.
📝 Walkthrough: Adds a new Claude skill
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Security concerns. Actionable fixes: replace shell invocations with subprocess.run([...]) or a GH API client, add JSON schema validation, implement input whitelists/regex checks, cap concurrency and add retries/backoff, and sanitize/redact log outputs.

Code quality issues
🚥 Pre-merge checks | ✅ 4 passed

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_pr_failures.py`:
- Around line 91-116: The check-run fetch currently calls run_gh once with
"per_page=100" and only gets the default filter (latest), so modify the logic in
fetch_pr_failures.py where run_gh is called to request
"check-runs?per_page=100&filter=all" and implement pagination: loop calling
run_gh with page increments (or follow GitHub API Link headers) to accumulate
all check_runs into data_list, merge all pages' "check_runs" before building
by_name, and keep the existing JSON decode and error handling; ensure you still
sort attempts and detect failures->success transitions for each check name as
before (references: run_gh call, data/get("check_runs"), by_name population, and
rerun_checks computation).
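A minimal sketch of that pagination loop, assuming a `run_gh(path)` helper that returns the raw JSON body as a string (the skill's actual helper name and signature may differ):

```python
import json

def fetch_all_check_runs(run_gh, repo: str, sha: str) -> list[dict]:
    """Accumulate check_runs across all pages with filter=all, not just page 1."""
    all_runs: list[dict] = []
    page = 1
    while True:
        raw = run_gh(
            f"repos/{repo}/commits/{sha}/check-runs"
            f"?per_page=100&filter=all&page={page}"
        )
        data = json.loads(raw)
        runs = data.get("check_runs", [])
        all_runs.extend(runs)
        if len(runs) < 100:  # short page means there are no further results
            break
        page += 1
    return all_runs
```

The merged list can then feed the existing `by_name` grouping unchanged.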
In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py`:
- Around line 218-245: The regex used to find failing-test headers only matches
bullets at column 0; update the pattern (the variable named pattern used in
pattern.finditer(text)) to allow optional leading whitespace (e.g. use
^\s*●\s+(.+?)$ with re.MULTILINE) so indented Jest headers are captured, then
leave the rest of the logic (positions list, the for loop over positions,
extracting block, guessed_file detection, and building failures entries)
unchanged so failing_tests is populated for indented output.
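A small, self-contained illustration of the suggested `re.MULTILINE` pattern (the log lines here are invented for the example):

```python
import re

# Allow optional leading whitespace so indented Jest headers are captured too.
pattern = re.compile(r"^\s*●\s+(.+?)$", re.MULTILINE)

log = (
    "PASS src/a.test.ts\n"
    "  ● renders the widget\n"        # indented header: a bare ^● pattern misses this
    "    expect(received).toBe(1)\n"
    "● top-level failure\n"
)
names = [m.group(1) for m in pattern.finditer(log)]
```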
In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py`:
- Around line 415-430: After applying the args.file filter that narrows
pr_results (the block that builds matched_tests from all_pr_tests), you must
also rebuild the rerun_index/rerun_patterns derived from the original pr_results
so they no longer reference PRs that were filtered out; currently only
failure_index is rebuilt causing stale rerun entries to appear. Modify the same
filtered branch to recompute rerun_index (and any rerun_patterns or rerun_index
uses) from the filtered pr_results (same scope as where failure_index is
rebuilt), and apply the same change to the second identical filtering block
later in the file that also rebuilds failure_index.
- Around line 476-502: The current grouping uses only test name (test_index key
= name), which merges distinct tests from different files; change grouping to
use a composite key of file+test name (e.g., tuple (file, name) or a unique
string like f"{file}::{name}") when populating test_index so each (file,
test_name) pair is tracked separately; update the entry construction in the loop
that builds test_index (references: test_index, t.get("file"), name,
entry["pr_numbers"], entry["errors"]) to store both file and name in the entry,
and then update the output builder (references: output["test_patterns"],
"test_name", "file", failure_count, pr_numbers, errors) to iterate over these
composite keys and emit one record per (file, test_name) pair, keeping the
existing sorting and the >=2 occurrence filter.
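A hedged sketch of the composite-key grouping; the `pr` and `error` field names are assumptions for illustration and may not match the real entry shape:

```python
def group_failures(failing_tests: list[dict]) -> list[dict]:
    """Group by (file, test name) so same-named tests in different files stay distinct."""
    test_index: dict[tuple, dict] = {}
    for t in failing_tests:
        key = (t.get("file"), t["name"])  # composite key instead of name alone
        entry = test_index.setdefault(
            key,
            {"file": t.get("file"), "test_name": t["name"],
             "pr_numbers": set(), "errors": []},
        )
        entry["pr_numbers"].add(t["pr"])
        if t.get("error"):
            entry["errors"].append(t["error"])
    # Emit one record per (file, test_name) pair, keeping the >=2 occurrence filter.
    return [
        {**e, "pr_numbers": sorted(e["pr_numbers"]),
         "failure_count": len(e["pr_numbers"])}
        for e in test_index.values()
        if len(e["pr_numbers"]) >= 2
    ]
```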
- Around line 220-268: The call to the checks API must request filter=all and
paginate through pages (not just the first 100) and the per-check logic must
include pending/in-progress attempts when choosing the latest attempt; update
the run_gh call to iterate pages for
"repos/{repo}/commits/{sha}/check-runs?per_page=100&filter=all" (or equivalent)
until no more results, keep appending all check_runs to data, and then change
the grouping logic (by_name) to also add attempts for pending statuses (use
status as the attempt state when conclusion is empty), remove relying on
pending_names for latest-state detection, and select the latest attempt by
started_at for each name before deciding failure/success/pending; keep
references to the existing symbols _extract_run_job_ids, detect_reruns, by_name,
failing, and rerun_checks when implementing these changes.
In `@packages/gen-ai/.claude/skills/flake-check/SKILL.md`:
- Around line 23-34: The fenced code blocks in SKILL.md (examples like the
/flake-check command blocks shown) lack language tags causing markdownlint
MD040; update each command/example block (e.g., the blocks containing
"/flake-check 7301", "/flake-check scan --deep", "/flake-check scan 7d", etc.)
to use bash as the language and change report/template/placeholder blocks to use
text; apply the same fixes in the other affected sections referenced (around
lines showing examples and report templates) so all fenced code blocks
consistently include either "bash" for commands or "text" for
templates/placeholders.
- Line 83: The doc line claiming "All scripts output JSON to stdout; errors go
to stderr" is incorrect; update the wording in SKILL.md (the sentence
referencing `${CLAUDE_SKILL_DIR}/scripts/`) to state that both normal results
and JSON error objects are emitted to stdout (not stderr) so the flake-check
skill reads error JSON from stdout; ensure the sentence explicitly says "errors
are emitted as JSON to stdout" and remove the reference to stderr.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: a2be4a23-f7da-4427-8f96-34465a09fab3
📒 Files selected for processing (4)
- packages/gen-ai/.claude/skills/flake-check/SKILL.md
- packages/gen-ai/.claude/skills/flake-check/scripts/fetch_pr_failures.py
- packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py
- packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```text
@@ Coverage Diff @@
##             main    #7350   +/- ##
=======================================
  Coverage   65.04%   65.04%
=======================================
  Files        2458     2458
  Lines       76354    76402   +48
  Branches    19257    19278   +21
=======================================
+ Hits        49662    49695   +33
- Misses      26692    26707   +15
```

See 17 files with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
There was a problem hiding this comment.
Actionable comments posted: 8
♻️ Duplicate comments (1)
packages/gen-ai/.claude/skills/flake-check/SKILL.md (1)
221-221: ⚠️ Potential issue | 🟡 Minor. MD040: fenced blocks still missing language tags.
Lines 221, 299, 394, and 445 still open bare ``` fences. Tag them `text` (report templates) per markdownlint-cli2. Also applies to: 299-299, 394-394, 445-445.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/gen-ai/.claude/skills/flake-check/SKILL.md` at line 221, There are bare fenced code blocks in SKILL.md used for report templates that are missing language tags; update each of those triple-backtick fences (the report template blocks) to use the text language tag by changing ``` to ```text for all occurrences so markdownlint MD040 is satisfied.
🧹 Nitpick comments (3)
packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py (2)
198-203: Chain `ArgumentTypeError` (Ruff B904). Preserve the original `ValueError` context so traceback noise stays useful.

```diff
-    try:
-        return date.fromisoformat(value.strip())
-    except ValueError:
-        raise argparse.ArgumentTypeError(
-            f"Invalid date '{value}'. Use ISO format (2026-04-01) or relative (7d, 2w)."
-        )
+    try:
+        return date.fromisoformat(value.strip())
+    except ValueError as err:
+        raise argparse.ArgumentTypeError(
+            f"Invalid date '{value}'. Use ISO format (2026-04-01) or relative (7d, 2w)."
+        ) from err
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py` around lines 198 - 203, The except block currently swallows the original ValueError when date.fromisoformat fails; update the handler to re-raise argparse.ArgumentTypeError while preserving the original exception context by using the "raise ... from e" pattern (capture the ValueError as e and raise argparse.ArgumentTypeError(...) from e), referencing date.fromisoformat and argparse.ArgumentTypeError to locate the code.
123-133: Use `sys.executable` and surface script failures. Two small things:

- Hard-coded `python3` (Ruff S607) is PATH-dependent and can silently pick a different interpreter than the one running the parent script; `sys.executable` is more reliable when invoking sibling scripts.
- `fetch_test_failures.py` emits `{"error": ...}` to stdout on failure; `json.loads(...).get("failing_tests", [])` swallows those errors into empty lists with no warning. Consider logging a stderr diagnostic when `r.returncode != 0` so scan reports don't silently drop entire checks.

Proposed fix

```diff
-    cmd = ["python3", script, run_id]
+    cmd = [sys.executable, script, run_id]
     if job_id:
         cmd += ["--job", job_id]
     try:
         r = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
-        return json.loads(r.stdout).get("failing_tests", [])
-    except Exception:
-        return []
+        data = json.loads(r.stdout) if r.stdout else {}
+        if r.returncode != 0 or "error" in data:
+            print(f"fetch_test_failures.py failed for run {run_id}: {data.get('error') or r.stderr.strip()}", file=sys.stderr)
+            return []
+        return data.get("failing_tests", [])
+    except (subprocess.TimeoutExpired, json.JSONDecodeError, OSError) as e:
+        print(f"fetch_test_failures.py invocation error for run {run_id}: {e}", file=sys.stderr)
+        return []
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py` around lines 123 - 133, In _fetch_tests_for_check, replace the hard-coded "python3" in cmd with sys.executable to ensure the same interpreter is used (update the cmd construction that references script and run_id), and after running subprocess.run (r), check r.returncode; if non-zero, parse r.stdout for an {"error":...} field or include r.stderr and log or raise a diagnostic instead of silently returning []; ensure the function surfaces failures (via logging or raising) when r.returncode != 0 so failing_tests aren't silently dropped.

packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py (1)
347-358: Add `strict=True` to `zip` per Ruff B905.
`raw_lines` and `stripped_lines` are constructed to be the same length, but making the invariant explicit catches any future refactor that breaks it.

```diff
-    prefixed_pairs = [
-        (r, s) for r, s in zip(raw_lines, stripped_lines)
+    prefixed_pairs = [
+        (r, s) for r, s in zip(raw_lines, stripped_lines, strict=True)
         if "\t" in r and _timestamp_re.search(r)
     ][:20]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py` around lines 347 - 358, The zip over raw_lines and stripped_lines should be strict to enforce the invariant that they are the same length; in the block that builds prefixed_pairs (using _timestamp_re and variables raw_lines, stripped_lines, prefixed_pairs and still_prefixed), change the zip(...) call to zip(..., strict=True) so any future mismatch raises immediately rather than silently producing truncated pairs; keep the same timestamp regex check and slice [:20] logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py`:
- Around line 241-245: The Jest parser is appending failure entries even when
error_parts is empty; update the code around the failures.append(...) in the
Jest parsing function (the block that uses test_name, guessed_file, and
error_parts) to only append when error_parts contains data (e.g., if
error_parts: failures.append({...})). Mirror the guard used in
parse_cypress_failures so stray matches that produce no error context are
skipped.
- Around line 93-111: detect_framework currently treats any check_name
containing "go" as Go which misclassifies names like "argocd" or "google";
update the check in detect_framework to require word-boundary or prefix matches
instead of substring: replace the `if "go" in cn or "bff" in cn:` logic with a
regex word-boundary test (e.g. match r'\bgo\b') and/or explicit prefix checks
(e.g. startswith "go-", "go_", or exact "go"), or better yet consult the
canonical test-runner taxonomy used in scan_prs.py/_DETERMINISTIC_PREFIXES and
match against that set for Go; keep existing lowercasing and preserve the
fallback scanning behavior.
- Around line 256-267: The parser currently skips any line starting with "---",
which discards legitimate error context; update the conditional so it only skips
explicit Go test separators. Replace the check "not line.startswith('---')" with
a regex that detects test separators only, e.g. "not re.match(r'^\s*---
(PASS|FAIL|SKIP):', line)", so indented separators are recognized and other
"---" lines (diffs, user output) are preserved; keep using fail_match =
re.match(r"--- FAIL: (\S+)\s+\(", line) and the same
current/error_parts/failures flow and ERROR_CONTEXT_LINES trimming.
In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py`:
- Around line 463-480: The filters.deep field in the final output currently uses
args.deep (which can be false even when --file forces deep mode); change it to
report the effective deep mode (use the existing run_deep variable or the
equivalent args.deep or bool(args.file) expression) so "filters": { "deep":
run_deep, ... } correctly reflects when deep-mode sections like
rerun_patterns/test_patterns are emitted.
- Around line 289-309: The --limit help text claims a different default
depending on whether --since is provided, but the code always sets
effective_limit = args.limit or 50; update the logic to honor the help text by
computing effective_limit based on whether args.limit is provided and whether
args.since is None (use 20 when no --since, 50 when --since set), or
alternatively change the parser.add_argument("--limit"...) help string to state
an unconditional default of 50; locate the argument definition for "--limit" and
the computation effective_limit = args.limit or 50 and make the chosen single
consistent fix so help and behavior match.
- Around line 259-269: The code collects attempts for each check name then sorts
by started_at but queued runs have started_at == "" which sorts first and breaks
chronology; update the collection loop that sets started_at (variable started_at
from run.get("started_at")) to either (a) skip attempts where started_at is
falsy/None before appending into by_name[name], or (b) fallback to
run.get("created_at") when started_at is missing so attempts.append(...)
contains a reliable timestamp; ensure the sorting (attempts.sort(key=lambda x:
x[0])) and subsequent detect_reruns logic (variable detect_reruns loop over
attempts and rerun_checks.add(name)) and the later latest-state selection
operate only on entries with valid timestamps.
In `@packages/gen-ai/.claude/skills/flake-check/SKILL.md`:
- Around line 362-370: The documented `filters` shape in SKILL.md is incomplete;
update the bullet under `scan_prs.py` so it lists the actual keys emitted by
scan_prs.py (including `limit_hit`, `bots_excluded`, and `file_filter` in
addition to `since`/`until`/`limit`/`deep`) so the docs match the script’s
output and the report template (which references `filters.limit_hit`). Locate
the `scan_prs.py` description paragraph and the filters bullet (the section that
currently lists `since`/`until`/`limit`/`deep`) and add the additional keys and
brief descriptions to reflect what the script emits.
- Around line 293-302: The Jira template's fenced code block after the
"**\"<test name>\"**" header is broken: the triple-backtick fence that begins
before "Type: Bug ..." only encloses through "Summary:" leaving "Description:"
and "AC:" outside as plain text; fix by moving the closing ``` so the
"Description:" and "AC:" lines are inside the same fenced block (or
alternatively split into two clearly labeled blocks), i.e., adjust the closing
triple-backtick for the fenced block that contains the "Type: Bug..." and
"Summary: ..." lines so it also contains "Description: ..." and "AC: ...".
---
Duplicate comments:
In `@packages/gen-ai/.claude/skills/flake-check/SKILL.md`:
- Line 221: There are bare fenced code blocks in SKILL.md used for report
templates that are missing language tags; update each of those triple-backtick
fences (the report template blocks) to use the text language tag by changing ```
to ```text for all occurrences so markdownlint MD040 is satisfied.
---
Nitpick comments:
In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py`:
- Around line 347-358: The zip over raw_lines and stripped_lines should be
strict to enforce the invariant that they are the same length; in the block that
builds prefixed_pairs (using _timestamp_re and variables raw_lines,
stripped_lines, prefixed_pairs and still_prefixed), change the zip(...) call to
zip(..., strict=True) so any future mismatch raises immediately rather than
silently producing truncated pairs; keep the same timestamp regex check and
slice [:20] logic unchanged.
In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py`:
- Around line 198-203: The except block currently swallows the original
ValueError when date.fromisoformat fails; update the handler to re-raise
argparse.ArgumentTypeError while preserving the original exception context by
using the "raise ... from e" pattern (capture the ValueError as e and raise
argparse.ArgumentTypeError(...) from e), referencing date.fromisoformat and
argparse.ArgumentTypeError to locate the code.
- Around line 123-133: In _fetch_tests_for_check, replace the hard-coded
"python3" in cmd with sys.executable to ensure the same interpreter is used
(update the cmd construction that references script and run_id), and after
running subprocess.run (r), check r.returncode; if non-zero, parse r.stdout for
an {"error":...} field or include r.stderr and log or raise a diagnostic instead
of silently returning []; ensure the function surfaces failures (via logging or
raising) when r.returncode != 0 so failing_tests aren't silently dropped.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited), Organization UI (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: f8fa859f-3dd6-42ad-b54d-c4b43b355b72
📒 Files selected for processing (3)
- packages/gen-ai/.claude/skills/flake-check/SKILL.md
- packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py
- packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py
```python
def detect_framework(lines: list[str], check_name: str | None = None) -> str:
    # Fast path: infer from the job name (reliable and avoids scanning the log)
    if check_name:
        cn = check_name.lower()
        if "cypress" in cn:
            return "cypress"
        if "unit" in cn or "jest" in cn:
            return "jest"
        if "go" in cn or "bff" in cn:
            return "go"
    # Fall back: scan full cleaned log (not just first 200 — failures can appear anywhere)
    sample = "\n".join(lines).lower()
    if "cypress" in sample or "cy." in sample or "timed out retrying" in sample:
        return "cypress"
    if "jest" in sample or "● " in sample or ".spec." in sample:
        return "jest"
    if "--- fail:" in sample or "go test" in sample:
        return "go"
    return "unknown"
```
Substring match for "go" in check_name is way too permissive.
"go" in cn will true-match common check names like argocd, google, dialog, gomega, logo, etc., and misclassify them as Go tests. Anchor the match to word boundaries (or check for known prefixes consistent with _DETERMINISTIC_PREFIXES/scan_prs.py's test-runner taxonomy).
Proposed fix
```diff
-    if check_name:
-        cn = check_name.lower()
-        if "cypress" in cn:
-            return "cypress"
-        if "unit" in cn or "jest" in cn:
-            return "jest"
-        if "go" in cn or "bff" in cn:
-            return "go"
+    if check_name:
+        cn = check_name.lower()
+        if "cypress" in cn:
+            return "cypress"
+        if "unit" in cn or "jest" in cn:
+            return "jest"
+        if re.search(r"\bgo\b", cn) or "bff" in cn or "go-test" in cn:
+            return "go"
```
+ return "go"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py`
around lines 93 - 111, detect_framework currently treats any check_name
containing "go" as Go which misclassifies names like "argocd" or "google";
update the check in detect_framework to require word-boundary or prefix matches
instead of substring: replace the `if "go" in cn or "bff" in cn:` logic with a
regex word-boundary test (e.g. match r'\bgo\b') and/or explicit prefix checks
(e.g. startswith "go-", "go_", or exact "go"), or better yet consult the
canonical test-runner taxonomy used in scan_prs.py/_DETERMINISTIC_PREFIXES and
match against that set for Go; keep existing lowercasing and preserve the
fallback scanning behavior.
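A quick standalone check of the word-boundary approach suggested above; `looks_like_go` is a hypothetical helper isolating just the Go branch:

```python
import re

def looks_like_go(check_name: str) -> bool:
    """Match 'go' only as a whole word (plus the known 'bff' checks)."""
    cn = check_name.lower()
    return bool(re.search(r"\bgo\b", cn)) or "bff" in cn
```

`\bgo\b` still matches hyphenated names like `go-test` (the hyphen is a word boundary) while rejecting substring hits like `argocd` or `google`.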
```python
failures.append({
    "name": test_name.strip(),
    "file": guessed_file,
    "error": "\n".join(error_parts),
})
```
Jest parser emits empty failure entries.
Unlike parse_cypress_failures (which requires error_parts before appending at line 197), this always appends regardless of whether any error context was captured. Any stray line matching ^\s*●\s+... (e.g., in console output, snapshots, or unrelated log noise) becomes a spurious "failing test" with an empty error. Mirror the Cypress gate.
Proposed fix
```diff
-    failures.append({
-        "name": test_name.strip(),
-        "file": guessed_file,
-        "error": "\n".join(error_parts),
-    })
+    if error_parts:
+        failures.append({
+            "name": test_name.strip(),
+            "file": guessed_file,
+            "error": "\n".join(error_parts),
+        })
```
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
if error_parts:
    failures.append({
        "name": test_name.strip(),
        "file": guessed_file,
        "error": "\n".join(error_parts),
    })
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py`
around lines 241 - 245, The Jest parser is appending failure entries even when
error_parts is empty; update the code around the failures.append(...) in the
Jest parsing function (the block that uses test_name, guessed_file, and
error_parts) to only append when error_parts contains data (e.g., if
error_parts: failures.append({...})). Mirror the guard used in
parse_cypress_failures so stray matches that produce no error context are
skipped.
```python
for line in lines:
    fail_match = re.match(r"--- FAIL: (\S+)\s+\(", line)
    if fail_match:
        if current:
            failures.append({"name": current, "file": None, "error": "\n".join(error_parts[:ERROR_CONTEXT_LINES])})
        current = fail_match.group(1)
        error_parts = []
    elif current and line.strip() and not line.startswith("---"):
        error_parts.append(line.strip())

if current:
    failures.append({"name": current, "file": None, "error": "\n".join(error_parts[:ERROR_CONTEXT_LINES])})
```
Go parser eats legitimate error context starting with ---.
not line.startswith("---") is meant to skip --- PASS: / --- FAIL: separators, but Go test output frequently contains user-printed lines or diff markers that begin with --- (e.g., cmp.Diff output). The filter also runs after stripping no whitespace, so indented --- FAIL: sub-test markers are captured as error lines. Consider skipping only --- (PASS|FAIL|SKIP): explicitly.
Proposed fix
```diff
-    elif current and line.strip() and not line.startswith("---"):
-        error_parts.append(line.strip())
+    elif current and line.strip():
+        if re.match(r"^\s*---\s+(PASS|FAIL|SKIP):", line):
+            continue
+        error_parts.append(line.strip())
```
+ error_parts.append(line.strip())📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| for line in lines: | |
| fail_match = re.match(r"--- FAIL: (\S+)\s+\(", line) | |
| if fail_match: | |
| if current: | |
| failures.append({"name": current, "file": None, "error": "\n".join(error_parts[:ERROR_CONTEXT_LINES])}) | |
| current = fail_match.group(1) | |
| error_parts = [] | |
| elif current and line.strip() and not line.startswith("---"): | |
| error_parts.append(line.strip()) | |
| if current: | |
| failures.append({"name": current, "file": None, "error": "\n".join(error_parts[:ERROR_CONTEXT_LINES])}) | |
| for line in lines: | |
| fail_match = re.match(r"--- FAIL: (\S+)\s+\(", line) | |
| if fail_match: | |
| if current: | |
| failures.append({"name": current, "file": None, "error": "\n".join(error_parts[:ERROR_CONTEXT_LINES])}) | |
| current = fail_match.group(1) | |
| error_parts = [] | |
| elif current and line.strip(): | |
| if re.match(r"^\s*---\s+(PASS|FAIL|SKIP):", line): | |
| continue | |
| error_parts.append(line.strip()) | |
| if current: | |
| failures.append({"name": current, "file": None, "error": "\n".join(error_parts[:ERROR_CONTEXT_LINES])}) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/gen-ai/.claude/skills/flake-check/scripts/fetch_test_failures.py`
around lines 256 - 267, The parser currently skips any line starting with "---",
which discards legitimate error context; update the conditional so it only skips
explicit Go test separators. Replace the check "not line.startswith('---')" with
a regex that detects test separators only, e.g. "not re.match(r'^\s*---
(PASS|FAIL|SKIP):', line)", so indented separators are recognized and other
"---" lines (diffs, user output) are preserved; keep using fail_match =
re.match(r"--- FAIL: (\S+)\s+\(", line) and the same
current/error_parts/failures flow and ERROR_CONTEXT_LINES trimming.
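The separator-only regex can be exercised in isolation; `is_go_separator` is a hypothetical helper written for illustration:

```python
import re

# Skip only real Go test separators (indented sub-test markers included);
# keep diff markers and user output that merely start with "---".
_SEP = re.compile(r"^\s*--- (PASS|FAIL|SKIP):")

def is_go_separator(line: str) -> bool:
    return bool(_SEP.match(line))
```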
```python
for name, attempts in by_name.items():
    attempts.sort(key=lambda x: x[0])  # chronological order

    if detect_reruns:
        seen_failure = False
        for _, state, _ in attempts:
            if state in ("failure", "timed_out"):
                seen_failure = True
            elif state == "success" and seen_failure:
                rerun_checks.add(name)
                break
```
🧩 Analysis chain
🌐 Web query:
GitHub REST API check-runs response when status is queued does started_at field populated
💡 Result:
No, the started_at field is not populated when the status is queued. It is defined as a required string or null in the GitHub REST API check-runs response schema. For queued check runs, which indicate the check has not yet started, started_at is null. Examples in the docs show started_at provided only when creating/updating to in_progress or completed statuses. When a check suite is rerequested, it resets to queued without updating the check run, implying timestamps like started_at remain as-is (likely null if previously unset). Check run actions create queued runs without specifying started_at.
Citations:
- 1: https://docs.github.com/en/rest/checks/runs
- 2: https://docs.github.com/rest/checks/runs
- 3: https://docs.github.com/v3/checks/runs
- 4: https://docs.github.com/en/rest/checks/runs?apiVersion=2026-03-10
🏁 Script executed:

```bash
cat -n packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py | head -280 | tail -50
```

Repository: opendatahub-io/odh-dashboard
Length of output: 2295
Rerun detection and latest-state selection fail with queued attempts.
On line 244, started_at = run.get("started_at") or "" converts null timestamps (which GitHub's API returns for queued check runs) to empty strings. When sorted on line 260, these empty strings move to the front, breaking the chronological order assumption and causing both rerun detection (lines 264–269) and the latest-state lookup on line 271 to operate on misaligned data.
Filter out attempts with missing started_at before sorting, or fall back to an alternate timestamp field that's always populated (e.g., created_at).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py` around lines
259 - 269, The code collects attempts for each check name then sorts by
started_at but queued runs have started_at == "" which sorts first and breaks
chronology; update the collection loop that sets started_at (variable started_at
from run.get("started_at")) to either (a) skip attempts where started_at is
falsy/None before appending into by_name[name], or (b) fallback to
run.get("created_at") when started_at is missing so attempts.append(...)
contains a reliable timestamp; ensure the sorting (attempts.sort(key=lambda x:
x[0])) and subsequent detect_reruns logic (variable detect_reruns loop over
attempts and rerun_checks.add(name)) and the later latest-state selection
operate only on entries with valid timestamps.
```python
parser.add_argument("--limit", type=int, default=None,
                    help="Hard cap on number of PRs after filtering (default: 50 with --since, 20 otherwise)")
parser.add_argument("--deep", action="store_true",
                    help="Detect rerun patterns: checks that failed then passed on the same SHA")
parser.add_argument("--include-bots", action="store_true",
                    help="Include bot/automation PRs (excluded by default)")
parser.add_argument("--repo", default=None, help="GitHub repo in owner/name format")
parser.add_argument("--file", default=None, metavar="FILENAME",
                    help="Filter to PRs where this filename appeared in failing test logs (fetches CI logs — slower)")
args = parser.parse_args()
run_deep = args.deep or bool(args.file)

repo = args.repo or detect_repo()
if not repo:
    print(json.dumps({"error": "Could not detect GitHub repo. Pass --repo owner/name explicitly."}))
    sys.exit(1)

if args.since is None and args.until is None:
    args.since = parse_date_arg("7d")

effective_limit = args.limit or 50
```
--limit default contradicts its own help text.
Help on line 290 says "default: 50 with --since, 20 otherwise", but line 309 unconditionally falls back to 50. Either fix the help text or branch on args.since.
Proposed fix
```diff
-effective_limit = args.limit or 50
+effective_limit = args.limit or (50 if args.since else 20)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py` around lines
289 - 309, The --limit help text claims a different default depending on whether
--since is provided, but the code always sets effective_limit = args.limit or
50; update the logic to honor the help text by computing effective_limit based
on whether args.limit is provided and whether args.since is None (use 20 when no
--since, 50 when --since set), or alternatively change the
parser.add_argument("--limit"...) help string to state an unconditional default
of 50; locate the argument definition for "--limit" and the computation
effective_limit = args.limit or 50 and make the chosen single consistent fix so
help and behavior match.
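A minimal sketch of the branching default the review suggests. The flag names mirror the quoted snippet; `resolve_limit` is a hypothetical helper added only for illustration.

```python
import argparse

# Flag names mirror the quoted scan_prs.py snippet; everything else is
# illustrative, not the script's actual structure.
parser = argparse.ArgumentParser()
parser.add_argument("--since", default=None)
parser.add_argument("--limit", type=int, default=None)

def resolve_limit(argv):
    args = parser.parse_args(argv)
    # Branch on --since so the behavior matches the documented default.
    return args.limit or (50 if args.since else 20)

with_since = resolve_limit(["--since", "7d"])   # window given: cap at 50
without_since = resolve_limit([])               # no window: cap at 20
explicit = resolve_limit(["--limit", "5"])      # explicit value always wins
```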
```python
output: dict = {
    "repo": repo,
    "scanned": scanned_count,
    "bots_excluded": bots_excluded,
    "with_failures": with_failures,
    "all_passing_count": all_passing_count,
    "filters": {
        "since": args.since.isoformat() if args.since else None,
        "until": args.until.isoformat() if args.until else None,
        "limit": effective_limit,
        "limit_hit": limit_hit,
        "bots_excluded": not args.include_bots,
        "deep": args.deep,
        "file_filter": args.file,
    },
    "prs": pr_results,
    "patterns": patterns,
}
```
filters.deep understates effective mode when --file is used.
--file implies deep (line 299, run_deep = args.deep or bool(args.file)) and is the trigger for emitting rerun_patterns/test_patterns. Reporting "deep": args.deep makes the filter block say deep: false even though the output contains the deep-mode sections — misleading for any consumer keying off the metadata.
Proposed fix
```diff
-        "deep": args.deep,
+        "deep": run_deep,
         "file_filter": args.file,
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py` around lines
463 - 480, The filters.deep field in the final output currently uses args.deep
(which can be false even when --file forces deep mode); change it to report the
effective deep mode (use the existing run_deep variable or the equivalent
args.deep or bool(args.file) expression) so "filters": { "deep": run_deep, ... }
correctly reflects when deep-mode sections like rerun_patterns/test_patterns are
emitted.
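The mismatch this prompt describes can be reproduced in a few lines. The class and values below are illustrative stand-ins for the parsed arguments, not code from the script.

```python
# Illustrative only: mirrors the quoted snippet's variables to show why
# reporting run_deep keeps the metadata consistent with the emitted sections.
class Args:
    deep = False
    file = "someSpec.cy.ts"  # hypothetical --file value

args = Args()
run_deep = args.deep or bool(args.file)  # --file forces deep mode

# Reporting run_deep (not args.deep) makes the metadata match reality.
filters = {"deep": run_deep, "file_filter": args.file}
```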
````text
<For each ⚠️ Suspected Flaky test in the report, generate one block. Substitute all `<...>` placeholders except `<team label if known ...>` and `<team component if known>` — render those two verbatim so the user knows they must supply those values themselves.>

**"<test name>"**
```
Type: Bug | Priority: Normal | Labels: flaky-test, <team label if known e.g. dashboard-crimson-scrum for Gen AI Studio area> | Component: <team component if known>
Summary: test(mock): "<test name>" is intermittently failing in CI
```
Description: test `<file>`, area `<area>`, symptom `<error pattern>`, PRs: `<pr list>`
AC: Investigate root cause (timing race, missing guard, selector instability); fix; verify stable on RHOAI and ODH clusters; remove `@Maintain` tag.
````
Jira template is split across a fence boundary.
Line 296 opens ``` and line 299 closes it after Summary:, but `Description:` (line 300) and `AC:` (line 301) sit outside the fence as prose. Either close after `AC:` or split into two clearly separated blocks — right now the rendered output mixes a code block with plain text that looks like it should be part of the same ticket template.
Proposed fix
Proposed fix

````diff
 **"<test name>"**
-```
+```text
 Type: Bug | Priority: Normal | Labels: flaky-test, <team label if known e.g. dashboard-crimson-scrum for Gen AI Studio area> | Component: <team component if known>
 Summary: test(mock): "<test name>" is intermittently failing in CI
+Description: test `<file>`, area `<area>`, symptom `<error pattern>`, PRs: `<pr list>`
+AC: Investigate root cause (timing race, missing guard, selector instability); fix; verify stable on RHOAI and ODH clusters; remove `@Maintain` tag.
 ```
-Description: test `<file>`, area `<area>`, symptom `<error pattern>`, PRs: `<pr list>`
-AC: Investigate root cause (timing race, missing guard, selector instability); fix; verify stable on RHOAI and ODH clusters; remove `@Maintain` tag.
````

🧰 Tools

🪛 markdownlint-cli2 (0.22.0)

[warning] 299-299: Fenced code blocks should have a language specified (MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @packages/gen-ai/.claude/skills/flake-check/SKILL.md around lines 293 - 302,
The Jira template's fenced code block after the **"<test name>"** header is
broken: the triple-backtick fence that begins before "Type: Bug ..." only
encloses through "Summary:" leaving "Description:" and "AC:" outside as plain
text; fix by moving the closing ``` so the "Description:" and "AC:" lines are
inside the same fenced block (or alternatively split into two clearly labeled
blocks), i.e., adjust the closing triple-backtick for the fenced block that
contains the "Type: Bug..." and "Summary: ..." lines so it also contains
"Description: ..." and "AC: ...".
`scan_prs.py` returns:

- `prs` — list of PRs with their failing check names
- `patterns` — check names that **visibly failed** on more than one PR, each with `failure_count` and the list of PR numbers it failed on
- `rerun_patterns` — *(with `--deep` or `--file`)* check names that **failed then passed on the same commit SHA** on two or more PRs, with `rerun_count` and PR numbers. This surfaces flaky tests that devs routinely re-run — they won't appear in `patterns` because `statusCheckRollup` only shows the final passing state.
- `test_patterns` — *(with `--deep` or `--file`)* **individual test names** that failed on two or more PRs, each with `failure_count`, `pr_numbers`, and distinct `errors` seen. A test appearing here means the *same specific `it()` block* recurred across PRs — much stronger evidence than the same job recurring. With `--deep`: fetches logs for all failing checks not excluded by `_DETERMINISTIC_PREFIXES` across all scanned PRs (the same test can recur across different check matrix variants — this catches it). With `--file`: same log fetching but filtered to a specific file.
- `bots_excluded` — count of bot PRs filtered out
- `all_passing_count` — PRs where everything passed
- `filters` — the resolved `since`/`until`/`limit`/`deep` values actually used
Drift between scan_prs.py output and documented filters shape.
Line 369 lists filters as containing only since/until/limit/deep, but the script (scan_prs.py lines 469–477) also emits limit_hit, bots_excluded, and file_filter. Since the report template on line 398 already relies on filters.limit_hit, update this bullet so consumers don't think those fields are undocumented/unsupported.
```diff
-- `filters` — the resolved `since`/`until`/`limit`/`deep` values actually used
+- `filters` — the resolved `since`/`until`/`limit`/`limit_hit`/`deep`/`bots_excluded`/`file_filter` values actually used
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/gen-ai/.claude/skills/flake-check/SKILL.md` around lines 362 - 370,
The documented `filters` shape in SKILL.md is incomplete; update the bullet
under `scan_prs.py` so it lists the actual keys emitted by scan_prs.py
(including `limit_hit`, `bots_excluded`, and `file_filter` in addition to
`since`/`until`/`limit`/`deep`) so the docs match the script’s output and the
report template (which references `filters.limit_hit`). Locate the `scan_prs.py`
description paragraph and the filters bullet (the section that currently lists
`since`/`until`/`limit`/`deep`) and add the additional keys and brief
descriptions to reflect what the script emits.
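To make the fuller `filters` shape concrete, here is a hedged sketch of a consumer keying off `limit_hit`. The key set follows the review's description of the `scan_prs.py` output; the consumer logic itself is hypothetical.

```python
import json

# Sample payload with the full set of filter keys the script emits,
# per the review: since/until/limit plus limit_hit, bots_excluded,
# deep, and file_filter.
raw = json.dumps({
    "filters": {
        "since": "2024-05-01T00:00:00+00:00",
        "until": None,
        "limit": 50,
        "limit_hit": True,
        "bots_excluded": True,
        "deep": False,
        "file_filter": None,
    }
})

filters = json.loads(raw)["filters"]
if filters["limit_hit"]:
    note = f"Results truncated at {filters['limit']} PRs; narrow --since to see more."
else:
    note = "All matching PRs scanned."
```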
https://redhat.atlassian.net/browse/RHOAIENG-58234
Description
This PR adds an experimental claude skill to help with scanning for flaky tests in our Github PRs, co-authored with claude.
Regarding the location of this skill - the Gen AI team are using
/packages/gen-ai/.claude/skillsfor initial testing of skills, later they may be moved to the root.claudefolder if they are adding value and useful to the wider team.How Has This Been Tested?
Test Impact
N/A
Request review criteria:
Self checklist (all need to be checked):
If you have UI changes:
After the PR is posted & before it merges:
`main`

Summary by CodeRabbit
New Features
Documentation