Commit a738c67 (parent c4d7ad1)

refactor --file logic to share more with --deep, remove some old references to mark-flaky and stats

2 files changed: 115 additions & 44 deletions

File: packages/gen-ai/.claude/skills/flake-check/SKILL.md (42 additions & 24 deletions)
@@ -1,6 +1,6 @@
 ---
 name: flake-check
-description: Investigates CI failures tied to github PRs and highlights suspected flaky tests. Uses live CI data, symptom pattern matching, and cross-PR scanning.
+description: "Github PR flake investigator — /flake-check <PR> [--deep] | scan [Nd] [--deep] [--file <name>]"
 ---
 
 # Flake check — Investigates CI failures on a github odh-dashboard PR and classifies them
@@ -14,7 +14,7 @@ Investigate failing CI checks on a PR and classify each failure using live CI da
 | `/flake-check <PR>` | Investigate a single PR — fetches failing checks, reads logs, classifies each failure |
 | `/flake-check <PR> --deep` | Same as above, plus detects checks that previously failed then passed on the same commit SHA |
 | `/flake-check scan` | Lightweight survey of the last 20 PRs for recurring failure patterns (no log fetching) |
-| `/flake-check scan --deep` | Same as above, plus rerun detection and cross-PR code overlap analysis for each pattern |
+| `/flake-check scan --deep` | Same as above, plus rerun detection, test-level pattern analysis, and file-level overlap for each recurring test |
 | `/flake-check scan <N>d` | Survey PRs from the last N days (e.g. `scan 7d`) |
 | `/flake-check scan --file <filename>` | Find every PR in the window where a specific test file appeared in failures (fetches logs — slower) |
 
@@ -340,48 +340,49 @@ python3 <base_path>/scripts/scan_prs.py --since 7d
 python3 <base_path>/scripts/scan_prs.py --since 14d --until 7d   # 7-14 days ago
 python3 <base_path>/scripts/scan_prs.py --since 2026-04-01 --until 2026-04-15
 
-# Deep mode: also detect reruns and run cross-PR overlap analysis on patterns
+# Deep mode: rerun detection, cross-PR overlap analysis, and test-level patterns across all failing checks
 python3 <base_path>/scripts/scan_prs.py --since 7d --deep
 
 # File-scoped: find every PR where a specific test file appeared in failures (fetches CI logs — slower)
 python3 <base_path>/scripts/scan_prs.py --file pipelineCreateRuns.cy.ts
 python3 <base_path>/scripts/scan_prs.py --since 30d --file pipelineCreateRuns.cy.ts
 ```
 
-When `--file` is passed, `scan_prs.py` fetches CI logs for all failing non-deterministic checks across the scanned PRs (using the same logic as `fetch_test_failures.py`) and returns only PRs where the named file appeared in the failures. Each matching PR gains a `matched_tests` field listing the failing test names and errors from that file. Step 3b (cross-PR overlap) can be skipped — the file filter already provides test-level specificity.
+When `--file` is passed, `scan_prs.py` fetches CI logs for all failing checks not excluded by `_DETERMINISTIC_PREFIXES` across the scanned PRs (using the same logic as `fetch_test_failures.py`) and returns only PRs where the named file appeared in the failures. Each matching PR gains a `matched_tests` field listing the failing test names and errors from that file. Step 3 (test-file overlap) still applies — run it against `test_patterns` as normal.
 
 `scan_prs.py` returns:
 - `prs` — list of PRs with their failing check names
 - `patterns` — check names that **visibly failed** on more than one PR, each with `failure_rate` (failures / appearances) and the list of PR numbers it failed on
-- `rerun_patterns` *(only with `--deep`)* — check names that **failed then passed on the same commit SHA** on two or more PRs, with `rerun_count` and `rerun_rate` (reruns / PRs scanned). This surfaces flaky tests that devs routinely re-run — they won't appear in `patterns` because `statusCheckRollup` only shows the final passing state.
+- `rerun_patterns` *(with `--deep` or `--file`)* — check names that **failed then passed on the same commit SHA** on two or more PRs, with `rerun_count` and `rerun_rate` (reruns / PRs scanned). This surfaces flaky tests that devs routinely re-run — they won't appear in `patterns` because `statusCheckRollup` only shows the final passing state.
+- `test_patterns` *(with `--deep` or `--file`)* — **individual test names** that failed on two or more PRs, each with `failure_count`, `failure_rate`, `pr_numbers`, and distinct `errors` seen. A test appearing here means the *same specific `it()` block* recurred across PRs — much stronger evidence than the same job recurring. With `--deep`: fetches logs for all failing checks not excluded by `_DETERMINISTIC_PREFIXES` across all scanned PRs (the same test can recur across different check matrix variants — this catches it). With `--file`: same log fetching but filtered to a specific file.
 - `bots_excluded` — count of bot PRs filtered out
 - `all_passing_count` — PRs where everything passed
 - `filters` — the resolved `since`/`until`/`limit`/`deep` values actually used
 
 **Note on rates:** both `failure_rate` and `rerun_rate` are relative to the scan window. Always read them alongside the counts (e.g. `3/20 PRs = 15%`) rather than comparing rates across different scans.
 
-### Step 3 — Cross-PR overlap analysis (deep mode only)
+### Step 3 — Test-file overlap analysis (deep mode and file mode)
 
-*Skip this step when `--deep` was not passed.*
+*Skip this step when neither `--deep` nor `--file` was passed, or when `test_patterns` is empty.*
 
-For each check in `patterns`, fetch the changed files for every PR it failed on:
+For each entry in `test_patterns`, fetch the changed files for every PR it appeared on:
 
 ```bash
 gh pr view <pr_number> --json files --jq '[.files[].path]'
 ```
 
-Run this once per PR number across all patterns (deduplicate PR numbers to avoid redundant calls). Then for each pattern, assess whether the changed files across those PRs share a common feature area with the check's test area:
+Run this once per PR number across all test patterns (deduplicate to avoid redundant calls). Then for each test pattern, compare the PRs' changed files against the test's actual file path (use the directory component of `test_patterns[].file` as the feature area — e.g. `cypress/tests/mocked/pipelines/runs/` → pipelines/runs area):
 
-- **No overlap** — the PRs that triggered this failure touched unrelated areas → strong evidence the check is flaky independent of code changes; annotate as `⚠️ Suspected Flaky — strong signal (no PR overlap)`
-- **Partial overlap** — some PRs touched the relevant area, others did not → annotate as `⚠️ Suspected Flaky — mixed overlap`
-- **Full overlap** — all PRs that triggered this failure touched the same feature area — may reflect a persistent real regression rather than flakiness; annotate as `⚠️ Suspected Flaky — overlap detected (possible real regression)`
+- **No overlap** — none of the PRs where this test failed touched the same feature area as the test file → strong evidence the test is flaky independent of code changes
+- **Partial overlap** — some PRs touched the relevant area, others did not → may be flaky or a recurring real regression; note the split
+- **Full overlap** — all PRs touched the same feature area as the test file → the failures may reflect a persistent real regression rather than flakiness
 
 ### Step 4 — Generate the scan report
 
 **Signal assignment logic (in priority order):**
-1. `⚠️ Suspected Flaky` — check name appears in `patterns` (cross-PR recurrence); annotate with overlap verdict when `--deep` was used
+1. `⚠️ Suspected Flaky` — check name appears in `patterns` (cross-PR recurrence at the check level)
 2. `❌ Likely Real` — check name is clearly deterministic (Lint, Type-Check, Build, kustomize) — these don't flake
-3. `❓ Unknown` — check name is ambiguous or is a test runner check with no cross-PR pattern yet; look it up in the check name table in Step 1 of Main Investigation to classify it; note that no pattern at scan level doesn't mean the failure isn't flaky — it means there isn't enough signal yet without fetching logs
+3. `❓ Unknown` — check name is ambiguous or is a test runner check with no cross-PR pattern yet; note that no pattern at scan level doesn't mean the failure isn't flaky — it means there isn't enough signal yet without fetching logs
 
 ```
 ## Recent PR Scan — <since>–<until> | <N> PRs scanned (<bots_excluded> bots excluded)
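The Step 3 classification in the hunk above (derive a feature area from the test file's directory, then bucket PRs into no/partial/full overlap) can be sketched in Python. This is a minimal illustration, not part of the commit: `classify_overlap` and the substring-based area match are assumptions about how one might implement it.

```python
from pathlib import PurePosixPath

def classify_overlap(test_file: str, prs_changed_files: dict[int, list[str]]) -> str:
    """Hypothetical helper: bucket PRs by whether their changed files
    touch the flaky test's feature area (the test file's directory)."""
    area = PurePosixPath(test_file).parent          # e.g. .../pipelines/runs
    key = area.name                                 # crude area key, e.g. "runs"
    touched = [pr for pr, files in prs_changed_files.items()
               if any(key in f for f in files)]
    if not touched:
        return "no overlap"        # strong flaky signal
    if len(touched) < len(prs_changed_files):
        return "partial overlap"   # mixed signal
    return "full overlap"          # possible real regression

print(classify_overlap(
    "cypress/tests/mocked/pipelines/runs/pipelineCreateRuns.cy.ts",
    {1: ["frontend/src/app.ts"], 2: ["docs/readme.md"]},
))  # no overlap
```

The last-path-segment match is deliberately crude; a real pass would compare normalized directory prefixes.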
@@ -393,17 +394,28 @@ Run this once per PR number across all patterns (deduplicate PR numbers to avoid
 
 | PR | Title | Author | Failed Checks | Job Type | Signal |
 |---|---|---|---|---|---|
-| #<n> | <title> | <author> | <check name(s)> | cypress-mock | ⚠️ Suspected Flaky — strong signal (no PR overlap) |
-| #<n> | <title> | <author> | <check name(s)> | cypress-mock | ⚠️ Suspected Flaky — mixed overlap |
+| #<n> | <title> | <author> | <check name(s)> | cypress-mock | ⚠️ Suspected Flaky |
 | #<n> | <title> | <author> | <check name(s)> | lint | ❌ Likely Real |
 | #<n> | <title> | <author> | <check name(s)> | jest | ❓ Unknown |
 
 ### Patterns Observed (visible failures)
 <For each entry in patterns:>
 - "<check_name>" (<job_type>) failed in <N>/<scanned> PRs (<rate>%)
-  - PRs: #<n> (<areas changed>), #<n> (<areas changed>), ...
-  - <If --deep:> Cross-PR overlap: <no overlap | mixed overlap | full overlap — brief explanation>
-  - Classify as: ⚠️ Suspected Flaky <with overlap annotation if --deep>
+  - PRs: #<n>, #<n>, ...
+  - Classify as: ⚠️ Suspected Flaky
+
+### Test-Level Patterns — only present with --deep or --file (when test_patterns is non-empty)
+<For each entry in test_patterns, incorporating overlap verdict from Step 3:>
+- **"<test_name>"** (`<file>`) failed on <N>/<scanned> PRs (<rate>%)
+  - PRs: #<n>, #<n>, ...
+  - Error(s): `<errors[0]>` <and any additional distinct errors>
+  - **Overlap:** <one of:>
+    - *No overlap* — none of the PRs touched the test's feature area (`<dir>`) → strong flaky signal; consider opening a Jira task to track it
+    - *Partial overlap* — some PRs touched `<dir>`, others did not → may be flaky or a recurring regression; investigate with `/flake-check <number>`
+    - *Full overlap* — all PRs touched `<dir>` → possible persistent regression rather than flakiness; investigate before dismissing
+
+<If --deep was used but test_patterns is empty or absent:>
+No individual test recurred across multiple PRs — different tests failed within the same jobs each time. Try a wider window or `/flake-check scan --file <filename>` for a targeted search.
 
 ### Rerun Patterns (hidden failures) — only present with --deep
 <For each entry in rerun_patterns:>
@@ -427,12 +439,18 @@ Use this format instead of the standard report when `filters.file_filter` is non
 | #<n> | <title> | <author> | <matched_tests[].name> | <check name> | ⚠️ Suspected Flaky |
 | #<n> | <title> | <author> | <matched_tests[].name> | <check name> | ❌ Likely Real |
 
-### Pattern Summary
-- "<file_filter>" appeared in failures on <N>/<scanned> PRs (<rate>%) via check "<check_name>"
+### Recurring Tests (same test on multiple PRs)
+<For each entry in test_patterns, run the file-level overlap check from Step 3 using test_patterns[].file:>
+- **"<test_name>"** (`<file>`) failed on <N>/<scanned> PRs (<rate>%)
   - PRs: #<n> (author: <author>), #<n> (author: <author>), ...
-  - <If N >= 3 PRs and authors differ:> Strong cross-PR recurrence across unrelated changes — high flaky signal; say "create a Jira" and I'll file a Task with label `flaky-test`
-  - <If N == 1:> Single occurrence — insufficient data to classify; investigate before assuming flaky
-  - <If N >= 2:> Consider asking me to create a Jira Task with label `flaky-test` if confirmed flaky
+  - Error(s): `<errors[0]>` <and any additional distinct errors>
+  - **Overlap:** <one of:>
+    - *No overlap* — none of the PRs touched the test's feature area → strong flaky signal; consider opening a Jira task to track it
+    - *Partial overlap* — some PRs touched the relevant area, others did not → investigate with `/flake-check <number>`
+    - *Full overlap* — all PRs touched the same area → possible persistent regression; investigate before dismissing
+
+<If test_patterns is empty:>
+No individual test recurred across multiple PRs — each PR had a unique failure within this file. The file may still be flaky (different tests flaking each time), or each failure may be real. Run `/flake-check <number>` on individual PRs to investigate.
 
 ### No matches found
 <Only present when with_failures == 0:>

File: packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py (73 additions & 20 deletions)
@@ -16,16 +16,25 @@
 large matrix job suites.
 
 --deep additionally detects rerun patterns: checks that failed then passed on
-the same commit SHA (without new code being pushed). This analysis runs on the
-same per-PR check-runs data already fetched, so has no extra API cost.
+the same commit SHA (without new code being pushed). It also fetches CI logs
+for all failing checks not excluded by _DETERMINISTIC_PREFIXES to surface
+test-level recurrence (test_patterns). Both analyses run on already-fetched
+data or share the same log-fetching pass, so the extra cost is only the log
+fetches themselves.
+
+--file implies --deep: it runs the same rerun detection, log fetching (for
+checks not excluded by _DETERMINISTIC_PREFIXES), and test-level pattern
+analysis, then applies the filename as a post-step filter so only PRs where
+that file appeared in failures are returned.
 
 Usage:
     python3 scan_prs.py                              # last 20 PRs
     python3 scan_prs.py --since 7d                   # PRs created in the last 7 days
     python3 scan_prs.py --since 14d --until 7d       # PRs from 7-14 days ago
     python3 scan_prs.py --since 2026-04-01 --until 2026-04-15
     python3 scan_prs.py --limit 30
-    python3 scan_prs.py --since 7d --deep            # also detect rerun patterns
+    python3 scan_prs.py --since 7d --deep            # rerun detection + test-level patterns
+    python3 scan_prs.py --file pipelineRuns.cy.ts    # same as --deep, filtered by file
 
 Output JSON shape:
 {
@@ -56,13 +65,23 @@
       "pr_numbers": [7190, 7191, 7194]
     }
   ],
-  "rerun_patterns": [    # only present with --deep
+  "rerun_patterns": [    # present with --deep or --file (--file implies --deep)
     {
      "check_name": "Cypress-Mock-Tests (mcpCatalog, ...)",
      "rerun_count": 4,
      "rerun_rate": 0.20,
      "pr_numbers": [7190, 7191, 7194, 7200]
    }
+  ],
+  "test_patterns": [     # present with --deep or --file; tests that failed on 2+ PRs
+    {
+      "test_name": "User can create a pipeline run",
+      "file": "packages/cypress/cypress/tests/mocked/pipelines/pipelineCreateRuns.cy.ts",
+      "failure_count": 3,
+      "failure_rate": 0.15,
+      "pr_numbers": [7200, 7300, 7350],
+      "errors": ["Timed out retrying after 4000ms"]
+    }
   ]
 }
 """
@@ -268,6 +287,7 @@ def main() -> None:
     parser.add_argument("--file", default=None, metavar="FILENAME",
                         help="Filter to PRs where this filename appeared in failing test logs (fetches CI logs — slower)")
     args = parser.parse_args()
+    run_deep = args.deep or bool(args.file)
 
     repo = args.repo or detect_repo()
     if not repo:
@@ -325,7 +345,7 @@
     with ThreadPoolExecutor(max_workers=10) as executor:
         futures = {
             executor.submit(
-                fetch_pr_check_summary, repo, pr["headRefOid"], args.deep
+                fetch_pr_check_summary, repo, pr["headRefOid"], run_deep
             ): pr["number"]
             for pr in prs_raw
             if pr.get("headRefOid")
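The futures-to-key mapping used in the hunk above (submit one task per item, then recover which item each completed future belongs to) is a standard `concurrent.futures` pattern. A toy illustration with a stand-in task function:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def square(n: int) -> int:
    # Stand-in for fetch_pr_check_summary: any per-item task.
    return n * n

with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to the item it was submitted for.
    futures = {executor.submit(square, n): n for n in [2, 3, 5]}
    results = {futures[f]: f.result() for f in as_completed(futures)}

print(results == {2: 4, 3: 9, 5: 25})  # True
```

`as_completed` yields futures in completion order, so the dict lookup is what ties each result back to its PR (here, its number).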
@@ -362,31 +382,33 @@ def main() -> None:
     # Preserve original scan count before any file-level filtering
     scanned_count = len(pr_results)
 
-    # --file: fetch test-level logs for non-deterministic failing checks and filter by filename
-    if args.file:
+    # Fetch test-level logs for all failing checks not excluded by _DETERMINISTIC_PREFIXES
+    # when --deep or --file is set. A single shared fetch covers both flags.
+    all_pr_tests: dict[int, list[dict]] = defaultdict(list)
+    if run_deep:
         fetch_tasks = [
             (pr["number"], check["run_id"], check.get("job_id"))
             for pr in pr_results
             for check in pr["failing_checks"]
             if _is_test_runner(check["name"]) and check.get("run_id")
         ]
-
-        pr_matched_tests: dict[int, list[dict]] = defaultdict(list)
         with ThreadPoolExecutor(max_workers=8) as executor:
-            futures = {
+            futures_map = {
                 executor.submit(_fetch_tests_for_check, run_id, job_id): pr_number
                 for pr_number, run_id, job_id in fetch_tasks
             }
-            for future in as_completed(futures):
-                pr_number = futures[future]
-                for t in future.result():
-                    if args.file.lower() in (t.get("file") or "").lower():
-                        pr_matched_tests[pr_number].append(t)
+            for future in as_completed(futures_map):
+                all_pr_tests[futures_map[future]].extend(future.result())
 
-        # Keep only PRs where the target file was found failing; rebuild failure_index
+    # --file: filter pr_results to only PRs where the target file appeared in failures;
+    # annotate each with matched_tests and rebuild failure_index over the reduced set.
+    if args.file:
         filtered: list[dict] = []
         for pr in pr_results:
-            matched = pr_matched_tests.get(pr["number"])
+            matched = [
+                t for t in all_pr_tests.get(pr["number"], [])
+                if args.file.lower() in (t.get("file") or "").lower()
+            ]
             if matched:
                 pr["matched_tests"] = matched
                 filtered.append(pr)
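The case-insensitive filename match in the `--file` branch above can be checked in isolation. A small sketch with made-up test records (note the `or ""` guard so a record with a missing `file` field does not crash):

```python
tests = [
    {"file": "cypress/tests/mocked/pipelines/pipelineCreateRuns.cy.ts", "name": "t1"},
    {"file": "cypress/tests/mocked/modelServing/servingRuntimes.cy.ts", "name": "t2"},
    {"file": None, "name": "t3"},  # missing file field must not crash the filter
]
target = "PipelineCreateRuns.cy.ts"  # user-supplied casing differs from the log's

# Same predicate as the --file filter: case-insensitive substring match.
matched = [t for t in tests if target.lower() in (t.get("file") or "").lower()]
print([t["name"] for t in matched])  # ['t1']
```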
@@ -428,9 +450,9 @@ def main() -> None:
         "patterns": patterns,
     }
 
-    if args.deep:
+    if run_deep:
         total_prs = len(pr_results)
-        rerun_patterns = [
+        output["rerun_patterns"] = [
             {
                 "check_name": name,
                 "rerun_count": len(pr_nums),
@@ -440,7 +462,38 @@ def main() -> None:
             for name, pr_nums in sorted(rerun_index.items(), key=lambda kv: -len(kv[1]))
             if len(pr_nums) >= 2
         ]
-        output["rerun_patterns"] = rerun_patterns
+
+        # Build test-level patterns from the shared log fetch.
+        # --file mode: use only the file-matched tests stored in pr["matched_tests"].
+        # --deep mode: use all fetched tests across all PRs.
+        test_index: dict[str, dict] = {}
+        for pr in pr_results:
+            tests = pr.get("matched_tests", []) if args.file else all_pr_tests.get(pr["number"], [])
+            for t in tests:
+                name = t.get("name", "")
+                if not name:
+                    continue
+                if name not in test_index:
+                    test_index[name] = {"file": t.get("file", ""), "pr_numbers": [], "errors": []}
+                entry = test_index[name]
+                if pr["number"] not in entry["pr_numbers"]:
+                    entry["pr_numbers"].append(pr["number"])
+                err = t.get("error", "")
+                if err and err not in entry["errors"]:
+                    entry["errors"].append(err)
+
+        output["test_patterns"] = [
+            {
+                "test_name": name,
+                "file": data["file"],
+                "failure_count": len(data["pr_numbers"]),
+                "failure_rate": round(len(data["pr_numbers"]) / scanned_count, 2) if scanned_count else 1.0,
+                "pr_numbers": sorted(data["pr_numbers"], reverse=True),
+                "errors": data["errors"],
+            }
+            for name, data in sorted(test_index.items(), key=lambda kv: -len(kv[1]["pr_numbers"]))
+            if len(data["pr_numbers"]) >= 2
+        ]
 
     print(json.dumps(output, indent=2))
 
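The aggregation added in the last hunk can be exercised standalone with toy data. This sketch mirrors the commit's logic (dedupe PR numbers per test, collect distinct errors, keep only tests seen on 2+ PRs); the test names and PR numbers are invented for illustration:

```python
# Toy per-PR test failures, keyed by PR number (invented data).
pr_tests = {
    7200: [{"name": "t1", "file": "a.cy.ts", "error": "timeout"}],
    7300: [{"name": "t1", "file": "a.cy.ts", "error": "timeout"},
           {"name": "t2", "file": "b.cy.ts", "error": ""}],
}

test_index: dict[str, dict] = {}
for pr, tests in pr_tests.items():
    for t in tests:
        entry = test_index.setdefault(
            t["name"], {"file": t["file"], "pr_numbers": [], "errors": []}
        )
        if pr not in entry["pr_numbers"]:
            entry["pr_numbers"].append(pr)          # dedupe PRs per test
        if t["error"] and t["error"] not in entry["errors"]:
            entry["errors"].append(t["error"])      # distinct errors only

patterns = [
    {"test_name": n, "failure_count": len(d["pr_numbers"]),
     "pr_numbers": sorted(d["pr_numbers"], reverse=True)}
    for n, d in test_index.items()
    if len(d["pr_numbers"]) >= 2                    # must recur across PRs
]
print(patterns)  # t1 recurred on 2 PRs; t2 appeared once and is dropped
```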