Commit a738c67 (parent c4d7ad1)

refactor --file logic to share more with --deep, remove some old references to mark-flaky and stats

2 files changed: 115 additions & 44 deletions

File: packages/gen-ai/.claude/skills/flake-check/SKILL.md (42 additions & 24 deletions)
@@ -1,6 +1,6 @@
 ---
 name: flake-check
-description: Investigates CI failures tied to github PRs and highlights suspected flaky tests. Uses live CI data, symptom pattern matching, and cross-PR scanning.
+description: "Github PR flake investigator — /flake-check <PR> [--deep] | scan [Nd] [--deep] [--file <name>]"
 ---
 
 # Flake check — Investigates CI failures on a github odh-dashboard PR and classifies them
@@ -14,7 +14,7 @@ Investigate failing CI checks on a PR and classify each failure using live CI da
 | `/flake-check <PR>` | Investigate a single PR — fetches failing checks, reads logs, classifies each failure |
 | `/flake-check <PR> --deep` | Same as above, plus detects checks that previously failed then passed on the same commit SHA |
 | `/flake-check scan` | Lightweight survey of the last 20 PRs for recurring failure patterns (no log fetching) |
-| `/flake-check scan --deep` | Same as above, plus rerun detection and cross-PR code overlap analysis for each pattern |
+| `/flake-check scan --deep` | Same as above, plus rerun detection, test-level pattern analysis, and file-level overlap for each recurring test |
 | `/flake-check scan <N>d` | Survey PRs from the last N days (e.g. `scan 7d`) |
 | `/flake-check scan --file <filename>` | Find every PR in the window where a specific test file appeared in failures (fetches logs — slower) |
 
@@ -340,48 +340,49 @@ python3 <base_path>/scripts/scan_prs.py --since 7d
 python3 <base_path>/scripts/scan_prs.py --since 14d --until 7d   # 7-14 days ago
 python3 <base_path>/scripts/scan_prs.py --since 2026-04-01 --until 2026-04-15
 
-# Deep mode: also detect reruns and run cross-PR overlap analysis on patterns
+# Deep mode: rerun detection, cross-PR overlap analysis, and test-level patterns across all failing checks
 python3 <base_path>/scripts/scan_prs.py --since 7d --deep
 
 # File-scoped: find every PR where a specific test file appeared in failures (fetches CI logs — slower)
 python3 <base_path>/scripts/scan_prs.py --file pipelineCreateRuns.cy.ts
 python3 <base_path>/scripts/scan_prs.py --since 30d --file pipelineCreateRuns.cy.ts
 ```
 
-When `--file` is passed, `scan_prs.py` fetches CI logs for all failing non-deterministic checks across the scanned PRs (using the same logic as `fetch_test_failures.py`) and returns only PRs where the named file appeared in the failures. Each matching PR gains a `matched_tests` field listing the failing test names and errors from that file. Step 3b (cross-PR overlap) can be skipped — the file filter already provides test-level specificity.
+When `--file` is passed, `scan_prs.py` fetches CI logs for all failing checks not excluded by `_DETERMINISTIC_PREFIXES` across the scanned PRs (using the same logic as `fetch_test_failures.py`) and returns only PRs where the named file appeared in the failures. Each matching PR gains a `matched_tests` field listing the failing test names and errors from that file. Step 3 (test-file overlap) still applies — run it against `test_patterns` as normal.
 
 `scan_prs.py` returns:
 - `prs` — list of PRs with their failing check names
 - `patterns` — check names that **visibly failed** on more than one PR, each with `failure_rate` (failures / appearances) and the list of PR numbers it failed on
-- `rerun_patterns` *(only with `--deep`)* — check names that **failed then passed on the same commit SHA** on two or more PRs, with `rerun_count` and `rerun_rate` (reruns / PRs scanned). This surfaces flaky tests that devs routinely re-run — they won't appear in `patterns` because `statusCheckRollup` only shows the final passing state.
+- `rerun_patterns` *(with `--deep` or `--file`)* — check names that **failed then passed on the same commit SHA** on two or more PRs, with `rerun_count` and `rerun_rate` (reruns / PRs scanned). This surfaces flaky tests that devs routinely re-run — they won't appear in `patterns` because `statusCheckRollup` only shows the final passing state.
+- `test_patterns` *(with `--deep` or `--file`)* — **individual test names** that failed on two or more PRs, each with `failure_count`, `failure_rate`, `pr_numbers`, and distinct `errors` seen. A test appearing here means the *same specific `it()` block* recurred across PRs — much stronger evidence than the same job recurring. With `--deep`: fetches logs for all failing checks not excluded by `_DETERMINISTIC_PREFIXES` across all scanned PRs (the same test can recur across different check matrix variants — this catches it). With `--file`: same log fetching but filtered to a specific file.
 - `bots_excluded` — count of bot PRs filtered out
 - `all_passing_count` — PRs where everything passed
 - `filters` — the resolved `since`/`until`/`limit`/`deep` values actually used
 
 **Note on rates:** both `failure_rate` and `rerun_rate` are relative to the scan window. Always read them alongside the counts (e.g. `3/20 PRs = 15%`) rather than comparing rates across different scans.
 
-### Step 3 — Cross-PR overlap analysis (deep mode only)
+### Step 3 — Test-file overlap analysis (deep mode and file mode)
 
-*Skip this step when `--deep` was not passed.*
+*Skip this step when neither `--deep` nor `--file` was passed, or when `test_patterns` is empty.*
 
-For each check in `patterns`, fetch the changed files for every PR it failed on:
+For each entry in `test_patterns`, fetch the changed files for every PR it appeared on:
 
 ```bash
 gh pr view <pr_number> --json files --jq '[.files[].path]'
 ```
 
-Run this once per PR number across all patterns (deduplicate PR numbers to avoid redundant calls). Then for each pattern, assess whether the changed files across those PRs share a common feature area with the check's test area:
+Run this once per PR number across all test patterns (deduplicate to avoid redundant calls). Then for each test pattern, compare the PRs' changed files against the test's actual file path (use the directory component of `test_patterns[].file` as the feature area — e.g. `cypress/tests/mocked/pipelines/runs/` → pipelines/runs area):
 
-- **No overlap** — the PRs that triggered this failure touched unrelated areas → strong evidence the check is flaky independent of code changes; annotate as `⚠️ Suspected Flaky — strong signal (no PR overlap)`
-- **Partial overlap** — some PRs touched the relevant area, others did not → annotate as `⚠️ Suspected Flaky — mixed overlap`
-- **Full overlap** — all PRs that triggered this failure touched the same feature area — may reflect a persistent real regression rather than flakiness; annotate as `⚠️ Suspected Flaky — overlap detected (possible real regression)`
+- **No overlap** — none of the PRs where this test failed touched the same feature area as the test file → strong evidence the test is flaky independent of code changes
+- **Partial overlap** — some PRs touched the relevant area, others did not → may be flaky or a recurring real regression; note the split
+- **Full overlap** — all PRs touched the same feature area as the test file → the failures may reflect a persistent real regression rather than flakiness
 
 ### Step 4 — Generate the scan report
 
 **Signal assignment logic (in priority order):**
-1. `⚠️ Suspected Flaky` — check name appears in `patterns` (cross-PR recurrence); annotate with overlap verdict when `--deep` was used
+1. `⚠️ Suspected Flaky` — check name appears in `patterns` (cross-PR recurrence at the check level)
 2. `❌ Likely Real` — check name is clearly deterministic (Lint, Type-Check, Build, kustomize) — these don't flake
-3. `❓ Unknown` — check name is ambiguous or is a test runner check with no cross-PR pattern yet; look it up in the check name table in Step 1 of Main Investigation to classify it; note that no pattern at scan level doesn't mean the failure isn't flaky — it means there isn't enough signal yet without fetching logs
+3. `❓ Unknown` — check name is ambiguous or is a test runner check with no cross-PR pattern yet; note that no pattern at scan level doesn't mean the failure isn't flaky — it means there isn't enough signal yet without fetching logs
 
 ```
 ## Recent PR Scan — <since>–<until> | <N> PRs scanned (<bots_excluded> bots excluded)
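The Step 3 classification in the hunk above (derive a feature area from the test file's directory, then bucket PRs into no/partial/full overlap) can be sketched in Python. This is a minimal illustration, not part of the commit: `classify_overlap` and the substring-based area match are assumptions about how one might implement it.

```python
from pathlib import PurePosixPath

def classify_overlap(test_file: str, prs_changed_files: dict[int, list[str]]) -> str:
    """Hypothetical helper: bucket PRs by whether their changed files
    touch the flaky test's feature area (the test file's directory)."""
    area = PurePosixPath(test_file).parent          # e.g. .../pipelines/runs
    key = area.name                                 # crude area key, e.g. "runs"
    touched = [pr for pr, files in prs_changed_files.items()
               if any(key in f for f in files)]
    if not touched:
        return "no overlap"        # strong flaky signal
    if len(touched) < len(prs_changed_files):
        return "partial overlap"   # mixed signal
    return "full overlap"          # possible real regression

print(classify_overlap(
    "cypress/tests/mocked/pipelines/runs/pipelineCreateRuns.cy.ts",
    {1: ["frontend/src/app.ts"], 2: ["docs/readme.md"]},
))  # no overlap
```

The last-path-segment match is deliberately crude; a real pass would compare normalized directory prefixes.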
@@ -393,17 +394,28 @@ Run this once per PR number across all patterns (deduplicate PR numbers to avoid
 
 | PR | Title | Author | Failed Checks | Job Type | Signal |
 |---|---|---|---|---|---|
-| #<n> | <title> | <author> | <check name(s)> | cypress-mock | ⚠️ Suspected Flaky — strong signal (no PR overlap) |
-| #<n> | <title> | <author> | <check name(s)> | cypress-mock | ⚠️ Suspected Flaky — mixed overlap |
+| #<n> | <title> | <author> | <check name(s)> | cypress-mock | ⚠️ Suspected Flaky |
 | #<n> | <title> | <author> | <check name(s)> | lint | ❌ Likely Real |
 | #<n> | <title> | <author> | <check name(s)> | jest | ❓ Unknown |
 
 ### Patterns Observed (visible failures)
 <For each entry in patterns:>
 - "<check_name>" (<job_type>) failed in <N>/<scanned> PRs (<rate>%)
-  - PRs: #<n> (<areas changed>), #<n> (<areas changed>), ...
-  - <If --deep:> Cross-PR overlap: <no overlap | mixed overlap | full overlap — brief explanation>
-  - Classify as: ⚠️ Suspected Flaky <with overlap annotation if --deep>
+  - PRs: #<n>, #<n>, ...
+  - Classify as: ⚠️ Suspected Flaky
+
+### Test-Level Patterns — only present with --deep or --file (when test_patterns is non-empty)
+<For each entry in test_patterns, incorporating overlap verdict from Step 3:>
+- **"<test_name>"** (`<file>`) failed on <N>/<scanned> PRs (<rate>%)
+  - PRs: #<n>, #<n>, ...
+  - Error(s): `<errors[0]>` <and any additional distinct errors>
+  - **Overlap:** <one of:>
+    - *No overlap* — none of the PRs touched the test's feature area (`<dir>`) → strong flaky signal; consider opening a Jira task to track it
+    - *Partial overlap* — some PRs touched `<dir>`, others did not → may be flaky or a recurring regression; investigate with `/flake-check <number>`
+    - *Full overlap* — all PRs touched `<dir>` → possible persistent regression rather than flakiness; investigate before dismissing
+
+<If --deep was used but test_patterns is empty or absent:>
+No individual test recurred across multiple PRs — different tests failed within the same jobs each time. Try a wider window or `/flake-check scan --file <filename>` for a targeted search.
 
 ### Rerun Patterns (hidden failures) — only present with --deep
 <For each entry in rerun_patterns:>
@@ -427,12 +439,18 @@ Use this format instead of the standard report when `filters.file_filter` is non
 | #<n> | <title> | <author> | <matched_tests[].name> | <check name> | ⚠️ Suspected Flaky |
 | #<n> | <title> | <author> | <matched_tests[].name> | <check name> | ❌ Likely Real |
 
-### Pattern Summary
-- "<file_filter>" appeared in failures on <N>/<scanned> PRs (<rate>%) via check "<check_name>"
+### Recurring Tests (same test on multiple PRs)
+<For each entry in test_patterns, run the file-level overlap check from Step 3 using test_patterns[].file:>
+- **"<test_name>"** (`<file>`) failed on <N>/<scanned> PRs (<rate>%)
   - PRs: #<n> (author: <author>), #<n> (author: <author>), ...
-  - <If N >= 3 PRs and authors differ:> Strong cross-PR recurrence across unrelated changes — high flaky signal; say "create a Jira" and I'll file a Task with label `flaky-test`
-  - <If N == 1:> Single occurrence — insufficient data to classify; investigate before assuming flaky
-  - <If N >= 2:> Consider asking me to create a Jira Task with label `flaky-test` if confirmed flaky
+  - Error(s): `<errors[0]>` <and any additional distinct errors>
+  - **Overlap:** <one of:>
+    - *No overlap* — none of the PRs touched the test's feature area → strong flaky signal; consider opening a Jira task to track it
+    - *Partial overlap* — some PRs touched the relevant area, others did not → investigate with `/flake-check <number>`
+    - *Full overlap* — all PRs touched the same area → possible persistent regression; investigate before dismissing
+
+<If test_patterns is empty:>
+No individual test recurred across multiple PRs — each PR had a unique failure within this file. The file may still be flaky (different tests flaking each time), or each failure may be real. Run `/flake-check <number>` on individual PRs to investigate.
 
 ### No matches found
 <Only present when with_failures == 0:>

File: packages/gen-ai/.claude/skills/flake-check/scripts/scan_prs.py (73 additions & 20 deletions)
@@ -16,16 +16,25 @@
 large matrix job suites.
 
 --deep additionally detects rerun patterns: checks that failed then passed on
-the same commit SHA (without new code being pushed). This analysis runs on the
-same per-PR check-runs data already fetched, so has no extra API cost.
+the same commit SHA (without new code being pushed). It also fetches CI logs
+for all failing checks not excluded by _DETERMINISTIC_PREFIXES to surface
+test-level recurrence (test_patterns). Both analyses run on already-fetched
+data or share the same log-fetching pass, so the extra cost is only the log
+fetches themselves.
+
+--file implies --deep: it runs the same rerun detection, log fetching (for
+checks not excluded by _DETERMINISTIC_PREFIXES), and test-level pattern
+analysis, then applies the filename as a post-step filter so only PRs where
+that file appeared in failures are returned.
 
 Usage:
     python3 scan_prs.py                              # last 20 PRs
     python3 scan_prs.py --since 7d                   # PRs created in the last 7 days
     python3 scan_prs.py --since 14d --until 7d       # PRs from 7-14 days ago
     python3 scan_prs.py --since 2026-04-01 --until 2026-04-15
     python3 scan_prs.py --limit 30
-    python3 scan_prs.py --since 7d --deep            # also detect rerun patterns
+    python3 scan_prs.py --since 7d --deep            # rerun detection + test-level patterns
+    python3 scan_prs.py --file pipelineRuns.cy.ts    # same as --deep, filtered by file
 
 Output JSON shape:
 {
@@ -56,13 +65,23 @@
       "pr_numbers": [7190, 7191, 7194]
     }
   ],
-  "rerun_patterns": [    # only present with --deep
+  "rerun_patterns": [    # present with --deep or --file (--file implies --deep)
     {
      "check_name": "Cypress-Mock-Tests (mcpCatalog, ...)",
      "rerun_count": 4,
      "rerun_rate": 0.20,
      "pr_numbers": [7190, 7191, 7194, 7200]
    }
+  ],
+  "test_patterns": [     # present with --deep or --file; tests that failed on 2+ PRs
+    {
+      "test_name": "User can create a pipeline run",
+      "file": "packages/cypress/cypress/tests/mocked/pipelines/pipelineCreateRuns.cy.ts",
+      "failure_count": 3,
+      "failure_rate": 0.15,
+      "pr_numbers": [7200, 7300, 7350],
+      "errors": ["Timed out retrying after 4000ms"]
+    }
   ]
 }
 """
@@ -268,6 +287,7 @@ def main() -> None:
     parser.add_argument("--file", default=None, metavar="FILENAME",
                         help="Filter to PRs where this filename appeared in failing test logs (fetches CI logs — slower)")
     args = parser.parse_args()
+    run_deep = args.deep or bool(args.file)
 
     repo = args.repo or detect_repo()
     if not repo:
@@ -325,7 +345,7 @@
     with ThreadPoolExecutor(max_workers=10) as executor:
         futures = {
             executor.submit(
-                fetch_pr_check_summary, repo, pr["headRefOid"], args.deep
+                fetch_pr_check_summary, repo, pr["headRefOid"], run_deep
             ): pr["number"]
             for pr in prs_raw
             if pr.get("headRefOid")
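The futures-to-key mapping used in the hunk above (submit one task per item, then recover which item each completed future belongs to) is a standard `concurrent.futures` pattern. A toy illustration with a stand-in task function:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def square(n: int) -> int:
    # Stand-in for fetch_pr_check_summary: any per-item task.
    return n * n

with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to the item it was submitted for.
    futures = {executor.submit(square, n): n for n in [2, 3, 5]}
    results = {futures[f]: f.result() for f in as_completed(futures)}

print(results == {2: 4, 3: 9, 5: 25})  # True
```

`as_completed` yields futures in completion order, so the dict lookup is what ties each result back to its PR (here, its number).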
@@ -362,31 +382,33 @@ def main() -> None:
     # Preserve original scan count before any file-level filtering
     scanned_count = len(pr_results)
 
-    # --file: fetch test-level logs for non-deterministic failing checks and filter by filename
-    if args.file:
+    # Fetch test-level logs for all failing checks not excluded by _DETERMINISTIC_PREFIXES
+    # when --deep or --file is set. A single shared fetch covers both flags.
+    all_pr_tests: dict[int, list[dict]] = defaultdict(list)
+    if run_deep:
         fetch_tasks = [
             (pr["number"], check["run_id"], check.get("job_id"))
             for pr in pr_results
             for check in pr["failing_checks"]
             if _is_test_runner(check["name"]) and check.get("run_id")
         ]
-
-        pr_matched_tests: dict[int, list[dict]] = defaultdict(list)
         with ThreadPoolExecutor(max_workers=8) as executor:
-            futures = {
+            futures_map = {
                 executor.submit(_fetch_tests_for_check, run_id, job_id): pr_number
                 for pr_number, run_id, job_id in fetch_tasks
             }
-            for future in as_completed(futures):
-                pr_number = futures[future]
-                for t in future.result():
-                    if args.file.lower() in (t.get("file") or "").lower():
-                        pr_matched_tests[pr_number].append(t)
+            for future in as_completed(futures_map):
+                all_pr_tests[futures_map[future]].extend(future.result())
 
-        # Keep only PRs where the target file was found failing; rebuild failure_index
+    # --file: filter pr_results to only PRs where the target file appeared in failures;
+    # annotate each with matched_tests and rebuild failure_index over the reduced set.
+    if args.file:
         filtered: list[dict] = []
         for pr in pr_results:
-            matched = pr_matched_tests.get(pr["number"])
+            matched = [
+                t for t in all_pr_tests.get(pr["number"], [])
+                if args.file.lower() in (t.get("file") or "").lower()
+            ]
             if matched:
                 pr["matched_tests"] = matched
                 filtered.append(pr)
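The case-insensitive filename match in the `--file` branch above can be checked in isolation. A small sketch with made-up test records (note the `or ""` guard so a record with a missing `file` field does not crash):

```python
tests = [
    {"file": "cypress/tests/mocked/pipelines/pipelineCreateRuns.cy.ts", "name": "t1"},
    {"file": "cypress/tests/mocked/modelServing/servingRuntimes.cy.ts", "name": "t2"},
    {"file": None, "name": "t3"},  # missing file field must not crash the filter
]
target = "PipelineCreateRuns.cy.ts"  # user-supplied casing differs from the log's

# Same predicate as the --file filter: case-insensitive substring match.
matched = [t for t in tests if target.lower() in (t.get("file") or "").lower()]
print([t["name"] for t in matched])  # ['t1']
```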
@@ -428,9 +450,9 @@ def main() -> None:
         "patterns": patterns,
     }
 
-    if args.deep:
+    if run_deep:
         total_prs = len(pr_results)
-        rerun_patterns = [
+        output["rerun_patterns"] = [
             {
                 "check_name": name,
                 "rerun_count": len(pr_nums),
@@ -440,7 +462,38 @@ def main() -> None:
             for name, pr_nums in sorted(rerun_index.items(), key=lambda kv: -len(kv[1]))
             if len(pr_nums) >= 2
         ]
-        output["rerun_patterns"] = rerun_patterns
+
+        # Build test-level patterns from the shared log fetch.
+        # --file mode: use only the file-matched tests stored in pr["matched_tests"].
+        # --deep mode: use all fetched tests across all PRs.
+        test_index: dict[str, dict] = {}
+        for pr in pr_results:
+            tests = pr.get("matched_tests", []) if args.file else all_pr_tests.get(pr["number"], [])
+            for t in tests:
+                name = t.get("name", "")
+                if not name:
+                    continue
+                if name not in test_index:
+                    test_index[name] = {"file": t.get("file", ""), "pr_numbers": [], "errors": []}
+                entry = test_index[name]
+                if pr["number"] not in entry["pr_numbers"]:
+                    entry["pr_numbers"].append(pr["number"])
+                err = t.get("error", "")
+                if err and err not in entry["errors"]:
+                    entry["errors"].append(err)
+
+        output["test_patterns"] = [
+            {
+                "test_name": name,
+                "file": data["file"],
+                "failure_count": len(data["pr_numbers"]),
+                "failure_rate": round(len(data["pr_numbers"]) / scanned_count, 2) if scanned_count else 1.0,
+                "pr_numbers": sorted(data["pr_numbers"], reverse=True),
+                "errors": data["errors"],
+            }
+            for name, data in sorted(test_index.items(), key=lambda kv: -len(kv[1]["pr_numbers"]))
+            if len(data["pr_numbers"]) >= 2
+        ]
 
     print(json.dumps(output, indent=2))
 
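The aggregation added in the last hunk can be exercised standalone with toy data. This sketch mirrors the commit's logic (dedupe PR numbers per test, collect distinct errors, keep only tests seen on 2+ PRs); the test names and PR numbers are invented for illustration:

```python
# Toy per-PR test failures, keyed by PR number (invented data).
pr_tests = {
    7200: [{"name": "t1", "file": "a.cy.ts", "error": "timeout"}],
    7300: [{"name": "t1", "file": "a.cy.ts", "error": "timeout"},
           {"name": "t2", "file": "b.cy.ts", "error": ""}],
}

test_index: dict[str, dict] = {}
for pr, tests in pr_tests.items():
    for t in tests:
        entry = test_index.setdefault(
            t["name"], {"file": t["file"], "pr_numbers": [], "errors": []}
        )
        if pr not in entry["pr_numbers"]:
            entry["pr_numbers"].append(pr)          # dedupe PRs per test
        if t["error"] and t["error"] not in entry["errors"]:
            entry["errors"].append(t["error"])      # distinct errors only

patterns = [
    {"test_name": n, "failure_count": len(d["pr_numbers"]),
     "pr_numbers": sorted(d["pr_numbers"], reverse=True)}
    for n, d in test_index.items()
    if len(d["pr_numbers"]) >= 2                    # must recur across PRs
]
print(patterns)  # t1 recurred on 2 PRs; t2 appeared once and is dropped
```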