`packages/gen-ai/.claude/skills/flake-check/SKILL.md`

---
name: flake-check
description: Investigates CI failures tied to GitHub PRs and highlights suspected flaky tests. Uses live CI data, symptom pattern matching, and cross-PR scanning.
---
When `--file` is passed, `scan_prs.py` fetches CI logs for all failing checks not excluded by `_DETERMINISTIC_PREFIXES` across the scanned PRs (using the same logic as `fetch_test_failures.py`) and returns only PRs where the named file appeared in the failures. Each matching PR gains a `matched_tests` field listing the failing test names and errors from that file. Step 3 (test-file overlap) still applies — run it against `test_patterns` as normal.
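As a rough illustration of the filtering behavior described above, here is a minimal sketch. The function and field names are hypothetical, not the actual `scan_prs.py` internals:

```python
# Hypothetical sketch of the --file filter. Assumes each PR record carries
# a "failures" list parsed from CI logs (as fetch_test_failures.py would
# produce); names here are illustrative only.
def filter_prs_by_file(prs, file_name):
    """Keep only PRs whose failures mention `file_name`, attaching the
    matching tests as a `matched_tests` field."""
    matched = []
    for pr in prs:
        hits = [f for f in pr.get("failures", []) if file_name in f["file"]]
        if hits:
            # Copy the PR record and add the per-file matches.
            matched.append({**pr, "matched_tests": hits})
    return matched
```

The point is simply that the filter works on already-fetched log data, so it can attach test-level detail per PR rather than just check names.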

`scan_prs.py` returns:

- `prs` — list of PRs with their failing check names
- `patterns` — check names that **visibly failed** on more than one PR, each with `failure_rate` (failures / appearances) and the list of PR numbers it failed on
- `rerun_patterns` — *(with `--deep` or `--file`)* check names that **failed then passed on the same commit SHA** on two or more PRs, with `rerun_count` and `rerun_rate` (reruns / PRs scanned). This surfaces flaky tests that devs routinely re-run — they won't appear in `patterns` because `statusCheckRollup` only shows the final passing state.
- `test_patterns` — *(with `--deep` or `--file`)* **individual test names** that failed on two or more PRs, each with `failure_count`, `failure_rate`, `pr_numbers`, and distinct `errors` seen. A test appearing here means the *same specific `it()` block* recurred across PRs — much stronger evidence than the same job recurring. With `--deep`: fetches logs for all failing checks not excluded by `_DETERMINISTIC_PREFIXES` across all scanned PRs (the same test can recur across different check matrix variants — this catches it). With `--file`: same log fetching but filtered to a specific file.
- `bots_excluded` — count of bot PRs filtered out
- `all_passing_count` — PRs where everything passed
- `filters` — the resolved `since`/`until`/`limit`/`deep` values actually used

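To make the field list concrete, here is an invented example of the shape a result might take. Every value below is illustrative, not real scan output:

```python
# Illustrative scan_prs.py result, matching the fields described above.
# All numbers, check names, and paths are invented for demonstration.
result = {
    "prs": [{"number": 101, "failing_checks": ["cypress-mocked (1)"]}],
    "patterns": [
        {"check": "cypress-mocked (1)", "failure_rate": "3/20", "pr_numbers": [101, 104, 117]},
    ],
    "rerun_patterns": [  # only with --deep or --file
        {"check": "cypress-mocked (1)", "rerun_count": 2, "rerun_rate": "2/20"},
    ],
    "test_patterns": [  # only with --deep or --file
        {
            "test": "renders the runs table",
            "file": "cypress/tests/mocked/pipelines/runs/runs.cy.ts",
            "failure_count": 3,
            "failure_rate": "3/20",
            "pr_numbers": [101, 104, 117],
            "errors": ["Timed out retrying after 4000ms"],
        },
    ],
    "bots_excluded": 4,
    "all_passing_count": 11,
    "filters": {"since": "2w", "until": None, "limit": 20, "deep": True},
}
```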
**Note on rates:** both `failure_rate` and `rerun_rate` are relative to the scan window. Always read them alongside the counts (e.g. `3/20 PRs = 15%`) rather than comparing rates across different scans.

Run this once per PR number across all test patterns (deduplicate to avoid redundant calls). Then for each test pattern, compare the PRs' changed files against the test's actual file path (use the directory component of `test_patterns[].file` as the feature area — e.g. `cypress/tests/mocked/pipelines/runs/` → pipelines/runs area):

- **No overlap** — none of the PRs where this test failed touched the same feature area as the test file → strong evidence the test is flaky independent of code changes
- **Partial overlap** — some PRs touched the relevant area, others did not → may be flaky or a recurring real regression; note the split
- **Full overlap** — all PRs touched the same feature area as the test file → the failures may reflect a persistent real regression rather than flakiness
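The overlap check can be sketched as below. The helper names and the prefix-trimming heuristic are assumptions for illustration, not the skill's exact rule:

```python
from pathlib import PurePosixPath

def feature_area(test_file):
    """Directory component of a test path, trimmed to the feature area,
    e.g. 'cypress/tests/mocked/pipelines/runs/runs.cy.ts' -> 'pipelines/runs'.
    The harness prefix below is an assumption for illustration."""
    parts = PurePosixPath(test_file).parent.parts
    prefix = ("cypress", "tests", "mocked")
    trimmed = parts[len(prefix):] if parts[:len(prefix)] == prefix else parts
    return "/".join(trimmed)

def overlap_verdict(test_file, prs_changed_files):
    """Classify whether the PRs that hit this failure touched its area."""
    area = feature_area(test_file)
    touched = [any(area in f for f in files) for files in prs_changed_files]
    if not any(touched):
        return "no overlap"      # strong flaky signal
    if all(touched):
        return "full overlap"    # possible real regression
    return "partial overlap"     # note the split
```

The substring match on the feature area is deliberately loose, since a PR's changed source files live outside the test directory itself.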

### Step 4 — Generate the scan report

**Signal assignment logic (in priority order):**

1. `⚠️ Suspected Flaky` — check name appears in `patterns` (cross-PR recurrence at the check level)
2. `❌ Likely Real` — check name is clearly deterministic (Lint, Type-Check, Build, kustomize) — these don't flake
3. `❓ Unknown` — check name is ambiguous or is a test runner check with no cross-PR pattern yet; note that no pattern at scan level doesn't mean the failure isn't flaky — it means there isn't enough signal yet without fetching logs
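The priority order above can be sketched as a small classifier. The prefix tuple mirrors the deterministic checks named in item 2 but is an assumption, not the skill's exact list:

```python
# Hedged sketch of the priority-ordered signal assignment.
# DETERMINISTIC_PREFIXES is illustrative, based on the checks named above.
DETERMINISTIC_PREFIXES = ("Lint", "Type-Check", "Build", "kustomize")

def assign_signal(check_name, patterns):
    # 1. Cross-PR recurrence at the check level wins.
    if any(p["check"] == check_name for p in patterns):
        return "⚠️ Suspected Flaky"
    # 2. Deterministic checks don't flake.
    if check_name.startswith(DETERMINISTIC_PREFIXES):
        return "❌ Likely Real"
    # 3. Everything else: not enough signal without fetching logs.
    return "❓ Unknown"
```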

### Test-Level Patterns — only present with --deep or --file (when test_patterns is non-empty)

<For each entry in test_patterns, incorporating overlap verdict from Step 3:>
- **"<test_name>"** (`<file>`) failed on <N>/<scanned> PRs (<rate>%)
  - PRs: #<n>, #<n>, ...
  - Error(s): `<errors[0]>` <and any additional distinct errors>
  - **Overlap:** <one of:>
    - *No overlap* — none of the PRs touched the test's feature area (`<dir>`) → strong flaky signal; consider opening a Jira task to track it
    - *Partial overlap* — some PRs touched `<dir>`, others did not → may be flaky or a recurring regression; investigate with `/flake-check <number>`
    - *Full overlap* — all PRs touched `<dir>` → possible persistent regression rather than flakiness; investigate before dismissing

<If --deep was used but test_patterns is empty or absent:>
No individual test recurred across multiple PRs — different tests failed within the same jobs each time. Try a wider window or `/flake-check scan --file <filename>` for a targeted search.

### Rerun Patterns (hidden failures) — only present with --deep

<For each entry in rerun_patterns:>

Use this format instead of the standard report when `filters.file_filter` is non-empty.

- <If N >= 3 PRs and authors differ:> Strong cross-PR recurrence across unrelated changes — high flaky signal; say "create a Jira" and I'll file a Task with label `flaky-test`
- Error(s): `<errors[0]>` <and any additional distinct errors>
- **Overlap:** <one of:>
  - *No overlap* — none of the PRs touched the test's feature area → strong flaky signal; consider opening a Jira task to track it
  - *Partial overlap* — some PRs touched the relevant area, others did not → investigate with `/flake-check <number>`
  - *Full overlap* — all PRs touched the same area → possible persistent regression; investigate before dismissing

<If test_patterns is empty:>
No individual test recurred across multiple PRs — each PR had a unique failure within this file. The file may still be flaky (different tests flaking each time), or each failure may be real. Run `/flake-check <number>` on individual PRs to investigate.