packages/gen-ai/.claude/skills/flake-check/SKILL.md
Lines changed: 68 additions & 3 deletions
@@ -1,20 +1,81 @@
1
1
---
2
2
name: flake-check
3
-
description: Investigates what's blocking a PR by running scripts to deterministically collect CI data, then classifying failures as suspected flaky or likely real. Pass a PR number to investigate, or "scan [N]" to survey recent PRs for patterns.
3
+
description: Investigates CI failures tied to GitHub PRs and highlights suspected flaky tests. Uses live CI data, symptom pattern matching, and cross-PR scanning.
4
4
---
5
5
6
6
# Flake check: investigates CI failures on a GitHub odh-dashboard PR and classifies them
7
7
8
8
Investigate failing CI checks on a PR and classify each failure using live CI data, symptom pattern matching, and cross-PR overlap analysis.
9
9
10
+
## Quick Reference
11
+
12
+
| Invocation | What it does |
13
+
|---|---|
14
+
|`/flake-check <PR>`| Investigate a single PR — fetches failing checks, reads logs, classifies each failure |
15
+
|`/flake-check <PR> --deep`| Same as above, plus detects checks that previously failed then passed on the same commit SHA |
16
+
|`/flake-check scan`| Lightweight survey of the last 20 PRs for recurring failure patterns (no log fetching) |
17
+
|`/flake-check scan --deep`| Same as above, plus rerun detection and cross-PR code overlap analysis for each pattern |
18
+
|`/flake-check scan <N>d`| Survey PRs from the last N days (e.g. `scan 7d`) |
19
+
|`/flake-check scan --file <filename>`| Find every PR in the window where a specific test file appeared in failures (fetches logs — slower) |
Flakiness is a pattern, not a single event. The more signals you see, the more confident you can be. In rough order of strength:
39
+
40
+
**Signal 1 — Cross-PR recurrence (strong)**
41
+
The same test or check fails on multiple unrelated PRs that touch different parts of the codebase. This is the strongest signal because it makes a regression caused by any single PR unlikely: if the test fails whether or not you've changed the relevant code, the problem most likely lies in the test itself (or in something shared, such as `main` or the CI infrastructure).
42
+
43
+
*Example:* `pipelineCreateRuns.cy.ts` fails on a PR that only changes `model-serving/`, a completely unrelated area.
44
+
45
+
*How to find it:*
46
+
- `/flake-check scan`: lightweight survey of the last 20 PRs
47
+
- `/flake-check scan 30d --file pipelineCreateRuns.cy.ts`: targeted search for a specific file
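
The recurrence idea behind Signal 1 can be sketched in a few lines. This is a minimal illustration, not the skill's actual scan script: the function name, the input shape (a mapping from PR number to failed test files), the PR numbers, and the second file name are all assumptions made for the example. It only counts recurrence; checking that the PRs are genuinely unrelated is a separate step.

```python
from collections import defaultdict


def recurring_failures(pr_failures, min_prs=2):
    """Return {test_file: sorted PR numbers} for tests that failed on at least min_prs PRs.

    pr_failures maps a PR number to the set of test files that failed on it
    (hypothetical input shape, chosen for this sketch).
    """
    prs_by_test = defaultdict(set)
    for pr, tests in pr_failures.items():
        for test in tests:
            prs_by_test[test].add(pr)
    return {
        test: sorted(prs)
        for test, prs in prs_by_test.items()
        if len(prs) >= min_prs
    }


# Invented sample data: PR numbers and "modelServing.cy.ts" are placeholders.
sample = {
    101: {"pipelineCreateRuns.cy.ts"},
    102: {"pipelineCreateRuns.cy.ts", "modelServing.cy.ts"},
    103: {"pipelineCreateRuns.cy.ts"},
}
print(recurring_failures(sample))
# {'pipelineCreateRuns.cy.ts': [101, 102, 103]}
```

A test that appears on several PRs is only a candidate; it still needs the "unrelated PRs" check before it counts as a strong signal.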
48
+
49
+
**Signal 2 — Rerun detection (strong)**
50
+
The check failed on an earlier run, then passed on a later run of the same commit SHA without any code change in between. Same code, different result — the test is non-deterministic by definition.
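
As a rough sketch of what rerun detection involves, the snippet below groups check runs by check name and commit SHA, then flags any group where a failure is later followed by a success. The record fields (`check`, `sha`, `started_at`, `conclusion`) are assumptions for illustration and may not match what the skill's scripts emit; timestamps are assumed to be ISO 8601 strings in UTC so that string order equals time order.

```python
from collections import defaultdict


def detect_reruns(runs):
    """Return sorted (check, sha) pairs that failed and later passed on the same SHA.

    Each run is a dict with hypothetical keys:
    'check', 'sha', 'started_at' (ISO 8601 UTC string), 'conclusion'.
    """
    grouped = defaultdict(list)
    for run in runs:
        grouped[(run["check"], run["sha"])].append(run)

    flagged = []
    for key, group in grouped.items():
        group.sort(key=lambda r: r["started_at"])  # oldest run first
        seen_failure = False
        for run in group:
            if run["conclusion"] == "failure":
                seen_failure = True
            elif run["conclusion"] == "success" and seen_failure:
                flagged.append(key)
                break
    return sorted(flagged)


# Invented sample data for illustration only.
runs = [
    {"check": "cypress-tests", "sha": "abc123", "started_at": "2026-04-01T10:00:00Z", "conclusion": "failure"},
    {"check": "cypress-tests", "sha": "abc123", "started_at": "2026-04-01T11:00:00Z", "conclusion": "success"},
    {"check": "lint", "sha": "abc123", "started_at": "2026-04-01T10:00:00Z", "conclusion": "success"},
    {"check": "unit", "sha": "abc123", "started_at": "2026-04-01T10:00:00Z", "conclusion": "failure"},
    {"check": "unit", "sha": "def456", "started_at": "2026-04-01T12:00:00Z", "conclusion": "success"},
]
print(detect_reruns(runs))
# [('cypress-tests', 'abc123')]
```

The `unit` check is not flagged because its pass happened on a different SHA, so a code change could explain the different result.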
51
+
52
+
**Signal 3 — No code overlap on a single PR (moderate)**
53
+
The failing test exercises feature area X, but the PR only touches feature area Y. Not conclusive on its own (unrelated changes can still expose race conditions in a shared subsystem), but it lowers suspicion that the PR caused the regression.
54
+
55
+
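*How to find it:* `/flake-check <PR> --deep` adds a "Previously Failed — Rerun Detected" section to the report. Note that this section reports rerun detection (Signal 2); to judge code overlap, compare the PR's changed paths with the feature area the failing test exercises.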
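
One crude way to approximate the code-overlap check in Signal 3 is to reduce every path to a "feature area" prefix and see whether the changed files and the failing tests share one. The sketch below assumes an area is the first two path components, which is a guess about repository layout rather than something this skill defines, and the example paths are invented.

```python
from pathlib import PurePosixPath


def feature_area(path, depth=2):
    """Reduce a repo-relative path to its first `depth` components."""
    return "/".join(PurePosixPath(path).parts[:depth])


def has_code_overlap(changed_files, failing_test_paths, depth=2):
    """True if any changed file shares a feature area with any failing test."""
    changed = {feature_area(p, depth) for p in changed_files}
    failing = {feature_area(p, depth) for p in failing_test_paths}
    return bool(changed & failing)


# Invented paths for illustration only.
changed = ["packages/model-serving/src/ServingTable.tsx"]
failing = ["packages/pipelines/cypress/pipelineCreateRuns.cy.ts"]
print(has_code_overlap(changed, failing))
# False
```

A `False` result only lowers suspicion. Tests often live apart from the code they exercise, and shared subsystems can couple areas that look unrelated by path.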
56
+
57
+
**Signal 4 — Symptom pattern match (weak, starting point only)**
58
+
59
+
| Pattern | What it usually indicates |
60
+
|---------|--------------------------|
61
+
|`CypressError: Timed out retrying after`| DOM timing race |
62
+
|`cy.click() / cy.type() failed — requires a DOM element`| Element disappeared mid-test |
63
+
|`AssertionError: Timed out retrying`| Timing race — but check the error detail; a missing named element may be a real defect |
64
+
|`socket hang up` / `ECONNRESET` / `ERR_CONNECTION_REFUSED`| Infrastructure hiccup |
65
+
|`Cannot read properties of null`| Race condition in test setup |
66
+
67
+
A symptom match is a starting signal, not a verdict. Real failures can produce identical patterns, so always check the error detail and look for corroborating signals before concluding flakiness.
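
The symptom table above maps naturally onto a list of regular expressions. The following is a minimal sketch under that assumption. The regexes are my own approximations of the table's patterns (for instance, the `cy.click()` row is matched loosely because the exact log wording is not given here), and `match_symptoms` is a hypothetical helper, not part of the skill's scripts.

```python
import re

# (compiled pattern, label) pairs mirroring the symptom table; regexes are approximations.
SYMPTOMS = [
    (re.compile(r"CypressError: Timed out retrying after"), "DOM timing race"),
    (re.compile(r"cy\.(click|type)\(\) failed.*requires a DOM element"), "Element disappeared mid-test"),
    (re.compile(r"AssertionError: Timed out retrying"), "Timing race (check the error detail)"),
    (re.compile(r"socket hang up|ECONNRESET|ERR_CONNECTION_REFUSED"), "Infrastructure hiccup"),
    (re.compile(r"Cannot read properties of null"), "Race condition in test setup"),
]


def match_symptoms(log_text):
    """Return the labels of every symptom pattern found in the log text."""
    return [label for pattern, label in SYMPTOMS if pattern.search(log_text)]


print(match_symptoms("Error: socket hang up"))
# ['Infrastructure hiccup']
```

An empty result does not prove the failure is real, and a non-empty one does not prove flakiness; the labels are hints for what to read next in the log.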
68
+
69
+
---
70
+
10
71
## Architecture
11
72
12
73
This skill separates **data collection** (deterministic Python scripts) from **analysis** (Claude reasoning):
13
74
14
75
| Phase | Who | What |
15
76
|---|---|---|
16
77
| Data collection | Scripts | Fetch PR state, CI logs — output clean JSON |
17
-
| Classification | Claude | Apply confidence model using symptom patterns and code overlap |
78
+
| Classification | Claude | Apply the confidence model using symptom patterns, code overlap with the failing tests, and cross-PR recurrence |
18
79
| Report | Claude | Generate structured, actionable output |
19
80
20
81
Scripts live in `<base_path>/scripts/`. `<base_path>` is the absolute path to the directory containing this SKILL.md file — resolve it at skill load time from the skill's own path (e.g. `/Users/myuser/code/odh-dashboard/packages/gen-ai/.claude/skills/flake-check`). Run scripts from the repo root (the user's working directory) using their absolute path. All scripts output JSON to stdout; errors go to stderr.
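
The contract described above (absolute script path, JSON on stdout, errors on stderr) could be consumed with a small wrapper like the one below. This is a sketch only: `run_script` is a hypothetical helper and no real script name from this skill is assumed.

```python
import json
import subprocess
import sys
from pathlib import Path


def run_script(script_path, *args):
    """Run a Python script by absolute path and parse its stdout as JSON.

    Raises RuntimeError carrying the script's stderr if it exits non-zero.
    """
    proc = subprocess.run(
        [sys.executable, str(script_path), *map(str, args)],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"{Path(script_path).name} failed: {proc.stderr.strip()}")
    return json.loads(proc.stdout)


# Usage (placeholder path and script name, shown as a comment so the block stays runnable):
# data = run_script(Path("<base_path>") / "scripts" / "some_script.py", 4821)
```

Keeping stdout reserved for JSON means the caller never has to strip log noise before parsing.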
@@ -29,7 +90,11 @@ Scripts have no external dependencies. They use only the Python standard library
29
90
- A PR number (e.g. `4821`) → run the **Main Investigation** workflow
30
91
- A PR number with `--deep` (e.g. `4821 --deep`) → Main Investigation with rerun detection
31
92
- `scan` or `scan <N>` → run the **scan** workflow (N PRs, default 20)
93
+
- `scan <N>d` (e.g. `scan 7d`) → scan PRs from the last N days
94
+
- `scan --since <date/period> --until <date/period>` → scan a specific time window (e.g. `--since 14d --until 7d` or `--since 2026-04-01 --until 2026-04-15`)
95
+
- `scan --deep` → scan with rerun detection and cross-PR code overlap analysis
32
96
- `scan --file <filename>` (e.g. `scan --file pipelineCreateRuns.cy.ts`) → run the **file-scoped scan** workflow, which fetches CI logs to find every PR where that specific test file failed
97
+
- Any `scan` variant may combine modifiers (e.g. `scan 30d --deep --file pipelineCreateRuns.cy.ts`)
33
98
- Empty → ask the user for a PR number, then run **Main Investigation**
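
The routing rules in the list above can be expressed as a small parser. This is an illustrative sketch, not how the skill actually dispatches (the document leaves that to Claude's reading of the arguments). The function name, the returned dictionary keys, and the workflow labels are assumptions.

```python
import re
import shlex


def parse_invocation(arg_string):
    """Map a /flake-check argument string to a workflow description (hypothetical shape)."""
    tokens = shlex.split(arg_string)
    if not tokens:
        return {"workflow": "ask_for_pr"}  # empty input: ask the user for a PR number

    deep = "--deep" in tokens
    if tokens[0] != "scan":
        # int() raises ValueError if the first token is not a PR number
        return {"workflow": "main", "pr": int(tokens[0]), "deep": deep}

    result = {"workflow": "scan", "deep": deep, "count": 20,
              "days": None, "since": None, "until": None, "file": None}
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("--file", "--since", "--until") and i + 1 < len(tokens):
            result[tok[2:]] = tokens[i + 1]  # option takes the next token as its value
            i += 1
        elif re.fullmatch(r"\d+d", tok):
            result["days"] = int(tok[:-1])   # e.g. "7d" means the last 7 days
        elif tok.isdigit():
            result["count"] = int(tok)       # e.g. "scan 50" means the last 50 PRs
        i += 1
    if result["file"]:
        result["workflow"] = "file_scan"
    return result


print(parse_invocation("scan 30d --deep --file pipelineCreateRuns.cy.ts"))
```

In this sketch `count` keeps its default of 20 even when a day window is given; how the two interact in the real workflow is not specified here.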
34
99
35
100
---
@@ -44,7 +109,7 @@ Apply this conservatively — a real regression can produce the same symptoms as
44
109
- If your analysis concludes the failure is likely deterministic (wrong selector, missing element, new test with a bug), classify as Likely real instead and explain why
45
110
- Do NOT label it flaky — surface it as a possibility
46
111
- Prompt the dev: is this related to their changes? Has this test passed on `main` recently?
47
-
- Recommended action: gather multiple signals to confirm — see "If You Confirm It's Flaky" in any investigation report; then log a RHOAIENG Task with label `flaky-test`
112
+
- Recommended action: gather multiple signals to confirm — see "If You Confirm It's Flaky" in any investigation report; then log a RHOAIENG Task with label `flaky-test` if not already tracked
48
113
49
114
**Likely real**
50
115
- No symptom pattern match and no cross-PR recurrence signal