The `/flake-check` skill investigates CI failures tied to GitHub PRs and highlights suspected flaky tests, using live CI data, symptom pattern matching, and cross-PR overlap analysis. A Jira may be logged for a suspected or confirmed flaky test to track its resolution.
## Commands
| Command | Description |
|---|---|
| `/flake-check <PR>` | Investigate a single PR — fetches failing checks, reads logs, classifies each failure |
| `/flake-check <PR> --deep` | Same as above, plus detects checks that previously failed then passed on the same commit SHA |
| `/flake-check scan` | Lightweight survey of the last 20 PRs for recurring failure patterns (no log fetching) |
| `/flake-check scan --deep` | Same as above, plus rerun detection and cross-PR code overlap analysis for each pattern |
| `/flake-check scan <N>d` | Survey PRs from the last N days (e.g. `scan 7d`) |
| `/flake-check scan --file <filename>` | Find every PR in the window where a specific test file appeared in failures (fetches logs — slower) |
**Examples — copy/paste and substitute your own values:**
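The PR number here is hypothetical, and the filename is one of the test files discussed later in this document:

```
/flake-check 1234
/flake-check 1234 --deep
/flake-check scan 7d
/flake-check scan --file pipelineCreateRuns.cy.ts
```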
Flaky classification is never based on a single signal. The skill accumulates evidence across multiple dimensions and applies them conservatively — a real regression can produce the same symptoms as a flaky test.
### Signal 1 — Cross-PR recurrence (strong)
The same check or test file fails on multiple PRs whose code changes are in **unrelated areas**. For example, `pipelineCreateRuns.cy.ts` failing on a PR that only touched `api-keys/maas` code, and again on a PR that only touched `model-serving/` code — neither of which touches pipelines. When a test fails repeatedly across PRs with no common code thread, the failure is almost certainly independent of the code changes.
How to find this:
- `/flake-check scan --file <filename>` surfaces file-level recurrence with log-level detail
- `/flake-check <PR> --deep`, then checking other recent PRs manually
### Signal 2 — No code overlap on a single PR (moderate)
A test fails on a PR whose changes are entirely in a different feature area than the test exercises. For example, a pipelines test failing on a PR that only modifies authentication code. This is a moderate signal on its own — it means the failure is *likely* unrelated to the PR, but it could still be a pre-existing regression on `main`.
The skill performs this analysis automatically during PR investigation: it fetches the PR's changed files and compares the domain against the failing test's directory.
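That comparison can be pictured as a directory-prefix heuristic — a minimal sketch, assuming a file's feature area is its first path segment (the skill's actual heuristic may be richer):

```typescript
// Assumption for illustration: a file's feature area is its first path segment.
function areaOf(path: string): string {
  return path.split("/")[0] ?? "";
}

// Compare a PR's changed files against the failing test's area.
function overlapVerdict(
  changedFiles: string[],
  testFile: string
): "overlap" | "no-overlap" | "unclear" {
  const testArea = areaOf(testFile);
  const changedAreas = new Set(changedFiles.map(areaOf));
  if (changedAreas.has(testArea)) return "overlap"; // a real regression is plausible
  if (changedAreas.size > 3) return "unclear";      // PR spans many areas
  return "no-overlap";                              // failure likely unrelated to this PR
}
```

For example, a `pipelines/` test failing on a PR that only touched `model-serving/` files yields `no-overlap`.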
### Signal 3 — Rerun detection (moderate)
A check **failed then passed on the same commit SHA** without any new code being pushed. This means a developer triggered a re-run and it passed — a strong behavioural indicator that the failure was transient.
How to find this:
- `/flake-check <PR> --deep` — reports `rerun_detected` entries for the specific PR
- `/flake-check scan --deep` — surfaces `rerun_patterns` across many PRs
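Conceptually, rerun detection groups check runs by commit SHA and check name, then flags any group containing a failure followed by a success. A sketch over already-fetched check-run data (the field names here are illustrative, not a real API shape):

```typescript
// Illustrative shape of a fetched check run; field names are assumptions.
interface CheckRun {
  sha: string;
  name: string;
  conclusion: "success" | "failure";
  startedAt: string; // ISO timestamp
}

// Flag checks that failed and then passed on the same commit SHA.
function detectReruns(runs: CheckRun[]): string[] {
  const byKey = new Map<string, CheckRun[]>();
  for (const run of runs) {
    const key = `${run.sha}:${run.name}`;
    if (!byKey.has(key)) byKey.set(key, []);
    byKey.get(key)!.push(run);
  }
  const flagged: string[] = [];
  for (const [key, group] of byKey) {
    group.sort((a, b) => a.startedAt.localeCompare(b.startedAt));
    const firstFailure = group.findIndex((r) => r.conclusion === "failure");
    const passedLater =
      firstFailure >= 0 &&
      group.slice(firstFailure + 1).some((r) => r.conclusion === "success");
    if (passedLater) flagged.push(key);
  }
  return flagged;
}
```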
When a developer posts a `/retest` comment on a PR, they are manually triggering a CI rerun — the human-visible equivalent of the automated signal above. A PR with one or more `/retest` comments *may* indicate a flaky test, but this signal is weak on its own because:
- `/retest` often follows a real fix (e.g. after pushing a correction) — not every rerun is flakiness
- A PR with multiple `/retest` comments on a check that keeps failing is *more* suggestive of an intermittent issue
- This signal is only visible when manually reading PR comments; the skill does not scan for it automatically
Use it as a prompt for investigation, not as a classification. If you notice `/retest` comments while reviewing a PR and the check eventually passed, treat that as supporting evidence alongside Signals 1–3 above.
### Signal 4 — Symptom pattern match (weak, starting point only)
The error message matches a known timing or infrastructure error pattern:
| Pattern | Typical cause |
|---|---|
| `net::ERR_CONNECTION_REFUSED` | CI service failed to start or crashed |
| `Cannot read properties of null` | Race condition — component unmounted or not yet mounted |
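Symptom matching can be sketched as a first-match scan over a pattern list (the patterns below are illustrative, not the skill's exact set):

```typescript
// Illustrative symptom patterns; the skill's real list is longer.
const SYMPTOM_PATTERNS: Array<[RegExp, string]> = [
  [/net::ERR_CONNECTION_REFUSED/, "infrastructure"],
  [/Cannot read properties of null/, "race condition"],
];

// Return the first matching symptom category, or null when nothing matches.
function matchSymptom(errorText: string): string | null {
  for (const [pattern, category] of SYMPTOM_PATTERNS) {
    if (pattern.test(errorText)) return category;
  }
  return null; // no symptom match — not evidence of flakiness on its own
}
```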
**A symptom match is a starting signal, not a verdict.** Always cross-reference with code overlap and cross-PR recurrence before classifying.
---
## Confidence Model
| Tier | Verdict | Criteria |
|---|---|---|
| 1 | **Suspected Flaky** | Symptom pattern match, or cross-PR recurrence with no code overlap, or rerun detection |
| 2 | **Likely Real** | No symptom pattern match, or code overlap with the PR's changes detected |
For suspected flaky failures, the skill further annotates based on code overlap:
- **No overlap** — failure is likely unrelated to this PR; safe to rerun, but log a Jira if it passes
- **Overlap detected** — a real regression is plausible; investigate before dismissing
- **Unclear** — PR spans many areas or the test area is ambiguous; treat with caution
---
## What to Do When You Suspect a Flaky Test
1. **Rerun the failing check** — if it passes, that's strong evidence of flakiness
2. **Check for cross-PR recurrence** — `/flake-check scan --file <filename>` to see if it's happened before
3. **Look at the test code** — is there a missing `cy.wait('@alias')`, a missing `.should('be.visible')` guard, or an obvious race condition?
4. **If confirmed flaky** — say "create a Jira" and I'll search for an existing ticket and file a Task with the `flaky-test` label if none exists, then fix the root cause
---
## Tracking Flaky Tests
Suspected or confirmed flaky tests can be tracked as **Jira Tasks** with the `flaky-test` label in RHOAIENG. When an investigation identifies a suspected flaky test with strong evidence, the report includes pre-filled Jira fields — say "create the Jira" to file it.
To find all open flaky test tickets:
```
project = RHOAIENG AND labels = "flaky-test" AND status != Done
```