Commit 9c01c62
add WIP flake-check skill to /packages/gen-ai/.claude/skills/flake-check
1 parent ab20759
# flake-check — Skill README

The `/flake-check` skill investigates CI failures on pull requests and classifies them as **confirmed flaky**, **suspected flaky**, or **likely real** using a combination of live CI data, symptom pattern matching, and the registry in `flaky-tests.yaml`.

## Commands

| Invocation | What it does |
|---|---|
| `/flake-check <PR>` | Investigate a single PR — fetches failing checks, reads logs, classifies each failure |
| `/flake-check <PR> --deep` | Same as above, plus detects checks that previously failed then passed on the same commit SHA |
| `/flake-check scan` | Lightweight survey of the last 20 PRs for recurring failure patterns (no log fetching) |
| `/flake-check scan <N>d` | Survey PRs from the last N days (e.g. `scan 7d`) |
| `/flake-check scan --file <filename>` | Find every PR in the window where a specific test file appeared in failures (fetches logs — slower) |
| `/flake-check mark-flaky` | Register a confirmed flaky test in `flaky-tests.yaml` |
| `/flake-check stats` | Show registry trends — most impactful tests, area breakdown, recently active entries |

---
## How We Identify Flaky Tests

Flaky classification is never based on a single signal. The skill accumulates evidence across multiple dimensions and weighs it conservatively — a real regression can produce the same symptoms as a flaky test.

### Signal 1 — Registry match (strongest)

The test name or file is already in `flaky-tests.yaml` with prior PR occurrences. This is the only signal that produces a **Confirmed Flaky** verdict without further investigation. Cite the entry's `resolution` field and act on it.
### Signal 2 — Cross-PR recurrence (strong)

The same check or test file fails on multiple PRs whose code changes are in **unrelated areas**. For example, `pipelineCreateRuns.cy.ts` failing on a PR that only touched `api-keys/maas` code, and again on a PR that only touched `model-serving/` code — neither of which touches pipelines. When a test fails repeatedly across PRs with no common code thread, the failure is almost certainly independent of the code changes.

How to find this:

- `/flake-check scan` surfaces check-level recurrence across the last N PRs
- `/flake-check scan --file <filename>` surfaces file-level recurrence with log-level detail
- `/flake-check <PR> --deep`, then manually checking other recent PRs
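The "no common code thread" test reduces to an intersection over the changed areas of every PR a file failed on. The sketch below is a minimal illustration, assuming per-PR records of changed areas and failing test files; the `PrFailure` interface and `recurrenceCandidates` name are hypothetical, not the skill's actual schema.

```typescript
// Hypothetical shape for per-PR CI data; the real skill derives this
// from live CI results rather than a prepared structure.
interface PrFailure {
  pr: number;
  changedAreas: string[]; // code areas touched by the PR's diff
  failingFiles: string[]; // test files that failed on the PR
}

// A file is a recurrence candidate when it fails on two or more PRs
// whose changed-area sets have an empty intersection: no common thread.
function recurrenceCandidates(failures: PrFailure[]): string[] {
  const areasByFile = new Map<string, string[][]>();
  for (const { changedAreas, failingFiles } of failures) {
    for (const file of failingFiles) {
      const sets = areasByFile.get(file) ?? [];
      sets.push(changedAreas);
      areasByFile.set(file, sets);
    }
  }
  const candidates: string[] = [];
  for (const [file, areaSets] of areasByFile) {
    if (areaSets.length < 2) continue; // seen on only one PR: not recurrence
    const common = areaSets.reduce((acc, set) => acc.filter((a) => set.includes(a)));
    if (common.length === 0) candidates.push(file);
  }
  return candidates;
}
```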
### Signal 3 — No code overlap on a single PR (moderate)

A test fails on a PR whose changes are entirely in a different feature area than the test exercises. For example, a pipelines test failing on a PR that only modifies authentication code. This is a moderate signal on its own — it means the failure is *likely* unrelated to the PR, but it could still be a pre-existing regression on `main`.

The skill performs this analysis automatically during PR investigation: it fetches the PR's changed files and compares the domain against the failing test's directory.
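One possible form of that comparison, sketched under assumptions: the failing test's area is taken from its parent directory name, and the PR overlaps when any changed file path contains that segment. The real heuristic may be richer; `codeOverlap` is an illustrative name.

```typescript
// Illustrative heuristic: treat the failing test's parent directory as
// its feature area, and report overlap when any changed file's path
// contains that same segment.
function codeOverlap(changedFiles: string[], failingTest: string): boolean {
  const parts = failingTest.split("/");
  const testArea = parts.length > 1 ? parts[parts.length - 2] : "";
  return testArea !== "" && changedFiles.some((f) => f.split("/").includes(testArea));
}
```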
### Signal 4 — Rerun detection (moderate)

A check **failed then passed on the same commit SHA** without any new code being pushed. This means a developer triggered a re-run and it passed — a strong behavioural indicator that the failure was transient. These hidden failures don't appear in GitHub's final check status, so they're easy to miss.

How to find this:

- `/flake-check <PR> --deep` — reports `rerun_detected` entries for the specific PR
- `/flake-check scan --deep` — surfaces `rerun_patterns` across many PRs, identifying checks that developers routinely re-run to get past
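Mechanically, rerun detection is a grouping problem: bucket check runs by (check name, commit SHA) and flag any bucket whose earliest run failed but a later run succeeded. A minimal sketch, with hypothetical field names rather than the GitHub API's exact response shape:

```typescript
// Hypothetical check-run record; field names are illustrative.
interface CheckRun {
  name: string;
  headSha: string;
  conclusion: "success" | "failure";
  startedAt: string; // ISO 8601 timestamp
}

// Flags every (check, SHA) pair whose earliest run failed but a later
// run on the very same commit succeeded: the fingerprint of a re-run.
function detectReruns(runs: CheckRun[]): string[] {
  const grouped = new Map<string, CheckRun[]>();
  for (const run of runs) {
    const key = `${run.name}@${run.headSha}`;
    const group = grouped.get(key) ?? [];
    group.push(run);
    grouped.set(key, group);
  }
  const flagged: string[] = [];
  for (const [key, group] of grouped) {
    const ordered = [...group].sort((a, b) => a.startedAt.localeCompare(b.startedAt));
    if (
      ordered[0].conclusion === "failure" &&
      ordered.slice(1).some((r) => r.conclusion === "success")
    ) {
      flagged.push(key);
    }
  }
  return flagged;
}
```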
### Signal 5 — Symptom pattern match (weak, starting point only)

The error message matches a known timing or infrastructure error pattern:

| Pattern | What it usually indicates |
|---|---|
| `CypressError: Timed out retrying after` | Race condition — element didn't become interactive in time |
| `cy.click() failed because it requires a DOM element` | Element disappeared or never mounted |
| `cy.type() failed because it requires a DOM element` | Same as above for input fields |
| `AssertionError: Timed out retrying` | Assertion never became true — **distinguish**: if it names a specific element that should always exist, this may be a real defect |
| `socket hang up` / `ECONNRESET` | Network instability in CI |
| `net::ERR_CONNECTION_REFUSED` | CI service failed to start or crashed |
| `Cannot read properties of null` | Race condition — component unmounted or not yet mounted |

**A symptom match is a starting signal, not a verdict.** Always cross-reference with code overlap and cross-PR recurrence before classifying. A broken selector, a missing mock, or a genuine product bug can produce identical error messages.
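The patterns in the table translate naturally into a small matcher. This sketch mirrors the table's error strings as regular expressions; the constant and function names are illustrative.

```typescript
// Regular-expression forms of the symptom table above.
const FLAKE_SYMPTOMS: RegExp[] = [
  /CypressError: Timed out retrying after/,
  /cy\.(click|type)\(\) failed because it requires a DOM element/,
  /AssertionError: Timed out retrying/,
  /socket hang up/,
  /ECONNRESET/,
  /net::ERR_CONNECTION_REFUSED/,
  /Cannot read properties of null/,
];

// True when the CI error text matches any known flake symptom.
// A match is only a starting signal, never a verdict on its own.
function matchesFlakeSymptom(errorText: string): boolean {
  return FLAKE_SYMPTOMS.some((pattern) => pattern.test(errorText));
}
```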
---

## Confidence Model

| Tier | Verdict | Criteria |
|---|---|---|
| 1 | **Confirmed Flaky** | Registry match in `flaky-tests.yaml` |
| 2 | **Suspected Flaky** | Symptom pattern match, or cross-PR recurrence with no code overlap, or rerun detection — but not in the registry |
| 3 | **Likely Real** | No registry match, no symptom pattern, and/or code overlap detected |

For suspected flaky, the skill further annotates based on code overlap:

- **No overlap** — failure is likely unrelated to this PR; safe to rerun, but register if it passes
- **Overlap detected** — a real regression is plausible; investigate before dismissing
- **Unclear** — PR spans many areas or the test area is ambiguous; treat with caution
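The tiering amounts to a short decision function. A sketch, assuming boolean evidence flags gathered by the five signals; the `Evidence` type and `classify` name are illustrative.

```typescript
type Verdict = "Confirmed Flaky" | "Suspected Flaky" | "Likely Real";

// Hypothetical bundle of evidence flags, one per signal.
interface Evidence {
  registryMatch: boolean;     // Signal 1
  crossPrRecurrence: boolean; // Signal 2
  codeOverlap: boolean;       // Signal 3 (overlap argues for "real")
  rerunDetected: boolean;     // Signal 4
  symptomMatch: boolean;      // Signal 5
}

// Tier 1 beats everything; any weaker flake signal without a registry
// entry lands in tier 2; otherwise the failure is treated as likely real.
function classify(e: Evidence): Verdict {
  if (e.registryMatch) return "Confirmed Flaky";
  if (e.symptomMatch || e.rerunDetected || (e.crossPrRecurrence && !e.codeOverlap)) {
    return "Suspected Flaky";
  }
  return "Likely Real";
}
```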
---
## What to Do When You Suspect a Flaky Test

1. **Check if it's already registered** — `/flake-check stats` or look at `flaky-tests.yaml`
2. **Rerun the failing check** — if it passes, that's strong evidence of flakiness
3. **Check for cross-PR recurrence** — `/flake-check scan --file <filename>` to see if it's happened before
4. **Look at the test code** — is there a missing `cy.wait('@alias')`, a missing `.should('be.visible')` guard, or an obvious race condition?
5. **If confirmed flaky** — run `/flake-check mark-flaky` to register it, then raise a Jira to fix the root cause
---
## The Registry (`flaky-tests.yaml`)

Machine-readable source of truth for known flaky tests. Each entry has:

| Field | Description |
|---|---|
| `id` | Unique ID in `<area>-<NNN>` format (e.g. `pipelines-001`) |
| `test` | Exact test name from the `it()` block |
| `file` | Path relative to repo root |
| `area` | Short slug (e.g. `model-catalog`, `pipelines`, `workbenches`) |
| `symptoms` | List of error strings or patterns observed |
| `first_seen` / `last_seen` | ISO dates |
| `pr_occurrences` | PR numbers where this was observed |
| `status` | `active` / `intermittent` / `resolved` |
| `resolution` | What to do when this failure appears |
| `jira` | Tracking ticket (optional) |
| `notes` | Additional context (optional) |
Entries are created and updated via `/flake-check mark-flaky` — do not edit by hand unless correcting a mistake.
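For illustration, an entry using the fields above might look like this. Every value below is invented; the file path, dates, PR numbers, and ticket ID are hypothetical, not taken from the real registry.

```yaml
# Hypothetical registry entry; all values are invented for illustration.
- id: pipelines-001
  test: "creates and starts a pipeline run"
  file: cypress/e2e/pipelines/pipelineCreateRuns.cy.ts
  area: pipelines
  symptoms:
    - "CypressError: Timed out retrying after"
  first_seen: 2025-01-10
  last_seen: 2025-02-03
  pr_occurrences: [101, 102]
  status: active
  resolution: "Safe to rerun; failure is unrelated to PR changes."
  jira: PROJ-1234
  notes: "Race condition on the run-creation modal."
```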
