# flake-check — Skill README

The `/flake-check` skill investigates CI failures on pull requests and classifies them as **confirmed flaky**, **suspected flaky**, or **likely real** using a combination of live CI data, symptom pattern matching, and the registry in `flaky-tests.yaml`.

## Commands

| Invocation | What it does |
|---|---|
| `/flake-check <PR>` | Investigate a single PR — fetches failing checks, reads logs, classifies each failure |
| `/flake-check <PR> --deep` | Same as above, plus detects checks that previously failed then passed on the same commit SHA |
| `/flake-check scan` | Lightweight survey of the last 20 PRs for recurring failure patterns (no log fetching) |
| `/flake-check scan <N>d` | Survey PRs from the last N days (e.g. `scan 7d`) |
| `/flake-check scan --file <filename>` | Find every PR in the window where a specific test file appeared in failures (fetches logs — slower) |
| `/flake-check mark-flaky` | Register a confirmed flaky test in `flaky-tests.yaml` |
| `/flake-check stats` | Show registry trends — most impactful tests, area breakdown, recently active entries |
---

## How We Identify Flaky Tests

Flaky classification is never based on a single signal. The skill accumulates evidence across multiple dimensions and applies it conservatively — a real regression can produce the same symptoms as a flaky test.

### Signal 1 — Registry match (strongest)

The test name or file is already in `flaky-tests.yaml` with prior PR occurrences. This is the only signal that produces a **Confirmed Flaky** verdict without further investigation. Cite the entry's `resolution` field and act on it.

### Signal 2 — Cross-PR recurrence (strong)

The same check or test file fails on multiple PRs whose code changes are in **unrelated areas**. For example, `pipelineCreateRuns.cy.ts` failing on a PR that only touched `api-keys/maas` code, and again on a PR that only touched `model-serving/` code — neither of which touches pipelines. When a test fails repeatedly across PRs with no common code thread, the failure is almost certainly independent of the code changes.

How to find this:
- `/flake-check scan` surfaces check-level recurrence across the last N PRs
- `/flake-check scan --file <filename>` surfaces file-level recurrence with log-level detail
- `/flake-check <PR> --deep`, followed by manually checking other recent PRs

### Signal 3 — No code overlap on a single PR (moderate)

A test fails on a PR whose changes are entirely in a different feature area than the one the test exercises. For example, a pipelines test failing on a PR that only modifies authentication code. This is a moderate signal on its own — it means the failure is *likely* unrelated to the PR, but it could still be a pre-existing regression on `main`.

The skill performs this analysis automatically during PR investigation: it fetches the PR's changed files and compares their feature area against the failing test's directory.
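The overlap comparison can be sketched roughly like this. It is a minimal illustration, not the skill's actual implementation: the path-skipping heuristic, the function names, and the example paths are all assumptions.

```python
from pathlib import PurePosixPath

def top_level_area(path: str) -> str:
    """Approximate a file's feature area from its first meaningful path segment."""
    # Skip common source roots so e.g. "frontend/src/pipelines/x.ts" maps to "pipelines".
    # These root names are assumptions for the sketch.
    skip = {"frontend", "src", "cypress", "tests", "e2e"}
    for part in PurePosixPath(path).parts[:-1]:
        if part not in skip:
            return part
    return "unknown"

def has_code_overlap(changed_files: list[str], failing_test: str) -> bool:
    """True if any changed file shares a feature area with the failing test."""
    test_area = top_level_area(failing_test)
    return any(top_level_area(f) == test_area for f in changed_files)

# Hypothetical PR touching only auth code, vs. a failing pipelines test:
changed = ["frontend/src/auth/login.ts", "frontend/src/auth/session.ts"]
test = "frontend/src/pipelines/pipelineCreateRuns.cy.ts"
print(has_code_overlap(changed, test))  # False: no overlap, a moderate flake signal
```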

### Signal 4 — Rerun detection (moderate)

A check **failed then passed on the same commit SHA** without any new code being pushed. This means a developer triggered a re-run and it passed — a strong behavioural indicator that the failure was transient. These hidden failures don't appear in GitHub's final check status, so they're easy to miss.

How to find this:
- `/flake-check <PR> --deep` — reports `rerun_detected` entries for the specific PR
- `/flake-check scan --deep` — surfaces `rerun_patterns` across many PRs, identifying checks that developers routinely re-run to get past failures
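Rerun detection amounts to grouping check attempts by commit SHA and check name, then flagging any group where a failure is followed by a success. A minimal sketch, assuming a flattened list of check attempts (the field names here are hypothetical, not GitHub's API shape):

```python
from collections import defaultdict

def detect_reruns(check_runs: list[dict]) -> list[str]:
    """Return names of checks that failed then later passed on the same commit SHA."""
    by_key = defaultdict(list)
    for run in check_runs:
        by_key[(run["sha"], run["name"])].append(run)

    rerun_detected = []
    for (_sha, name), runs in by_key.items():
        runs.sort(key=lambda r: r["started_at"])
        conclusions = [r["conclusion"] for r in runs]
        # A failure followed by an eventual success on the same SHA = transient failure.
        if "failure" in conclusions and conclusions[-1] == "success":
            rerun_detected.append(name)
    return rerun_detected

# Hypothetical attempts: cypress-e2e failed, was re-run, and passed on the same SHA.
runs = [
    {"sha": "abc123", "name": "cypress-e2e", "conclusion": "failure", "started_at": 1},
    {"sha": "abc123", "name": "cypress-e2e", "conclusion": "success", "started_at": 2},
    {"sha": "abc123", "name": "lint", "conclusion": "success", "started_at": 1},
]
print(detect_reruns(runs))  # ['cypress-e2e']
```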

### Signal 5 — Symptom pattern match (weak, starting point only)

The error message matches a known timing or infrastructure error pattern:

| Pattern | What it usually indicates |
|---|---|
| `CypressError: Timed out retrying after` | Race condition — element didn't become interactive in time |
| `cy.click() failed because it requires a DOM element` | Element disappeared or never mounted |
| `cy.type() failed because it requires a DOM element` | Same as above, for input fields |
| `AssertionError: Timed out retrying` | Assertion never became true — **distinguish**: if it names a specific element that should always exist, this may be a real defect |
| `socket hang up` / `ECONNRESET` | Network instability in CI |
| `net::ERR_CONNECTION_REFUSED` | CI service failed to start or crashed |
| `Cannot read properties of null` | Race condition — component unmounted or not yet mounted |

**A symptom match is a starting signal, not a verdict.** Always cross-reference with code overlap and cross-PR recurrence before classifying. A broken selector, a missing mock, or a genuine product bug can produce identical error messages.
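The table above can be applied mechanically as a first-pass filter. A minimal sketch using substring matching (the function name and the matching rule are illustrative; the skill's actual matching may differ):

```python
# Known timing/infrastructure symptom patterns from the table above, paired with
# what each usually indicates. Matching here is plain substring containment.
FLAKE_PATTERNS = {
    "CypressError: Timed out retrying after": "race condition (element not interactive in time)",
    "cy.click() failed because it requires a DOM element": "element disappeared or never mounted",
    "cy.type() failed because it requires a DOM element": "input field disappeared or never mounted",
    "AssertionError: Timed out retrying": "assertion never became true (may be a real defect)",
    "socket hang up": "network instability in CI",
    "ECONNRESET": "network instability in CI",
    "net::ERR_CONNECTION_REFUSED": "CI service failed to start or crashed",
    "Cannot read properties of null": "race condition (component not mounted)",
}

def match_symptoms(log_excerpt: str) -> list[str]:
    """Return the indication for every known pattern found in a log excerpt."""
    return [hint for pattern, hint in FLAKE_PATTERNS.items() if pattern in log_excerpt]

# Hypothetical log line:
log = "AssertionError: Timed out retrying: expected '.run-row' to exist"
print(match_symptoms(log))  # ['assertion never became true (may be a real defect)']
```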

---

## Confidence Model

| Tier | Verdict | Criteria |
|---|---|---|
| 1 | **Confirmed Flaky** | Registry match in `flaky-tests.yaml` |
| 2 | **Suspected Flaky** | Symptom pattern match, or cross-PR recurrence with no code overlap, or rerun detection — but not in the registry |
| 3 | **Likely Real** | No registry match and no symptom pattern match, or the failure overlaps with the PR's code changes |

For suspected flaky, the skill further annotates based on code overlap:
- **No overlap** — failure is likely unrelated to this PR; safe to rerun, but register if it passes
- **Overlap detected** — a real regression is plausible; investigate before dismissing
- **Unclear** — PR spans many areas or the test area is ambiguous; treat with caution
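Read as a decision procedure, the tier table and the overlap annotations combine roughly like this. This is an illustrative reading of the table, not the skill's code; the parameter names and `code_overlap` values are assumptions.

```python
def classify(in_registry: bool, symptom_match: bool, cross_pr_recurrence: bool,
             rerun_detected: bool, code_overlap: str) -> tuple[str, str]:
    """Apply the tiers: registry match > suspicion signals > likely real.

    `code_overlap` is one of "none", "overlap", "unclear".
    """
    if in_registry:  # Tier 1: the only signal that confirms on its own
        return ("Confirmed Flaky", "follow the registry entry's resolution")
    # Tier 2: any suspicion signal; recurrence only counts without code overlap
    suspected = symptom_match or rerun_detected or (cross_pr_recurrence and code_overlap == "none")
    if suspected:
        annotations = {
            "none": "likely unrelated to this PR; safe to rerun, register if it passes",
            "overlap": "a real regression is plausible; investigate before dismissing",
            "unclear": "PR spans many areas or test area is ambiguous; treat with caution",
        }
        return ("Suspected Flaky", annotations[code_overlap])
    return ("Likely Real", "investigate as a genuine failure")  # Tier 3

print(classify(False, True, False, False, "none"))
# ('Suspected Flaky', 'likely unrelated to this PR; safe to rerun, register if it passes')
```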

---

## What to Do When You Suspect a Flaky Test

1. **Check if it's already registered** — `/flake-check stats` or look at `flaky-tests.yaml`
2. **Rerun the failing check** — if it passes, that's strong evidence of flakiness
3. **Check for cross-PR recurrence** — `/flake-check scan --file <filename>` to see if it has happened before
4. **Look at the test code** — is there a missing `cy.wait('@alias')`, a missing `.should('be.visible')` guard, or an obvious race condition?
5. **If confirmed flaky** — run `/flake-check mark-flaky` to register it, then raise a Jira ticket to fix the root cause

---

## The Registry (`flaky-tests.yaml`)

Machine-readable source of truth for known flaky tests. Each entry has:

| Field | Description |
|---|---|
| `id` | Unique ID in `<area>-<NNN>` format (e.g. `pipelines-001`) |
| `test` | Exact test name from the `it()` block |
| `file` | Path relative to repo root |
| `area` | Short slug (e.g. `model-catalog`, `pipelines`, `workbenches`) |
| `symptoms` | List of error strings or patterns observed |
| `first_seen` / `last_seen` | ISO dates |
| `pr_occurrences` | PR numbers where this was observed |
| `status` | `active` / `intermittent` / `resolved` |
| `resolution` | What to do when this failure appears |
| `jira` | Tracking ticket (optional) |
| `notes` | Additional context (optional) |

Entries are created and updated via `/flake-check mark-flaky` — do not edit by hand unless correcting a mistake.
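For orientation, an entry might look like the following. This is purely illustrative: every value, path, and ticket ID below is hypothetical.

```yaml
# Illustrative flaky-tests.yaml entry — all values are hypothetical.
- id: pipelines-001
  test: "creates a pipeline run from the details page"   # hypothetical it() name
  file: cypress/e2e/pipelines/pipelineCreateRuns.cy.ts   # hypothetical path
  area: pipelines
  symptoms:
    - "CypressError: Timed out retrying after"
  first_seen: 2024-01-10
  last_seen: 2024-02-02
  pr_occurrences: [1234, 1301]
  status: active
  resolution: "Rerun the check; if it fails twice in a row, investigate as real."
  jira: PROJ-0000          # hypothetical ticket
  notes: "Run-creation modal races the pipelines API on slow CI nodes."
```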