Commit c4d7ad1

remove readme and update SKILL.md to include some of previous readme
1 parent f95c890 commit c4d7ad1

2 files changed

Lines changed: 68 additions & 125 deletions

packages/gen-ai/.claude/skills/flake-check/README.md

Lines changed: 0 additions & 122 deletions
This file was deleted.

packages/gen-ai/.claude/skills/flake-check/SKILL.md

Lines changed: 68 additions & 3 deletions
@@ -1,20 +1,81 @@
11
---
22
name: flake-check
3-
description: Investigates what's blocking a PR by running scripts to deterministically collect CI data, then classifying failures as suspected flaky or likely real. Pass a PR number to investigate, or "scan [N]" to survey recent PRs for patterns.
3+
description: Investigates CI failures tied to GitHub PRs and highlights suspected flaky tests. Uses live CI data, symptom pattern matching, and cross-PR scanning.
44
---
55

66
# Flake check — Investigates CI failures on a GitHub odh-dashboard PR and classifies them
77

88
Investigate failing CI checks on a PR and classify each failure using live CI data, symptom pattern matching, and cross-PR overlap analysis.
99

10+
## Quick Reference
11+
12+
| Invocation | What it does |
13+
|---|---|
14+
| `/flake-check <PR>` | Investigate a single PR — fetches failing checks, reads logs, classifies each failure |
15+
| `/flake-check <PR> --deep` | Same as above, plus detects checks that previously failed then passed on the same commit SHA |
16+
| `/flake-check scan` | Lightweight survey of the last 20 PRs for recurring failure patterns (no log fetching) |
17+
| `/flake-check scan --deep` | Same as above, plus rerun detection and cross-PR code overlap analysis for each pattern |
18+
| `/flake-check scan <N>d` | Survey PRs from the last N days (e.g. `scan 7d`) |
19+
| `/flake-check scan --file <filename>` | Find every PR in the window where a specific test file appeared in failures (fetches logs — slower) |
20+
21+
```
22+
/flake-check 7301
23+
/flake-check 7301 --deep
24+
/flake-check scan
25+
/flake-check scan --deep
26+
/flake-check scan 7d
27+
/flake-check scan 14d --deep
28+
/flake-check scan --file pipelineCreateRuns.cy.ts
29+
/flake-check scan 30d --file pipelineCreateRuns.cy.ts
30+
/flake-check scan --since 14d --until 7d
31+
/flake-check scan --since 2026-04-01 --until 2026-04-15
32+
```
33+
34+
---
35+
36+
## How We Identify Flaky Tests
37+
38+
Flakiness is a pattern, not a single event. The more signals you see, the more confident you can be. In rough order of strength:
39+
40+
**Signal 1 — Cross-PR recurrence (strong)**
41+
The same test or check fails on multiple unrelated PRs that touch different parts of the codebase. This is the strongest signal because it makes a PR-specific regression unlikely: if the test fails whether or not the relevant code changed, the cause is most likely the test or its environment rather than any one PR.
42+
43+
*Example:* `pipelineCreateRuns.cy.ts` fails on a PR that only changes `model-serving/`, a completely unrelated area.
44+
45+
*How to find it:*
46+
- `/flake-check scan` — lightweight survey of the last 20 PRs
47+
- `/flake-check scan 30d --file pipelineCreateRuns.cy.ts` — targeted search for a specific file
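
The recurrence idea behind Signal 1 can be sketched in a few lines. This is a minimal illustration, not the skill's actual scan script: the function name `recurring_failures`, the input shape (PR number mapped to failed test files), and the second file name are all assumptions made for the example.

```python
from collections import defaultdict


def recurring_failures(failures_by_pr, min_prs=2):
    """Return {test_file: sorted PR numbers} for files that failed on at least min_prs PRs."""
    prs_by_file = defaultdict(set)
    for pr_number, failed_files in failures_by_pr.items():
        for test_file in failed_files:
            prs_by_file[test_file].add(pr_number)
    return {
        test_file: sorted(prs)
        for test_file, prs in prs_by_file.items()
        if len(prs) >= min_prs
    }


# Hypothetical scan result: PR number -> test files that failed on that PR.
sample = {
    7301: ["pipelineCreateRuns.cy.ts"],
    7305: ["pipelineCreateRuns.cy.ts", "exampleOther.cy.ts"],
    7310: [],
}
print(recurring_failures(sample))
```

A file that shows up on two or more PRs is only a candidate. Whether those PRs are actually unrelated still has to be checked against what each one changed.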
48+
49+
**Signal 2 — Rerun detection (strong)**
50+
The check failed on an earlier run, then passed on a later run of the same commit SHA without any code change in between. Same code, different result — the test is non-deterministic by definition.
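
As a rough sketch of the rerun check, assuming each check run is a dictionary with hypothetical `name`, `sha`, `attempt`, and `conclusion` fields (the real script's JSON shape is not shown in this diff):

```python
from collections import defaultdict


def detect_reruns(check_runs):
    """Return (check name, sha) pairs that failed and later passed on the same commit."""
    runs_by_key = defaultdict(list)
    for run in check_runs:
        runs_by_key[(run["name"], run["sha"])].append(run)

    flagged = []
    for key, runs in runs_by_key.items():
        seen_failure = False
        # Walk attempts in order so only a pass that follows a failure counts.
        for run in sorted(runs, key=lambda r: r["attempt"]):
            if run["conclusion"] == "failure":
                seen_failure = True
            elif run["conclusion"] == "success" and seen_failure:
                flagged.append(key)
                break
    return flagged


# Hypothetical data: one check failed and then passed on the same SHA.
sample_runs = [
    {"name": "cypress-mock", "sha": "abc123", "attempt": 1, "conclusion": "failure"},
    {"name": "cypress-mock", "sha": "abc123", "attempt": 2, "conclusion": "success"},
    {"name": "lint", "sha": "abc123", "attempt": 1, "conclusion": "success"},
]
print(detect_reruns(sample_runs))
```

Grouping by both check name and SHA is what keeps this signal strong: a pass after a new commit proves nothing, because the code changed in between.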
51+
52+
*How to find it:* `/flake-check <PR> --deep` adds a "Previously Failed — Rerun Detected" section to the report.
53+
54+
**Signal 3 — No code overlap on a single PR (moderate)**
55+
The failing test exercises feature area X, but the PR only touches feature area Y. Not conclusive on its own (unrelated changes can still expose race conditions in a shared subsystem), but it lowers suspicion that the PR caused the failure.
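
The overlap check described under Signal 3 could be approximated by comparing the PR's changed paths with the directories a failing test exercises. The sketch below is an assumption-heavy illustration: `touches_area`, the example paths, and the idea of mapping a test to directory prefixes are all hypothetical, and in the skill itself this judgment is left to Claude's analysis.

```python
from pathlib import PurePosixPath


def touches_area(changed_files, area_prefixes):
    """Return True if any changed file lives under one of the given directory prefixes."""
    prefixes = [PurePosixPath(prefix).parts for prefix in area_prefixes]
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Compare whole path components so "pipelines" does not match "pipelines-old".
        if any(parts[: len(prefix)] == prefix for prefix in prefixes):
            return True
    return False


# Hypothetical paths: the PR changes model serving, the failing test covers pipelines.
changed = ["frontend/src/model-serving/list.tsx"]
print(touches_area(changed, ["frontend/src/pipelines"]))      # False: no overlap
print(touches_area(changed, ["frontend/src/model-serving"]))  # True: overlap
```

A `False` result only lowers suspicion. Changes to shared code can still break a test in another area.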
56+
57+
**Signal 4 — Symptom pattern match (weak, starting point only)**
58+
59+
| Pattern | What it usually indicates |
60+
|---------|--------------------------|
61+
| `CypressError: Timed out retrying after` | DOM timing race |
62+
| `cy.click() / cy.type() failed — requires a DOM element` | Element disappeared mid-test |
63+
| `AssertionError: Timed out retrying` | Timing race — but check the error detail; a missing named element may be a real defect |
64+
| `socket hang up` / `ECONNRESET` / `ERR_CONNECTION_REFUSED` | Infrastructure hiccup |
65+
| `Cannot read properties of null` | Race condition in test setup |
66+
67+
A symptom match is a starting signal, not a verdict. Real failures can produce identical patterns, so always check the error detail and look for corroborating signals before concluding flakiness.
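
The table above maps naturally onto a list of regular expressions. The following is a minimal sketch with assumed names (`SYMPTOM_PATTERNS`, `match_symptoms`). The regexes are loose approximations of the table rows, not the patterns the skill actually uses.

```python
import re

# (compiled pattern, what it usually indicates), one entry per table row.
SYMPTOM_PATTERNS = [
    (re.compile(r"CypressError: Timed out retrying after"), "DOM timing race"),
    (re.compile(r"cy\.(click|type)\(\) failed"), "Element disappeared mid-test"),
    (re.compile(r"AssertionError: Timed out retrying"), "Timing race (check the error detail)"),
    (re.compile(r"socket hang up|ECONNRESET|ERR_CONNECTION_REFUSED"), "Infrastructure hiccup"),
    (re.compile(r"Cannot read properties of null"), "Race condition in test setup"),
]


def match_symptoms(log_text):
    """Return the indication for every symptom pattern found in the log text."""
    return [label for pattern, label in SYMPTOM_PATTERNS if pattern.search(log_text)]


print(match_symptoms("Error: socket hang up"))
print(match_symptoms("expected 3 rows but found 2"))
```

An empty result does not mean the failure is real, and a match does not mean it is flaky. The output is only a prompt to look for the stronger signals.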
68+
69+
---
70+
1071
## Architecture
1172

1273
This skill separates **data collection** (deterministic Python scripts) from **analysis** (Claude reasoning):
1374

1475
| Phase | Who | What |
1576
|---|---|---|
1677
| Data collection | Scripts | Fetch PR state, CI logs — output clean JSON |
17-
| Classification | Claude | Apply confidence model using symptom patterns and code overlap |
78+
| Classification | Claude | Apply the confidence model using symptom patterns, code overlap with the failing tests, and other corroborating signals |
1879
| Report | Claude | Generate structured, actionable output |
1980

2081
Scripts live in `<base_path>/scripts/`. `<base_path>` is the absolute path to the directory containing this SKILL.md file — resolve it at skill load time from the skill's own path (e.g. `/Users/myuser/code/odh-dashboard/packages/gen-ai/.claude/skills/flake-check`). Run scripts from the repo root (the user's working directory) using their absolute path. All scripts output JSON to stdout; errors go to stderr.
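
The stdout/stderr contract described above (JSON on stdout, errors on stderr) can be consumed as shown below. This is a sketch under assumptions: `run_json_script` is a hypothetical helper, and the demo command is an inline stand-in because the real script names are not shown in this diff.

```python
import json
import subprocess
import sys


def run_json_script(argv):
    """Run a command and return its stdout parsed as JSON; raise with stderr text on failure."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip() or f"exit code {result.returncode}")
    return json.loads(result.stdout)


# Stand-in for a real script under <base_path>/scripts/: prints one JSON object to stdout.
demo_command = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'pr': 7301, 'failing': []}))",
]
data = run_json_script(demo_command)
print(data)
```

Keeping diagnostics on stderr means stdout can be passed straight to `json.loads` without filtering.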
@@ -29,7 +90,11 @@ Scripts have no external dependencies. They use only the Python standard library
2990
- A PR number (e.g. `4821`) → run the **Main Investigation** workflow
3091
- A PR number with `--deep` (e.g. `4821 --deep`) → Main Investigation with rerun detection
3192
- `scan` or `scan <N>` → run the **scan** workflow (N PRs, default 20)
93+
- `scan <N>d` (e.g. `scan 7d`) → scan PRs from the last N days
94+
- `scan --since <date/period> --until <date/period>` → scan a specific time window (e.g. `--since 14d --until 7d` or `--since 2026-04-01 --until 2026-04-15`)
95+
- `scan --deep` → scan with rerun detection and cross-PR code overlap analysis
3296
- `scan --file <filename>` (e.g. `scan --file pipelineCreateRuns.cy.ts`) → run the **file-scoped scan** workflow — fetches CI logs to find every PR where that specific test file failed
97+
- Any `scan` variant may combine modifiers (e.g. `scan 30d --deep --file pipelineCreateRuns.cy.ts`)
3398
- Empty → ask the user for a PR number, then run **Main Investigation**
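
The argument forms listed above could be parsed as follows. In the skill itself Claude interprets the arguments, so this is only an illustrative sketch. The function name `parse_args` and the returned dictionary keys are assumptions.

```python
import re


def parse_args(raw):
    """Map a /flake-check argument string to a workflow description (illustrative only)."""
    tokens = raw.split()
    if not tokens:
        return {"workflow": "ask_for_pr"}
    if tokens[0] != "scan":
        # Anything else is treated as a PR number; int() raises ValueError otherwise.
        return {"workflow": "main", "pr": int(tokens[0]), "deep": "--deep" in tokens[1:]}

    opts = {"workflow": "scan", "deep": False, "count": None, "days": None,
            "since": None, "until": None, "file": None}
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token == "--deep":
            opts["deep"] = True
        elif token in ("--file", "--since", "--until"):
            if i + 1 >= len(tokens):
                raise ValueError(f"{token} needs a value")
            opts[token[2:]] = tokens[i + 1]
            i += 1  # skip the value just consumed
        elif re.fullmatch(r"\d+d", token):
            opts["days"] = int(token[:-1])
        elif token.isdigit():
            opts["count"] = int(token)
        else:
            raise ValueError(f"unrecognized argument: {token}")
        i += 1

    if opts["file"] is not None:
        opts["workflow"] = "file_scan"
    # Default to the last 20 PRs only when no count or time window was given.
    if all(opts[key] is None for key in ("count", "days", "since", "until")):
        opts["count"] = 20
    return opts


print(parse_args("7301 --deep"))
print(parse_args("scan 30d --deep --file pipelineCreateRuns.cy.ts"))
```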
3499

35100
---
@@ -44,7 +109,7 @@ Apply this conservatively — a real regression can produce the same symptoms as
44109
- If your analysis concludes the failure is likely deterministic (wrong selector, missing element, new test with a bug), classify as Likely real instead and explain why
45110
- Do NOT label it flaky — surface it as a possibility
46111
- Prompt the dev: is this related to their changes? Has this test passed on `main` recently?
47-
- Recommended action: gather multiple signals to confirm — see "If You Confirm It's Flaky" in any investigation report; then log a RHOAIENG Task with label `flaky-test`
112+
- Recommended action: gather multiple signals to confirm — see "If You Confirm It's Flaky" in any investigation report; then log a RHOAIENG Task with label `flaky-test` if not already tracked
48113

49114
**Likely real**
50115
- No symptom pattern match and no cross-PR recurrence signal
