remove load_workflows.py in favor of inline whitelist in SKILL.md

jharan1 · jharan1 · commit f95c89020f8f · 2026-04-20T19:08:17.000+01:00
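
For reviewers: the lookup that the new SKILL.md table asks the skill to perform by hand can be sketched roughly as below. This is only an illustration, not code from this commit. The rows are transcribed from the table in the SKILL.md hunk; `CHECK_TABLE` and `classify_check` are hypothetical names, and the longest-prefix tie-break is an assumption carried over from the wording of the removed `load_workflows.py` instructions.

```python
# Rows transcribed from the check name table in SKILL.md:
# (prefix, type, flaky_risk, fetch_logs)
CHECK_TABLE = [
    ("Cypress-Mock-Tests", "cypress-mock", "high", True),
    ("test-and-build", "cypress-mock", "high", True),
    ("Unit-Tests", "jest", "low", True),
    ("Contract-Tests", "contract", "medium", True),
    ("test", "go-test", "low", True),
    ("Lint", "lint", "none", False),
    ("Type-Check", "type-check", "none", False),
    ("Build ", "build", "none", False),  # written as "Build *" in the table
    ("Application Quality Gate", "quality-gate", "none", False),
]


def classify_check(check_name):
    """Hypothetical helper: map a failing check name to its table row."""
    name = check_name.strip()
    # Matrix jobs append "(...)" to the name, so a plain prefix test is enough.
    matches = [row for row in CHECK_TABLE if name.startswith(row[0])]
    if not matches:
        # "anything else" row: the decision is left to judgment (None).
        return {"prefix": None, "type": "unknown",
                "flaky_risk": "unknown", "fetch_logs": None}
    # Assumed tie-break: the longest prefix wins, so "test-and-build"
    # is not swallowed by the shorter "test" row.
    prefix, check_type, risk, fetch_logs = max(matches, key=lambda r: len(r[0]))
    return {"prefix": prefix, "type": check_type,
            "flaky_risk": risk, "fetch_logs": fetch_logs}


if __name__ == "__main__":
    for example in ("Cypress-Mock-Tests (mcpCatalog, ...)", "test-and-build",
                    "test", "Build Frontend", "kustomize"):
        print(example, "->", classify_check(example))
```

The sketch also shows why the order of rows in the table does not matter as long as the longest matching prefix is chosen: `test` and `test-and-build` both match a `test-and-build` failure, and only the tie-break keeps it in the high-risk Cypress row.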
diff --git a/packages/gen-ai/.claude/skills/flake-check/README.md b/packages/gen-ai/.claude/skills/flake-check/README.md
@@ -98,6 +98,20 @@ For suspected flaky, the skill further annotates based on code overlap:
 
 ---
 
+## Supporting Scripts
+
+The skill uses three Python scripts in `scripts/` for deterministic data collection. All are zero-dependency (Python stdlib + `gh` CLI).
+
+| Script | Purpose |
+|---|---|
+| `fetch_pr_failures.py` | Fetches PR metadata and the list of failing/pending checks from GitHub |
+| `fetch_test_failures.py` | Downloads CI job logs and extracts individual failing test names and errors |
+| `scan_prs.py` | Surveys recent PRs for recurring failure patterns without fetching logs |
+
+Check name classification (deterministic vs. non-deterministic, flaky risk level) is handled via an inline whitelist table in SKILL.md. The whitelist covers the set of check names that appear in practice — including non-obvious ones like `test-and-build` (a high-flaky-risk Cypress suite) and `test` (a Go BFF test). If a new check type is added to CI, update the table in SKILL.md.
+
+---
+
 ## Tracking Flaky Tests
 
 Suspected/Confirmed flaky tests can be tracked as **Jira Tasks** with the `flaky-test` label in RHOAIENG. When an investigation identifies a suspected flaky test with strong evidence, the report includes pre-filled Jira fields — say "create the Jira" to file it.
diff --git a/packages/gen-ai/.claude/skills/flake-check/SKILL.md b/packages/gen-ai/.claude/skills/flake-check/SKILL.md
@@ -90,16 +90,20 @@ This returns a JSON object with:
 
 If there are no `failing_checks` and no `pending_checks`, report that all checks are passing and note any remaining review blockers from `review_decision`.
 
-For each `failing_check`, use judgment about whether to fetch logs:
-- **Self-explanatory deterministic checks** (`Lint`, `Type-Check`, `Build *`, `Application Quality Gate`) — classify immediately as a real failure without fetching logs; these jobs don't flake
-- **Ambiguous or unrecognised check names** — if the check name doesn't clearly indicate its job type, run the workflow catalog tool (see below) before deciding whether to fetch logs
-- **Non-deterministic checks** (`Cypress-Mock-Tests`, `Unit-Tests`, `Contract-Tests`, and any check with a matrix suffix in parentheses) — proceed to Step 2
-
-**Workflow catalog tool** — use when a check name is ambiguous or unrecognised:
-```bash
-python3 <base_path>/scripts/load_workflows.py --pr-only
-```
-The `check_prefix_index` maps check name prefixes to `{category, deterministic, flaky_risk, description}`. Match by finding the longest prefix that is a prefix of the failing check name (matrix jobs append values in parentheses, e.g. `Cypress-Mock-Tests (mcpCatalog, ...)` matches prefix `Cypress-Mock-Tests`).
+For each `failing_check`, use this table to decide whether to fetch logs. Matrix jobs append values in parentheses, so match by the longest prefix (e.g. `Cypress-Mock-Tests (mcpCatalog, ...)` matches `Cypress-Mock-Tests`, and `test-and-build` matches `test-and-build` rather than `test`):
+
+| Check name prefix | Type | Flaky risk | Action |
+|---|---|---|---|
+| `Cypress-Mock-Tests` | cypress-mock | high | Fetch logs → Step 2 |
+| `test-and-build` | cypress-mock | high | Fetch logs → Step 2 |
+| `Unit-Tests` | jest | low | Fetch logs → Step 2 |
+| `Contract-Tests` | contract | medium | Fetch logs → Step 2 |
+| `test` | go-test | low | Fetch logs → Step 2 |
+| `Lint` | lint | none | Skip — deterministic, classify as ❌ Likely Real |
+| `Type-Check` | type-check | none | Skip — deterministic, classify as ❌ Likely Real |
+| `Build *` | build | none | Skip — deterministic, classify as ❌ Likely Real |
+| `Application Quality Gate` | quality-gate | none | Skip — deterministic, classify as ❌ Likely Real |
+| anything else | unknown | unknown | Use judgment — if name suggests infra/setup, skip; otherwise fetch logs |
 
 ### Step 2 — Fetch test failure details for each non-deterministic failing check
 
@@ -312,7 +316,7 @@ Run this once per PR number across all patterns (deduplicate PR numbers to avoid
 **Signal assignment logic (in priority order):**
 1. `⚠️ Suspected Flaky` — check name appears in `patterns` (cross-PR recurrence); annotate with overlap verdict when `--deep` was used
 2. `❌ Likely Real` — check name is clearly deterministic (Lint, Type-Check, Build, kustomize) — these don't flake
-3. `❓ Unknown` — check name is ambiguous or is a test runner check with no cross-PR pattern yet; run `load_workflows.py --pr-only` and look up the name in `check_prefix_index` to resolve it; note that no pattern at scan level doesn't mean the failure isn't flaky — it means there isn't enough signal yet without fetching logs
+3. `❓ Unknown` — check name is ambiguous or is a test runner check with no cross-PR pattern yet; look it up in the check name table in Step 1 of Main Investigation to classify it; note that no pattern at scan level doesn't mean the failure isn't flaky — it means there isn't enough signal yet without fetching logs
 
 ```
 ## Recent PR Scan — <since>–<until> | <N> PRs scanned (<bots_excluded> bots excluded)
diff --git a/packages/gen-ai/.claude/skills/flake-check/scripts/load_workflows.py b/packages/gen-ai/.claude/skills/flake-check/scripts/load_workflows.py