|
1 | 1 | --- |
2 | 2 | name: triage_ci_failure |
3 | | -description: Triage CI failures, flaky tests, and broken builds in the sequencer mono-repo. Auto-invoke when a user mentions a failing CI job, flaky test, red check, or pastes a GitHub Actions URL — context (PR link, CI job link, base branch) must be gathered BEFORE any code investigation begins. |
| 3 | +description: Triage CI failures, flaky tests, and broken builds in the sequencer mono-repo. Use when a user mentions a failing CI job, flaky test, red check, or shares a GitHub Actions / PR URL — the skill pulls failure context directly from the public GitHub REST API so you can usually diagnose and report a verdict without asking the user any follow-up questions. |
4 | 4 | --- |
5 | 5 |
|
6 | 6 | # Triage CI Failure |
7 | 7 |
|
8 | | -When invoked (typically because someone tagged Claude in the mono-repo Slack channel about a CI failure or flaky test), follow this workflow to gather context before investigating. |
| 8 | +Usually invoked when someone tags Claude in the mono-repo Slack channel about a CI failure. The repo (`starkware-libs/sequencer`) is **public**, so most of the GitHub REST API is reachable without auth. Your goal: diagnose and report a verdict **without asking follow-up questions**. Pull what failed, on which step, with which annotation, on which attempts — then report. Only fall back to "please paste the logs" when the public API genuinely can't get you there; never ask to confirm context you can already fetch. |
9 | 9 |
|
10 | | -## Step 1: Gather Required Context |
| 10 | +**Endpoint catalog, pagination, rate limits, and which tools to use in each environment live in `references/github_api.md`.** Read it when you need an exact URL; this file is the decision flow. Substitute `O=starkware-libs`, `R=sequencer` throughout. |
11 | 11 |
|
12 | | -Before starting any investigation, you MUST have the following information. Check if any of these are missing from the message or thread: |
| 12 | +## Step 1: Resolve the input, then check it isn't stale |
13 | 13 |
|
14 | | -### Required Information |
| 14 | +From the message, extract a PR URL, run URL (`/actions/runs/{run_id}`), job URL (`.../job/{job_id}`), check-run id, commit SHA, or branch name. Key fact: **`job.id == check_run.id`** — one numeric id bridges "I have a job link" and "I want its annotations." With only a branch name, list recent failed runs on it before asking the user anything. |
15 | 15 |
|
16 | | -| Item | Why Needed | Example | |
17 | | -|------|------------|---------| |
18 | | -| **PR link** or **branch name** | To understand what code is being tested | `https://github.com/starkware-libs/sequencer/pull/12345` or `feature/my-branch` | |
19 | | -| **Failed CI job link** | To get a `details_url` you can open and ask the user to paste relevant log lines from | `https://github.com/starkware-libs/sequencer/actions/runs/123456/job/789` | |
20 | | -| **Base branch** | The branch this PR targets — check `scripts/parent_branch.txt` for the default, don't assume `main` | `main`, `release/v1.2`, `feature/epic-branch` | |
21 | | -| **Is this a new failure or flaky?** | Determines investigation approach | "Started failing today" vs "Fails ~10% of runs" | |
| 16 | +**Triage the pasted link — it's usually accurate for the failure they want explained.** Diagnose that run/job even if the user re-ran it afterward (a common flow: paste a link, then re-run assuming it's flaky). As a *complementary* check, note whether the run is stale: this repo uses Graphite stacks, so a run's `head_sha` can lag the PR's current `head.sha` (`GET /pulls/{pr}`). If they differ, also report the current head's status — so a "merge-gatekeeper noise, harmless" verdict on an old SHA doesn't mask a genuinely red live head. Add that as context; don't discard the pasted run in favor of the current head. |
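As a rough illustration of Step 1, the sketch below classifies a pasted link and checks staleness. It assumes only the URL shapes quoted in this file (`/pull/{n}`, `/actions/runs/{run_id}`, `.../job/{job_id}`) and the `head_sha` / `head.sha` fields named above; the helper names `resolve_input` and `is_stale` are hypothetical and not part of the skill.

```python
import re
from typing import Optional

# Most specific pattern first, so a job link is not mistaken for a bare run link.
_PATTERNS = [
    ("job", re.compile(r"/actions/runs/(?P<run_id>\d+)/job/(?P<job_id>\d+)")),
    ("run", re.compile(r"/actions/runs/(?P<run_id>\d+)")),
    ("pr", re.compile(r"/pull/(?P<pr>\d+)")),
]


def resolve_input(text: str) -> Optional[dict]:
    """Return the kind of link found in the message plus its numeric ids, or None."""
    for kind, pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            ids = {key: int(value) for key, value in match.groupdict().items()}
            return {"kind": kind, **ids}
    return None


def is_stale(run: dict, pull: dict) -> bool:
    """True when the run's head_sha lags the PR's current head.sha."""
    return run["head_sha"] != pull["head"]["sha"]
```

Because `job.id == check_run.id`, the `job_id` returned for a job link can be passed straight to the annotations call. A stale result is extra context for the report, not a reason to drop the pasted run.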
22 | 17 |
|
23 | | -### Nice to Have |
| 18 | +## Step 2: Did it already go green on a re-run? |
24 | 19 |
|
25 | | -- Error message snippet (the available GitHub MCP tools only expose check-run metadata, not raw Actions log output, so a pasted snippet often unblocks the fastest investigation) |
26 | | -- Whether this was working before a recent rebase |
27 | | -- Related PRs or recent merges that might have caused regression |
| 20 | +`GET /actions/runs/{run_id}` → check `run_attempt`, `previous_attempt_url`, `conclusion`. If it was re-run (`run_attempt > 1` or `previous_attempt_url` set) **and** the latest `conclusion` is `success`, the workflow already passed on retry — **lead with "the PR isn't blocked."** |
28 | 21 |
|
29 | | ---- |
30 | | - |
31 | | -## Step 2: If Missing Information, Ask First |
| 22 | +But don't wave it away: a fail-then-pass with no code change is a **flaky test**, a real signal worth understanding. So still: |
| 23 | +1. Find the flaky job/step — `GET /actions/runs/{run_id}/jobs?filter=all` (the `filter=all` matters; the default hides the failed earlier attempt). Pull that job's annotations. |
| 24 | +2. Judge whether it's a known flake (see Step 4's flakiness note). |
| 25 | +3. Report a recommendation, e.g. *"Passed on re-run (attempt 2), PR not blocked. Attempt-1 failure was `run-integration-tests` — flaky; worth a tracking issue rather than per-PR re-runs."* |
32 | 26 |
|
33 | | -If ANY required information is missing, reply in the thread (Slack or PR comment, wherever you were invoked) asking for it. Do NOT start investigating with incomplete context. |
| 27 | +Corollary trap: a pasted *job* link can point at a failed earlier attempt while the run is now green. Reconcile the job's `run_attempt` against the run's current one (via `filter=all`) before calling anything broken. |
34 | 28 |
|
35 | | -**Template response:** |
| 29 | +**Diagnose the sporadic failure either way.** A green-on-latest run doesn't end the triage — if a failure happened (even once, even already re-run away), root-cause it via Steps 3–4. Continue below regardless; the only thing the green status changes is the "is the PR blocked?" answer. |
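A minimal sketch of the Step 2 checks, assuming the run and job objects have already been fetched and parsed into dicts. The field names (`run_attempt`, `previous_attempt_url`, `conclusion`) are the ones listed above; the function names and the returned labels are hypothetical.

```python
def rerun_verdict(run: dict) -> str:
    """Classify a workflow run object from GET /actions/runs/{run_id}."""
    was_rerun = run.get("run_attempt", 1) > 1 or bool(run.get("previous_attempt_url"))
    is_green = run.get("conclusion") == "success"
    if was_rerun and is_green:
        # Fail-then-pass: the PR is not blocked, but the earlier failure still needs a root cause.
        return "passed_on_rerun"
    if is_green:
        return "passed"
    return "not_green"


def job_is_from_earlier_attempt(job: dict, run: dict) -> bool:
    """True when a pasted job link points at an attempt older than the run's current one."""
    return job["run_attempt"] < run["run_attempt"]
```

A `passed_on_rerun` result only changes the "is the PR blocked?" answer; the triage continues with Steps 3 and 4 either way.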
36 | 30 |
|
37 | | -> To investigate this properly, I need a bit more context: |
38 | | -> |
39 | | -> - [ ] **PR/Branch**: Which PR or branch is failing? (link preferred) |
40 | | -> - [ ] **CI Job**: Link to the failed job and, if convenient, paste the relevant error lines |
41 | | -> - [ ] **Base branch**: What branch is this targeting? (don't assume main) |
42 | | -> - [ ] **Failure pattern**: Is this a new failure or has it been flaky? |
43 | | -> |
44 | | -> Once I have these, I'll dig in! |
| 31 | +## Step 3: Fast path — PR to root cause in a few calls |
45 | 32 |
|
46 | | -Adapt this based on what's already provided — only ask for what's missing. |
| 33 | +1. `GET /pulls/{pr}` → `head.sha`, `base.ref` |
| 34 | +2. `GET /commits/{head.sha}/check-runs?filter=all&per_page=100` → every check-run at that SHA |
| 35 | +3. Keep `conclusion in ('failure','timed_out')`; **skip `cancelled`/`skipped`/`neutral`** — a `cancelled` job usually means a sibling failed first, so the cause is elsewhere |
| 36 | +4. For each failing check, `GET /check-runs/{check_id}/annotations` → the inline error (file + line + text) is usually all you need |
47 | 37 |
|
48 | | ---- |
| 38 | +**merge-gatekeeper / merge-gatekeeper-new** failing alone is a downstream alarm — something else failed first. Look at sibling check-runs at the same SHA or the previous attempt. Second mode: gatekeeper also fails by **timing out** waiting on a required check that never reached `success` (e.g. a `cancelled` sibling) — then there's *no* failed sibling at this SHA; the real red is usually on a newer SHA, i.e. the pasted run is stale (Step 1). |
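The filtering rule in Step 3 and the gatekeeper caveat could be sketched as follows. This assumes `check_runs` is the list taken from the `check_runs` key of the check-runs response and that each entry carries `name` and `conclusion`; the helper names are hypothetical.

```python
from typing import List

FAILING = {"failure", "timed_out"}
GATEKEEPERS = {"merge-gatekeeper", "merge-gatekeeper-new"}


def failing_checks(check_runs: List[dict]) -> List[dict]:
    """Keep real failures; cancelled, skipped and neutral checks drop out."""
    return [check for check in check_runs if check.get("conclusion") in FAILING]


def only_gatekeeper_failed(check_runs: List[dict]) -> bool:
    """True when every failing check at this SHA is a merge-gatekeeper check."""
    failed = failing_checks(check_runs)
    return bool(failed) and all(check["name"] in GATEKEEPERS for check in failed)
```

When `only_gatekeeper_failed` is true there is no failed sibling at this SHA, which suggests looking at the previous attempt or at a newer SHA rather than at the gatekeeper itself.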
49 | 39 |
|
50 | | -## Step 3: Verify the Context |
| 40 | +## Step 4: When annotations aren't enough |
51 | 41 |
|
52 | | -Once you have the required information: |
| 42 | +Annotations are the primary signal, but for test/build jobs they're often just `"Process completed with exit code 1"`. **Treat a generic exit-code annotation (or an empty one, or null `output.*`) as no signal** — the real assertion/panic is only in the raw step log, which needs auth (`/logs` is 403 unauthenticated; reach it via `gh run view --job {id} --log-failed` when authed — see `references/github_api.md`). |
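One way to apply the "generic annotation means no signal" rule is sketched below. It assumes each annotation is a dict with a `message` field, as returned by the annotations endpoint; the regular expression and the name `has_signal` are illustrative only.

```python
import re
from typing import List

# The generic runner message quoted above, with any exit code.
_GENERIC = re.compile(r"^Process completed with exit code \d+\.?$")


def has_signal(annotations: List[dict]) -> bool:
    """True if at least one annotation says more than the generic exit-code line."""
    for annotation in annotations:
        message = (annotation.get("message") or "").strip()
        if message and not _GENERIC.match(message):
            return True
    return False
```

An empty list, an empty message, and the bare exit-code line all count as no signal, which is the cue to reach for the raw step log or the diff.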
53 | 43 |
|
54 | | -1. **Open the PR** — use `mcp__github__pull_request_read` with `method=get` to confirm the base branch, changed files, and any existing review comments |
55 | | -2. **Inspect the failed check** — use `method=get_check_runs` for status/conclusion and the `details_url`; for raw Actions logs you'll need the user to paste them (no MCP tool returns them directly) |
56 | | -3. **Check if known flaky** — search CLAUDE.md "Common Gotchas" and recent Slack history for known flaky tests |
57 | | -4. **Determine scope** — is this related to the PR's changes, or a pre-existing/infrastructure issue? |
| 44 | +When logs are unreachable, **don't go silent — narrow it from the diff.** Pull `pulls/{pr}/files`; if the failing job is `run-tests` and the PR edits `crates/foo/.../bar_test.rs` or a fixture, report a *scoped hypothesis* ("likely a `foo` test or stale fixture from this rename") plus the one confirming command. That beats both a bare "can't see logs" and a fabricated test name. |
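To narrow from the diff, a small helper can reduce the `pulls/{pr}/files` response to the crates a PR touches. This is only a sketch: it assumes each entry has a `filename` field and that crates live under a top-level `crates/` directory, as the example path above suggests; `touched_crates` is a hypothetical name.

```python
from typing import List


def touched_crates(files: List[dict]) -> List[str]:
    """Return the sorted, de-duplicated crate names edited by the PR."""
    crates = set()
    for entry in files:
        parts = entry["filename"].split("/")
        # Only paths of the form crates/<name>/... identify a crate.
        if len(parts) >= 3 and parts[0] == "crates":
            crates.add(parts[1])
    return sorted(crates)
```

The result scopes the hypothesis (which crate's tests to suspect); it does not identify a specific test, so the report should still name the confirming command.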
58 | 45 |
|
59 | | ---- |
| 46 | +**Flakiness check:** to tell flaky from newly-broken, see whether the same job fails in unrelated runs. Note many jobs here (`run-integration-tests`, `run-tests`) run only on `pull_request`, never `push` to `main` — so `branch=main&status=failure` won't show them and you'd wrongly conclude "not a known flake." For those, judge by (a) whether this run went green on re-run (Step 2, strongest signal) and (b) scanning recent failed runs of the same workflow across other PRs. Say which signal you used. |
60 | 47 |
|
61 | | -## Step 4: Investigate and Report |
| 48 | +## Step 5: Report — ask only when genuinely blocked |
62 | 49 |
|
63 | | -Only after completing steps 1-3, begin your investigation: |
| 50 | +You usually have enough to classify the failure yourself. Report directly; don't tack on reflexive questions — every needless "is this flaky for you?" trains the user to expect noise. Answer these yourself rather than asking: |
| 51 | +- **New or flaky?** → flakiness check above. |
| 52 | +- **Caused by this PR?** → diff `pulls/{pr}/files` against the failing crate/test path. |
| 53 | +- **Known pattern?** → see Common patterns below. |
64 | 54 |
|
65 | | -1. **If it's a code issue in the PR**: identify the root cause, propose a fix |
66 | | -2. **If it's a known flaky test**: link to prior discussions, explain the flakiness pattern |
67 | | -3. **If it's infrastructure/transient**: suggest a re-run and explain why |
68 | | -4. **If unclear**: share what you found and what you'd need to dig deeper |
| 55 | +Ask the user *only* when a tool genuinely can't close the gap: |
| 56 | +- annotation empty/generic AND `output.text` empty AND no `gh`/MCP raw-log access → ask for a paste; |
| 57 | +- the cause hinges on something only they know (e.g. "did your last rebase pick up commit X?") → ask that. |
69 | 58 |
|
70 | | -Always report back in the thread with: |
71 | | -- What you found |
72 | | -- Whether action is needed |
73 | | -- Proposed next steps (if any) |
| 59 | +Otherwise don't ask — report and move on. |
74 | 60 |
|
75 | | ---- |
| 61 | +## Step 6: Classify and report |
76 | 62 |
|
77 | | -## Step 5: Commit and Push |
| 63 | +1. **Code issue in the PR** — name the file/line, propose a fix |
| 64 | +2. **Known flaky test** — link prior discussion, suggest re-run |
| 65 | +3. **Infrastructure / transient** (network, action-download, GCloud) — suggest re-run, explain why |
| 66 | +4. **Pre-existing on the base branch** — call it out; the PR didn't cause it |
78 | 67 |
|
79 | | -When fixing the issue, create one commit per PR. |
| 68 | +State what you found, whether action is needed, and the next step. |
80 | 69 |
|
81 | | ---- |
| 70 | +## Step 7: Fix only if asked |
82 | 71 |
|
83 | | -## Common Patterns in This Repo |
| 72 | +Apply a fix and commit **only** if the user explicitly asks. A triage request isn't an implicit "go patch it." Commit convention: `scope: subject` (no `feat:`/`fix:` prefix), one commit per PR. |
84 | 73 |
|
85 | | -From CLAUDE.md — these failures are often NOT code bugs: |
| 74 | +## Common patterns in this repo |
86 | 75 |
|
87 | 76 | - `blockifier_reexecution` — transient GCloud network issues; suggest re-run |
88 | | -- `merge-gatekeeper` / `merge-gatekeeper-new` — downstream failures (other checks failed first) |
89 | | -- Formatting failures — run `scripts/rust_fmt.sh` (uses pinned nightly toolchain), NOT `cargo fmt` directly |
| 77 | +- `merge-gatekeeper` / `merge-gatekeeper-new` — downstream/timeout failure; find the upstream cause (Step 3) |
| 78 | +- Formatting failures — run `scripts/rust_fmt.sh` (pinned nightly), NOT `cargo fmt` directly |
| 79 | +- Action-download failures from `codeload.github.com` (404/503) — GitHub-side flake; re-run |