Commit ca1636c

yonromai, yoblin, and claude authored
Add Claude triage step to canary ferry workflows (#4177)
## Summary

When a scheduled canary ferry fails on main, Claude now triages the failure before the cluster is torn down. It gathers diagnostics (kubectl/iris logs, pod state, events), identifies the root cause, files a GitHub issue with structured context, and writes a `slack_message.md` that the Slack step picks up instead of the old static one-liner.

- New skill: `.agents/skills/canary-triage/SKILL.md`, a self-contained triage prompt (diagnosis only, no code changes)
- Both TPU and GPU canary workflows updated with the same pattern
- Claude step runs between "Capture failure diagnostics" and cluster teardown (GPU) / Slack (TPU), so it has live cluster access
- Slack step falls back to the original message if Claude didn't produce one
- 30-minute timeout, scheduled failures only

## Test plan

- [ ] Trigger GPU canary via `workflow_dispatch` with a low `target_tokens` to force a metric validation failure; verify Claude files an issue and the Slack message includes root cause
- [ ] Same for TPU canary
- [ ] Verify a successful canary run is unaffected (Claude step is skipped)
- [ ] Verify manual `workflow_dispatch` runs skip the Claude step (`github.event_name != 'schedule'`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 180cb79 commit ca1636c

3 files changed

Lines changed: 146 additions & 12 deletions

.agents/skills/canary-triage/SKILL.md

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+---
+name: canary-triage
+description: Triage a failed canary ferry run. Gather diagnostics, identify root cause, file a GitHub issue, and write a Slack summary. Used by CI on scheduled canary failures.
+---
+
+# Skill: Canary Triage
+
+Triage a failed canary ferry run. Diagnose root cause, file a GitHub issue,
+write a Slack summary. Diagnosis and reporting only — no code changes, no PRs.
+
+## Inputs (environment variables)
+
+| Variable | Description |
+|---|---|
+| `CANARY_LANE` | `gpu` (CoreWeave) or `tpu` (GCP) |
+| `CANARY_JOB_ID` | Iris job ID |
+| `CANARY_RUN_ID` | W&B run ID |
+| `IRIS_CONFIG` | Path to Iris cluster config |
+| `IRIS_NAMESPACE` | Kubernetes namespace (CW only) |
+| `WANDB_ENTITY` | W&B entity |
+| `WANDB_PROJECT` | W&B project |
+| `GHA_RUN_URL` | Full URL to the GitHub Actions run |
+
+## Steps
+
+### 1. Gather diagnostics
+
+The cluster is still live. Collect signal now — it will be torn down after you.
+
+- Iris job state via `.venv/bin/iris --config=$IRIS_CONFIG job list --json`
+- **GPU lane:** you have kubectl at `~/.kube/coreweave-iris`, namespace `$IRIS_NAMESPACE`.
+  Get pod status, controller logs, task pod logs, warning events, pod describe.
+- **TPU lane:** use `iris process logs` and `iris job list`.
+- Re-run `scripts/canary/validate_canary_metrics.py` if you need the validation output.
+
+### 2. Identify root cause
+
+Classify into one of: **infra/scheduling**, **training crash**, **metric regression**,
+**controller bug**, **data/storage**.
+
+Use hypothesis-driven diagnosis: state hypothesis, gather evidence, narrow.
+Attempt to reproduce the issue locally and minimally.
+Triple check that you're narrowing down on the same issue as the one that actually broke the canary.
+
+### 3. File a GitHub issue
+
+Follow the `file-issue` skill. Use the bug-report template.
+
+- **Title:** `[canary-{lane}] {short failure description}`
+- **Labels:** `bug`, `agent-generated`, `canary`
+- **Body must include** a "Canary run context" section with: lane, job ID,
+  GHA run URL, W&B run URL, date.
+- Support your claims using supporting data (e.g. runtime logs)
+- Keep the issue concise and maximally readable for humans.
+- Use GFM to make the details (e.g. log traces, code to reproduce issue) optional and declutter the issue.
+- Use `--body-file` with a temp file (see `file-issue` skill for the pattern).
+
+### 4. Write `slack_message.md`
+
+Write to the repo root. The workflow reads this file and sends it to Slack.
+Always write this file, even if issue creation failed.
+
+Format — keep to 4 lines max:
+
+```
+:red_circle: *{GPU|TPU} Canary failed* — {one-line summary}
+*Root cause:* {category} — {1 sentence}
+*Issue:* {github issue URL}
+*GHA run:* {GHA_RUN_URL}
+```
+
+If root cause is unclear, say so: `root cause unclear` with your best-guess signals.
+
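
The reporting half of the skill (steps 3 and 4) is mostly string assembly, so a rough Python sketch may help make the expected shapes concrete. This is an illustration only, not code from the commit: `TriageReport`, `issue_title`, `gh_issue_command`, and `slack_message` are hypothetical names, and the sketch assumes the standard `gh issue create` flags `--title`, `--label`, and `--body-file`. The actual `file-issue` skill may assemble the command differently.

```python
from __future__ import annotations

from dataclasses import dataclass

LABELS = ("bug", "agent-generated", "canary")  # labels listed in step 3
EM_DASH = "\u2014"  # separator character used in the skill's Slack template


@dataclass
class TriageReport:
    """Hypothetical container for the triage outcome (not part of the skill)."""

    lane: str         # "gpu" or "tpu", i.e. the value of CANARY_LANE
    summary: str      # one-line failure summary
    category: str     # e.g. "infra/scheduling", or "root cause unclear"
    detail: str       # one sentence on the root cause
    issue_url: str    # URL of the filed issue, or a note that filing failed
    gha_run_url: str  # the value of GHA_RUN_URL


def issue_title(lane: str, description: str) -> str:
    """Build the title in the `[canary-{lane}] {description}` form."""
    return f"[canary-{lane}] {description}"


def gh_issue_command(title: str, body_file: str) -> list[str]:
    """Return an argv list for `gh issue create`; the caller would run it."""
    cmd = ["gh", "issue", "create", "--title", title, "--body-file", body_file]
    for label in LABELS:
        cmd += ["--label", label]
    return cmd


def slack_message(report: TriageReport) -> str:
    """Render the four-line Slack summary described in step 4."""
    lines = [
        f":red_circle: *{report.lane.upper()} Canary failed* {EM_DASH} {report.summary}",
        f"*Root cause:* {report.category} {EM_DASH} {report.detail}",
        f"*Issue:* {report.issue_url}",
        f"*GHA run:* {report.gha_run_url}",
    ]
    return "\n".join(lines) + "\n"
```

Under these assumptions, the result of `slack_message(report)` would be written to `slack_message.md` in the repo root, including when issue creation failed, so that the workflow's Slack step finds it.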

.github/workflows/marin-canary-ferry-cw.yaml

Lines changed: 37 additions & 6 deletions
@@ -17,6 +17,8 @@ on:
 permissions:
   contents: read # actions/checkout
   packages: write # docker login ghcr.io for iris cluster start
+  issues: write # claude triage files issues
+  id-token: write # claude-code-action OIDC
 
 jobs:
   canary-ferry-cw:
@@ -182,6 +184,30 @@ jobs:
           kubectl --kubeconfig ~/.kube/coreweave-iris -n ${{ env.IRIS_NAMESPACE }} \
             get events --sort-by='.lastTimestamp' --field-selector type!=Normal || true
 
+      - name: Claude triage
+        id: claude_triage
+        if: failure() && github.event_name == 'schedule'
+        uses: anthropics/claude-code-action@v1
+        timeout-minutes: 30
+        with:
+          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN || secrets.CLAUDE_MAX_OAUTH_TOKEN }}
+          prompt: |
+            Read .agents/skills/canary-triage/SKILL.md and follow it.
+          claude_args: |
+            --model opus
+            --max-turns 50
+            --allowedTools "Bash(kubectl:*),Bash(gh:*),Bash(.venv/bin/iris:*),Bash(.venv/bin/python:*),Bash(cat:*),Bash(jq:*),Bash(head:*),Bash(tail:*),Bash(grep:*)"
+        env:
+          CANARY_LANE: gpu
+          CANARY_JOB_ID: ${{ steps.submit.outputs.job_id }}
+          CANARY_RUN_ID: ${{ env.RUN_ID }}
+          IRIS_CONFIG: ${{ env.IRIS_CONFIG }}
+          IRIS_NAMESPACE: ${{ env.IRIS_NAMESPACE }}
+          WANDB_ENTITY: ${{ env.WANDB_ENTITY }}
+          WANDB_PROJECT: ${{ env.WANDB_PROJECT }}
+          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
+          GHA_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
+
       # `cluster stop` only deletes Pods; NodePools survive and rely on the
       # CW autoscaler to scale down. Delete them explicitly to avoid lingering
       # H100 costs.
@@ -198,9 +224,14 @@ jobs:
 
       - name: Notify Slack on failure
         if: failure() && github.event_name == 'schedule'
-        uses: slackapi/slack-github-action@v2
-        with:
-          webhook: ${{ secrets.SLACK_WEBHOOK_URL }}
-          webhook-type: incoming-webhook
-          payload: |
-            text: ":red_circle: *GPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+        env:
+          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
+          FALLBACK_TEXT: ":red_circle: *GPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+        run: |
+          if [ -f slack_message.md ]; then
+            TEXT=$(cat slack_message.md)
+          else
+            TEXT="$FALLBACK_TEXT"
+          fi
+          PAYLOAD=$(python3 -c "import sys,json; print(json.dumps({'text': sys.stdin.read()}))" <<< "$TEXT")
+          curl -sf -X POST -H 'Content-Type: application/json' -d "$PAYLOAD" "$SLACK_WEBHOOK_URL"
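
The new Slack step's fallback logic is compact in shell, so here is a hedged Python rendering of the same idea for readers who want to reason about it or test it in isolation. `build_slack_payload` is a hypothetical helper, not something the workflow defines, and the HTTP POST itself is deliberately left out. The trailing-newline handling mirrors what `$(cat ...)` followed by `<<<` does in bash: command substitution strips trailing newlines and the here-string appends exactly one.

```python
import json
from pathlib import Path


def build_slack_payload(
    fallback_text: str,
    message_path: Path = Path("slack_message.md"),
) -> str:
    """Return the JSON body for a Slack incoming webhook.

    Prefers the triage-written message file and falls back to the static
    text when the file does not exist.
    """
    if message_path.is_file():
        # Mirror `$(cat file)`: trailing newlines are stripped.
        text = message_path.read_text(encoding="utf-8").rstrip("\n")
    else:
        text = fallback_text
    # Mirror the here-string (`<<<`), which appends a single newline.
    return json.dumps({"text": text + "\n"})
```

Serializing through `json.dumps` is what keeps quotes, backslashes, and newlines in the message from breaking the payload, which is presumably why the workflow shells out to `python3` instead of interpolating the text into a JSON string by hand.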

.github/workflows/marin-canary-ferry.yaml

Lines changed: 36 additions & 6 deletions
@@ -12,6 +12,8 @@ on:
 
 permissions:
   contents: read
+  issues: write # claude triage files issues
+  id-token: write # claude-code-action OIDC
 
 jobs:
   canary-ferry:
@@ -149,11 +151,39 @@ jobs:
           .venv/bin/iris --config=${{ env.IRIS_CONFIG }} \
             job list --json 2>/dev/null | jq '.[0:5]' || true
 
-      - name: Notify Slack on failure
+      - name: Claude triage
+        id: claude_triage
         if: failure() && github.event_name == 'schedule'
-        uses: slackapi/slack-github-action@v2
+        uses: anthropics/claude-code-action@v1
+        timeout-minutes: 30
         with:
-          webhook: ${{ secrets.SLACK_WEBHOOK_URL }}
-          webhook-type: incoming-webhook
-          payload: |
-            text: ":red_circle: *TPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN || secrets.CLAUDE_MAX_OAUTH_TOKEN }}
+          prompt: |
+            Read .agents/skills/canary-triage/SKILL.md and follow it.
+          claude_args: |
+            --model opus
+            --max-turns 50
+            --allowedTools "Bash(gh:*),Bash(.venv/bin/iris:*),Bash(.venv/bin/python:*),Bash(cat:*),Bash(jq:*),Bash(head:*),Bash(tail:*),Bash(grep:*)"
+        env:
+          CANARY_LANE: tpu
+          CANARY_JOB_ID: ${{ steps.submit.outputs.job_id }}
+          CANARY_RUN_ID: ${{ env.RUN_ID }}
+          IRIS_CONFIG: ${{ env.IRIS_CONFIG }}
+          WANDB_ENTITY: ${{ env.WANDB_ENTITY }}
+          WANDB_PROJECT: ${{ env.WANDB_PROJECT }}
+          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
+          GHA_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
+
+      - name: Notify Slack on failure
+        if: failure() && github.event_name == 'schedule'
+        env:
+          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
+          FALLBACK_TEXT: ":red_circle: *TPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+        run: |
+          if [ -f slack_message.md ]; then
+            TEXT=$(cat slack_message.md)
+          else
+            TEXT="$FALLBACK_TEXT"
+          fi
+          PAYLOAD=$(python3 -c "import sys,json; print(json.dumps({'text': sys.stdin.read()}))" <<< "$TEXT")
+          curl -sf -X POST -H 'Content-Type: application/json' -d "$PAYLOAD" "$SLACK_WEBHOOK_URL"
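
The two lanes pass slightly different environments to the triage step: the TPU workflow omits `IRIS_NAMESPACE`, which the skill's input table marks as CW only. A small sketch of a pre-flight check over those inputs might look like the following. `missing_inputs` and the two tuples are hypothetical names introduced here for illustration; neither the skill nor the workflows contain such a check, and treating an empty value as missing is an assumption.

```python
from typing import List, Mapping

# Inputs listed in the skill's table that both lanes provide.
COMMON_INPUTS = (
    "CANARY_LANE",
    "CANARY_JOB_ID",
    "CANARY_RUN_ID",
    "IRIS_CONFIG",
    "WANDB_ENTITY",
    "WANDB_PROJECT",
    "GHA_RUN_URL",
)
# Only the GPU (CoreWeave) lane sets a Kubernetes namespace.
GPU_ONLY_INPUTS = ("IRIS_NAMESPACE",)


def missing_inputs(env: Mapping[str, str]) -> List[str]:
    """Return the names of expected inputs that are unset or empty."""
    required = list(COMMON_INPUTS)
    if env.get("CANARY_LANE") == "gpu":
        required += GPU_ONLY_INPUTS
    # Empty strings count as missing (an assumption of this sketch).
    return [name for name in required if not env.get(name)]
```

Calling `missing_inputs(os.environ)` at the start of a triage run would be one way to surface an incomplete environment early, for example when an expression such as `steps.submit.outputs.job_id` evaluates to an empty string.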
