Commit 04a1153

yoblin and claude committed
Add Claude triage step to canary ferry workflows
On scheduled canary failures (TPU and GPU), Claude now runs before cluster teardown to diagnose the failure, file a GitHub issue, and produce a Slack summary that replaces the static one-liner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 180cb79 commit 04a1153

3 files changed

Lines changed: 149 additions & 2 deletions

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+---
+name: canary-triage
+description: Triage a failed canary ferry run. Gather diagnostics, identify root cause, file a GitHub issue, and write a Slack summary. Used by CI on scheduled canary failures.
+---
+
+# Skill: Canary Triage
+
+Triage a failed canary ferry run. Diagnose root cause, file a GitHub issue,
+write a Slack summary. Diagnosis and reporting only — no code changes, no PRs.
+
+## Inputs (environment variables)
+
+| Variable | Description |
+|---|---|
+| `CANARY_LANE` | `gpu` (CoreWeave) or `tpu` (GCP) |
+| `CANARY_JOB_ID` | Iris job ID |
+| `CANARY_RUN_ID` | W&B run ID |
+| `IRIS_CONFIG` | Path to Iris cluster config |
+| `IRIS_NAMESPACE` | Kubernetes namespace (CW only) |
+| `WANDB_ENTITY` | W&B entity |
+| `WANDB_PROJECT` | W&B project |
+| `GHA_RUN_URL` | Full URL to the GitHub Actions run |
+
+## Steps
+
+### 1. Gather diagnostics
+
+The cluster is still live. Collect signal now — it will be torn down after you.
+
+- Iris job state via `.venv/bin/iris --config=$IRIS_CONFIG job list --json`
+- **GPU lane:** you have kubectl at `~/.kube/coreweave-iris`, namespace `$IRIS_NAMESPACE`.
+  Get pod status, controller logs, task pod logs, warning events, pod describe.
+- **TPU lane:** use `iris process logs` and `iris job list`.
+- Re-run `scripts/canary/validate_canary_metrics.py` if you need the validation output.
+
+### 2. Identify root cause
+
+Classify into one of: **infra/scheduling**, **training crash**, **metric regression**,
+**controller bug**, **data/storage**.
+
+Use hypothesis-driven diagnosis: state hypothesis, gather evidence, narrow.
+Attempt to reproduce the issue locally and minimally.
+Triple check that you're narrowing down on the same issue as the one that actually broke the canary.
+
+### 3. File a GitHub issue
+
+Follow the `file-issue` skill. Use the bug-report template.
+
+- **Title:** `[canary-{lane}] {short failure description}`
+- **Labels:** `bug`, `agent-generated`, `canary`
+- **Body must include** a "Canary run context" section with: lane, job ID,
+  GHA run URL, W&B run URL, date.
+- Support your claims using supporting data (e.g. runtime logs)
+- Keep the issue concise and maximally readable for humans.
+- Use GFM to make the details (e.g. log traces, code to reproduce issue) optional and declutter the issue.
+- Use `--body-file` with a temp file (see `file-issue` skill for the pattern).
+
+### 4. Write `slack_message.md`
+
+Write to the repo root. The workflow reads this file and sends it to Slack.
+Always write this file, even if issue creation failed.
+
+Format — keep to 4 lines max:
+
+```
+:red_circle: *{GPU|TPU} Canary failed* — {one-line summary}
+*Root cause:* {category} — {1 sentence}
+*Issue:* {github issue URL}
+*GHA run:* {GHA_RUN_URL}
+```
+
+If root cause is unclear, say so: `root cause unclear` with your best-guess signals.
+
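
Note on step 4 of the skill: the four-line format is simple enough to sketch as a pure function. The following is a minimal illustration only, not part of the commit. The helper name `render_slack_message`, its parameters, and the "not filed" wording for a failed issue creation are all assumptions; the skill itself only fixes the template and the category list.

```python
DASH = "\u2014"  # em dash, as used in the skill's message template

# Root-cause categories listed in step 2 of the skill.
CATEGORIES = (
    "infra/scheduling",
    "training crash",
    "metric regression",
    "controller bug",
    "data/storage",
)
UNCLEAR = "root cause unclear"  # explicit fallback allowed by the skill


def render_slack_message(lane, summary, category, detail, gha_run_url, issue_url=None):
    """Return the four-line Slack summary from step 4 (hypothetical helper)."""
    if lane not in ("gpu", "tpu"):
        raise ValueError(f"unknown lane: {lane!r}")
    if category not in CATEGORIES and category != UNCLEAR:
        raise ValueError(f"unknown category: {category!r}")
    # Assumption: when issue creation failed, say so rather than drop the line,
    # since the skill requires the file to be written either way.
    issue = issue_url or "not filed (issue creation failed)"
    lines = [
        f":red_circle: *{lane.upper()} Canary failed* {DASH} {summary}",
        f"*Root cause:* {category} {DASH} {detail}",
        f"*Issue:* {issue}",
        f"*GHA run:* {gha_run_url}",
    ]
    return "\n".join(lines) + "\n"
```

The returned string would then be written to `slack_message.md` at the repo root, for example with `pathlib.Path("slack_message.md").write_text(msg, encoding="utf-8")`.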

.github/workflows/marin-canary-ferry-cw.yaml

Lines changed: 39 additions & 1 deletion
@@ -17,6 +17,8 @@ on:
 permissions:
   contents: read # actions/checkout
   packages: write # docker login ghcr.io for iris cluster start
+  issues: write # claude triage files issues
+  id-token: write # claude-code-action OIDC
 
 jobs:
   canary-ferry-cw:
@@ -182,6 +184,30 @@ jobs:
           kubectl --kubeconfig ~/.kube/coreweave-iris -n ${{ env.IRIS_NAMESPACE }} \
             get events --sort-by='.lastTimestamp' --field-selector type!=Normal || true
 
+      - name: Claude triage
+        id: claude_triage
+        if: failure() && github.event_name == 'schedule'
+        uses: anthropics/claude-code-action@v1
+        timeout-minutes: 30
+        with:
+          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN || secrets.CLAUDE_MAX_OAUTH_TOKEN }}
+          prompt: |
+            Read .agents/skills/canary-triage/SKILL.md and follow it.
+          claude_args: |
+            --model opus
+            --max-turns 50
+            --allowedTools "Bash(kubectl:*),Bash(gh:*),Bash(.venv/bin/iris:*),Bash(.venv/bin/python:*),Bash(cat:*),Bash(jq:*),Bash(head:*),Bash(tail:*),Bash(grep:*)"
+        env:
+          CANARY_LANE: gpu
+          CANARY_JOB_ID: ${{ steps.submit.outputs.job_id }}
+          CANARY_RUN_ID: ${{ env.RUN_ID }}
+          IRIS_CONFIG: ${{ env.IRIS_CONFIG }}
+          IRIS_NAMESPACE: ${{ env.IRIS_NAMESPACE }}
+          WANDB_ENTITY: ${{ env.WANDB_ENTITY }}
+          WANDB_PROJECT: ${{ env.WANDB_PROJECT }}
+          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
+          GHA_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
+
       # `cluster stop` only deletes Pods; NodePools survive and rely on the
       # CW autoscaler to scale down. Delete them explicitly to avoid lingering
       # H100 costs.
@@ -196,11 +222,23 @@ jobs:
             echo "Keeping node pool alive (keep_nodepool=true)"
           fi
 
+      - name: Read triage summary
+        id: slack_message
+        if: failure() && github.event_name == 'schedule'
+        run: |
+          if [ -f slack_message.md ]; then
+            # Escape for JSON embedding: newlines to \n, quotes escaped
+            MSG=$(cat slack_message.md | python3 -c 'import sys,json; print(json.dumps(sys.stdin.read()))')
+          else
+            MSG='":red_circle: *GPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"'
+          fi
+          echo "text=$MSG" >> "$GITHUB_OUTPUT"
+
       - name: Notify Slack on failure
         if: failure() && github.event_name == 'schedule'
         uses: slackapi/slack-github-action@v2
         with:
           webhook: ${{ secrets.SLACK_WEBHOOK_URL }}
           webhook-type: incoming-webhook
           payload: |
-            text: ":red_circle: *GPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+            text: ${{ steps.slack_message.outputs.text }}
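
Note on the "Read triage summary" step: it relies on `json.dumps` turning the multi-line file into a single-line JSON string, which matters because a plain `name=value` line in `$GITHUB_OUTPUT` cannot span several lines. The sketch below restates that logic in Python so the behaviour is easier to check in isolation. It is an illustration, not what the workflow runs; the function name `slack_text_output` and its parameters are assumptions.

```python
import json
from pathlib import Path


def slack_text_output(workdir: Path, lane: str, run_url: str) -> str:
    """Build the `text=<json string>` line the step appends to $GITHUB_OUTPUT.

    Hypothetical helper mirroring the shell step: prefer slack_message.md,
    otherwise fall back to the static one-liner for the given lane.
    """
    summary = workdir / "slack_message.md"
    if summary.is_file():
        message = summary.read_text(encoding="utf-8")
    else:
        message = f":red_circle: *{lane.upper()} Canary failed*\nRun: {run_url}"
    # json.dumps escapes newlines and quotes, so the result is always one line.
    return "text=" + json.dumps(message)
```

Because the value is already a quoted JSON string, the Slack step can place it directly after `text:` in the payload without adding its own quotes.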

.github/workflows/marin-canary-ferry.yaml

Lines changed: 37 additions & 1 deletion
@@ -12,6 +12,8 @@ on:
 
 permissions:
   contents: read
+  issues: write # claude triage files issues
+  id-token: write # claude-code-action OIDC
 
 jobs:
   canary-ferry:
@@ -149,11 +151,45 @@ jobs:
           .venv/bin/iris --config=${{ env.IRIS_CONFIG }} \
             job list --json 2>/dev/null | jq '.[0:5]' || true
 
+      - name: Claude triage
+        id: claude_triage
+        if: failure() && github.event_name == 'schedule'
+        uses: anthropics/claude-code-action@v1
+        timeout-minutes: 30
+        with:
+          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN || secrets.CLAUDE_MAX_OAUTH_TOKEN }}
+          prompt: |
+            Read .agents/skills/canary-triage/SKILL.md and follow it.
+          claude_args: |
+            --model opus
+            --max-turns 50
+            --allowedTools "Bash(gh:*),Bash(.venv/bin/iris:*),Bash(.venv/bin/python:*),Bash(cat:*),Bash(jq:*),Bash(head:*),Bash(tail:*),Bash(grep:*)"
+        env:
+          CANARY_LANE: tpu
+          CANARY_JOB_ID: ${{ steps.submit.outputs.job_id }}
+          CANARY_RUN_ID: ${{ env.RUN_ID }}
+          IRIS_CONFIG: ${{ env.IRIS_CONFIG }}
+          WANDB_ENTITY: ${{ env.WANDB_ENTITY }}
+          WANDB_PROJECT: ${{ env.WANDB_PROJECT }}
+          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
+          GHA_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
+
+      - name: Read triage summary
+        id: slack_message
+        if: failure() && github.event_name == 'schedule'
+        run: |
+          if [ -f slack_message.md ]; then
+            MSG=$(cat slack_message.md | python3 -c 'import sys,json; print(json.dumps(sys.stdin.read()))')
+          else
+            MSG='":red_circle: *TPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"'
+          fi
+          echo "text=$MSG" >> "$GITHUB_OUTPUT"
+
       - name: Notify Slack on failure
         if: failure() && github.event_name == 'schedule'
         uses: slackapi/slack-github-action@v2
         with:
           webhook: ${{ secrets.SLACK_WEBHOOK_URL }}
           webhook-type: incoming-webhook
           payload: |
-            text: ":red_circle: *TPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+            text: ${{ steps.slack_message.outputs.text }}
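
Note on the two triage steps: they are near-identical, and the `--allowedTools` lists are the easiest place to miss a difference when reading the diff. A small sketch that extracts the command prefixes from both strings and compares them is shown here. The regular expression only handles the `Bash(<command>:*)` entries that appear in these two workflows; it is an assumption for illustration and not a parser for the action's full permission syntax.

```python
import re

# Copied verbatim from the two workflow files in this commit.
GPU_ALLOWED = (
    "Bash(kubectl:*),Bash(gh:*),Bash(.venv/bin/iris:*),Bash(.venv/bin/python:*),"
    "Bash(cat:*),Bash(jq:*),Bash(head:*),Bash(tail:*),Bash(grep:*)"
)
TPU_ALLOWED = (
    "Bash(gh:*),Bash(.venv/bin/iris:*),Bash(.venv/bin/python:*),"
    "Bash(cat:*),Bash(jq:*),Bash(head:*),Bash(tail:*),Bash(grep:*)"
)


def allowed_commands(spec: str) -> list[str]:
    """Return the command prefixes from `Bash(<command>:*)` entries, in order."""
    return re.findall(r"Bash\(([^:()]+):\*\)", spec)


# Commands available on the GPU lane but not on the TPU lane.
gpu_only = set(allowed_commands(GPU_ALLOWED)) - set(allowed_commands(TPU_ALLOWED))
# Commands available on the TPU lane but not on the GPU lane.
tpu_only = set(allowed_commands(TPU_ALLOWED)) - set(allowed_commands(GPU_ALLOWED))
```

With the strings above, the only lane-specific tool is `kubectl` on the GPU lane, which matches the skill's instruction that the TPU lane should rely on `iris process logs` and `iris job list` instead.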
