Add Claude triage step to canary ferry workflows (#4177)

yonromai · yoblin · claude · web-flow · commit ca1636cc25e9 · 2026-03-26T10:29:22.000-07:00
## Summary

When a scheduled canary ferry fails on main, Claude now triages the failure before the cluster is torn down. It gathers diagnostics (kubectl/iris logs, pod state, events), identifies the root cause, files a GitHub issue with structured context, and writes a `slack_message.md` that the Slack step picks up instead of the old static one-liner.

- New skill: `.agents/skills/canary-triage/SKILL.md`, a self-contained triage prompt (diagnosis only, no code changes)
- Both TPU and GPU canary workflows updated with the same pattern
- Claude step runs between "Capture failure diagnostics" and cluster teardown (GPU) / Slack (TPU), so it has live cluster access
- Slack step falls back to the original message if Claude didn't produce one
- 30-minute timeout, scheduled failures only

## Test plan

- [ ] Trigger GPU canary via `workflow_dispatch` with a low `target_tokens` to force a metric validation failure; verify Claude files an issue and the Slack message includes root cause
- [ ] Same for TPU canary
- [ ] Verify a successful canary run is unaffected (Claude step is skipped)
- [ ] Verify manual `workflow_dispatch` runs skip the Claude step (`github.event_name != 'schedule'`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
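
For readers who want to see the Slack fallback behavior in isolation, here is a minimal Python sketch of what the new "Notify Slack on failure" step does: use `slack_message.md` when it exists, otherwise fall back to the static text, then JSON-encode the result as the webhook body. The function name `build_slack_payload` and the placeholder run URL are illustrative assumptions, not part of this change. Unlike the workflow's `<<<` here-string, the sketch does not append a trailing newline to the text.

```python
import json
import tempfile
from pathlib import Path


def build_slack_payload(message_path: Path, fallback_text: str) -> str:
    """Return the JSON body for a Slack incoming webhook.

    Hypothetical helper: it approximates the shell logic in the workflow step,
    it is not code from this commit.
    """
    if message_path.is_file():
        # Like `$(cat slack_message.md)`, drop trailing newlines.
        text = message_path.read_text(encoding="utf-8").rstrip("\n")
    else:
        # Claude did not produce a message, so use the static one-liner.
        text = fallback_text
    return json.dumps({"text": text})


if __name__ == "__main__":
    # Placeholder fallback text; the real step interpolates the GHA run URL.
    fallback = ":red_circle: *GPU Canary failed*\nRun: <run URL>"
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "slack_message.md"
        # No file yet: the fallback text is used.
        print(build_slack_payload(path, fallback))
        # File present: its contents win over the fallback.
        path.write_text("triage summary\n", encoding="utf-8")
        print(build_slack_payload(path, fallback))
```

Encoding through `json.dumps` rather than string concatenation is what lets multi-line Markdown from the triage step reach Slack without manual escaping.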
diff --git a/.agents/skills/canary-triage/SKILL.md b/.agents/skills/canary-triage/SKILL.md
@@ -0,0 +1,73 @@
+---
+name: canary-triage
+description: Triage a failed canary ferry run. Gather diagnostics, identify root cause, file a GitHub issue, and write a Slack summary. Used by CI on scheduled canary failures.
+---
+
+# Skill: Canary Triage
+
+Triage a failed canary ferry run. Diagnose root cause, file a GitHub issue,
+write a Slack summary. Diagnosis and reporting only — no code changes, no PRs.
+
+## Inputs (environment variables)
+
+| Variable | Description |
+|---|---|
+| `CANARY_LANE` | `gpu` (CoreWeave) or `tpu` (GCP) |
+| `CANARY_JOB_ID` | Iris job ID |
+| `CANARY_RUN_ID` | W&B run ID |
+| `IRIS_CONFIG` | Path to Iris cluster config |
+| `IRIS_NAMESPACE` | Kubernetes namespace (GPU/CoreWeave lane only) |
+| `WANDB_ENTITY` | W&B entity |
+| `WANDB_PROJECT` | W&B project |
+| `GHA_RUN_URL` | Full URL to the GitHub Actions run |
+
+## Steps
+
+### 1. Gather diagnostics
+
+The cluster is still live. Collect signal now — it will be torn down after you.
+
+- Iris job state via `.venv/bin/iris --config=$IRIS_CONFIG job list --json`
+- **GPU lane:** you have kubectl at `~/.kube/coreweave-iris`, namespace `$IRIS_NAMESPACE`.
+  Get pod status, controller logs, task pod logs, warning events, pod describe.
+- **TPU lane:** use `iris process logs` and `iris job list`.
+- Re-run `scripts/canary/validate_canary_metrics.py` if you need the validation output.
+
+### 2. Identify root cause
+
+Classify into one of: **infra/scheduling**, **training crash**, **metric regression**,
+**controller bug**, **data/storage**.
+
+Use hypothesis-driven diagnosis: state hypothesis, gather evidence, narrow.
+Attempt to reproduce the issue locally and minimally.
+Triple-check that the issue you are narrowing down is the one that actually broke the canary.
+
+### 3. File a GitHub issue
+
+Follow the `file-issue` skill. Use the bug-report template.
+
+- **Title:** `[canary-{lane}] {short failure description}`
+- **Labels:** `bug`, `agent-generated`, `canary`
+- **Body must include** a "Canary run context" section with: lane, job ID,
+  GHA run URL, W&B run URL, date.
+- Back every claim with supporting data (e.g. runtime logs).
+- Keep the issue concise and maximally readable for humans.
+- Use GFM collapsible sections (`<details>`) for long content (e.g. log traces, repro code) so the issue stays uncluttered.
+- Use `--body-file` with a temp file (see `file-issue` skill for the pattern).
+
+### 4. Write `slack_message.md`
+
+Write to the repo root. The workflow reads this file and sends it to Slack.
+Always write this file, even if issue creation failed.
+
+Format — keep to 4 lines max:
+
+```
+:red_circle: *{GPU|TPU} Canary failed* — {one-line summary}
+*Root cause:* {category} — {1 sentence}
+*Issue:* {github issue URL}
+*GHA run:* {GHA_RUN_URL}
+```
+
+If root cause is unclear, say so: `root cause unclear` with your best-guess signals.
+
diff --git a/.github/workflows/marin-canary-ferry-cw.yaml b/.github/workflows/marin-canary-ferry-cw.yaml
@@ -17,6 +17,8 @@ on:
 permissions:
   contents: read   # actions/checkout
   packages: write  # docker login ghcr.io for iris cluster start
+  issues: write    # claude triage files issues
+  id-token: write  # claude-code-action OIDC
 
 jobs:
   canary-ferry-cw:
@@ -182,6 +184,30 @@ jobs:
           kubectl --kubeconfig ~/.kube/coreweave-iris -n ${{ env.IRIS_NAMESPACE }} \
             get events --sort-by='.lastTimestamp' --field-selector type!=Normal || true
 
+      - name: Claude triage
+        id: claude_triage
+        if: failure() && github.event_name == 'schedule'
+        uses: anthropics/claude-code-action@v1
+        timeout-minutes: 30
+        with:
+          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN || secrets.CLAUDE_MAX_OAUTH_TOKEN }}
+          prompt: |
+            Read .agents/skills/canary-triage/SKILL.md and follow it.
+          claude_args: |
+            --model opus
+            --max-turns 50
+            --allowedTools "Bash(kubectl:*),Bash(gh:*),Bash(.venv/bin/iris:*),Bash(.venv/bin/python:*),Bash(cat:*),Bash(jq:*),Bash(head:*),Bash(tail:*),Bash(grep:*)"
+        env:
+          CANARY_LANE: gpu
+          CANARY_JOB_ID: ${{ steps.submit.outputs.job_id }}
+          CANARY_RUN_ID: ${{ env.RUN_ID }}
+          IRIS_CONFIG: ${{ env.IRIS_CONFIG }}
+          IRIS_NAMESPACE: ${{ env.IRIS_NAMESPACE }}
+          WANDB_ENTITY: ${{ env.WANDB_ENTITY }}
+          WANDB_PROJECT: ${{ env.WANDB_PROJECT }}
+          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
+          GHA_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
+
       # `cluster stop` only deletes Pods; NodePools survive and rely on the
       # CW autoscaler to scale down. Delete them explicitly to avoid lingering
       # H100 costs.
@@ -198,9 +224,14 @@ jobs:
 
       - name: Notify Slack on failure
         if: failure() && github.event_name == 'schedule'
-        uses: slackapi/slack-github-action@v2
-        with:
-          webhook: ${{ secrets.SLACK_WEBHOOK_URL }}
-          webhook-type: incoming-webhook
-          payload: |
-            text: ":red_circle: *GPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+        env:
+          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
+          FALLBACK_TEXT: ":red_circle: *GPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+        run: |
+          if [ -f slack_message.md ]; then
+            TEXT=$(cat slack_message.md)
+          else
+            TEXT="$FALLBACK_TEXT"
+          fi
+          PAYLOAD=$(python3 -c "import sys,json; print(json.dumps({'text': sys.stdin.read()}))" <<< "$TEXT")
+          curl -sf -X POST -H 'Content-Type: application/json' -d "$PAYLOAD" "$SLACK_WEBHOOK_URL"
diff --git a/.github/workflows/marin-canary-ferry.yaml b/.github/workflows/marin-canary-ferry.yaml
@@ -12,6 +12,8 @@ on:
 
 permissions:
   contents: read
+  issues: write    # claude triage files issues
+  id-token: write  # claude-code-action OIDC
 
 jobs:
   canary-ferry:
@@ -149,11 +151,39 @@ jobs:
           .venv/bin/iris --config=${{ env.IRIS_CONFIG }} \
             job list --json 2>/dev/null | jq '.[0:5]' || true
 
-      - name: Notify Slack on failure
+      - name: Claude triage
+        id: claude_triage
         if: failure() && github.event_name == 'schedule'
-        uses: slackapi/slack-github-action@v2
+        uses: anthropics/claude-code-action@v1
+        timeout-minutes: 30
         with:
-          webhook: ${{ secrets.SLACK_WEBHOOK_URL }}
-          webhook-type: incoming-webhook
-          payload: |
-            text: ":red_circle: *TPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN || secrets.CLAUDE_MAX_OAUTH_TOKEN }}
+          prompt: |
+            Read .agents/skills/canary-triage/SKILL.md and follow it.
+          claude_args: |
+            --model opus
+            --max-turns 50
+            --allowedTools "Bash(gh:*),Bash(.venv/bin/iris:*),Bash(.venv/bin/python:*),Bash(cat:*),Bash(jq:*),Bash(head:*),Bash(tail:*),Bash(grep:*)"
+        env:
+          CANARY_LANE: tpu
+          CANARY_JOB_ID: ${{ steps.submit.outputs.job_id }}
+          CANARY_RUN_ID: ${{ env.RUN_ID }}
+          IRIS_CONFIG: ${{ env.IRIS_CONFIG }}
+          WANDB_ENTITY: ${{ env.WANDB_ENTITY }}
+          WANDB_PROJECT: ${{ env.WANDB_PROJECT }}
+          WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
+          GHA_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
+
+      - name: Notify Slack on failure
+        if: failure() && github.event_name == 'schedule'
+        env:
+          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
+          FALLBACK_TEXT: ":red_circle: *TPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+        run: |
+          if [ -f slack_message.md ]; then
+            TEXT=$(cat slack_message.md)
+          else
+            TEXT="$FALLBACK_TEXT"
+          fi
+          PAYLOAD=$(python3 -c "import sys,json; print(json.dumps({'text': sys.stdin.read()}))" <<< "$TEXT")
+          curl -sf -X POST -H 'Content-Type: application/json' -d "$PAYLOAD" "$SLACK_WEBHOOK_URL"
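
The four-line `slack_message.md` format that `SKILL.md` specifies can be sketched as a small renderer. This is an illustration under stated assumptions, not code from this change: `TriageResult`, `render_slack_message`, and the validation rules are hypothetical names and choices, and treating `root cause unclear` as an extra category value is one reading of the skill's last instruction.

```python
from dataclasses import dataclass

# Categories listed in step 2 of the skill.
CATEGORIES = (
    "infra/scheduling",
    "training crash",
    "metric regression",
    "controller bug",
    "data/storage",
)
# Assumption: an unclear diagnosis is rendered in the category slot.
UNCLEAR = "root cause unclear"
# Separator character used in the skill's message template.
DASH = "\u2014"


@dataclass(frozen=True)
class TriageResult:
    """Hypothetical container for the fields the Slack summary needs."""

    lane: str  # "gpu" or "tpu", as in CANARY_LANE
    summary: str
    category: str
    root_cause: str
    issue_url: str
    gha_run_url: str


def render_slack_message(result: TriageResult) -> str:
    """Render the summary in the four-line layout from the skill."""
    if result.lane not in ("gpu", "tpu"):
        raise ValueError(f"unknown lane: {result.lane!r}")
    if result.category not in CATEGORIES + (UNCLEAR,):
        raise ValueError(f"unknown category: {result.category!r}")
    fields = (result.summary, result.root_cause, result.issue_url, result.gha_run_url)
    if any("\n" in field for field in fields):
        # A newline inside a field would break the 4-line limit.
        raise ValueError("fields must be single-line")
    lines = [
        f":red_circle: *{result.lane.upper()} Canary failed* {DASH} {result.summary}",
        f"*Root cause:* {result.category} {DASH} {result.root_cause}",
        f"*Issue:* {result.issue_url}",
        f"*GHA run:* {result.gha_run_url}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    example = TriageResult(
        lane="gpu",
        summary="loss went NaN early in training",
        category="training crash",
        root_cause="illustrative root cause sentence",
        issue_url="https://example.com/issues/1",  # placeholder URL
        gha_run_url="https://example.com/runs/2",  # placeholder URL
    )
    print(render_slack_message(example))
```

Rejecting multi-line fields up front keeps the output within the "4 lines max" constraint regardless of what the diagnosis text contains.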