Add Claude triage step to canary ferry workflows #4177
Merged
04a1153 to 14ba131
On scheduled canary failures (TPU and GPU), Claude now runs before cluster teardown to diagnose the failure, file a GitHub issue, and produce a Slack summary that replaces the static one-liner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
14ba131 to b80aeb4
rjpower approved these changes Mar 26, 2026
ravwojdyla pushed a commit that referenced this pull request Mar 26, 2026
## Summary
When a scheduled canary ferry fails on main, Claude now triages the failure
before the cluster is torn down. It gathers diagnostics (kubectl/iris logs,
pod state, events), identifies the root cause, files a GitHub issue with
structured context, and writes a `slack_message.md` that the Slack step
picks up instead of the old static one-liner.
- New skill: `.agents/skills/canary-triage/SKILL.md`, a self-contained
  triage prompt (diagnosis only, no code changes)
- Both TPU and GPU canary workflows are updated with the same pattern
- The Claude step runs after "Capture failure diagnostics" and before
  cluster teardown (GPU) or the Slack step (TPU), so it has live cluster
  access
- The Slack step falls back to the original message if Claude didn't
  produce one
- 30-minute timeout, scheduled failures only
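
The Slack fallback described above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual implementation: the function name `choose_slack_message`, the `FALLBACK_MESSAGE` text, and the choice to treat an empty file as "not produced" are all assumptions. Only the `slack_message.md` filename comes from the PR.

```python
from pathlib import Path

# Hypothetical stand-in for the original static one-liner; the real text
# is not shown in this PR.
FALLBACK_MESSAGE = "Canary ferry failed. See the workflow run for details."


def choose_slack_message(workdir: Path, fallback: str = FALLBACK_MESSAGE) -> str:
    """Return Claude's summary if slack_message.md exists and has content.

    Otherwise return the static fallback, so the Slack step always has
    something to post even when triage timed out or wrote nothing.
    """
    candidate = workdir / "slack_message.md"
    if candidate.is_file():
        text = candidate.read_text(encoding="utf-8").strip()
        if text:  # assumption: a blank file counts as "no message produced"
            return text
    return fallback
```

Keeping the fallback in the Slack step itself, rather than in the triage step, means a triage timeout or crash cannot suppress the failure notification.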
## Test plan
- [ ] Trigger GPU canary via `workflow_dispatch` with a low `target_tokens`
  to force a metric validation failure; verify Claude files an issue and
  the Slack message includes root cause
- [ ] Same for TPU canary
- [ ] Verify a successful canary run is unaffected (Claude step is skipped)
- [ ] Verify manual `workflow_dispatch` runs skip the Claude step
  (`github.event_name != 'schedule'`)
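
The gating that the last two checks exercise can be modeled as a small predicate. This is a sketch under assumptions: `should_run_triage` and its parameters are hypothetical names, and the real condition lives in the workflow YAML, which is not reproduced here. It only encodes what the PR states, namely that the Claude step runs on failures of scheduled runs and is skipped otherwise.

```python
def should_run_triage(job_failed: bool, event_name: str) -> bool:
    """Model of the step gate: run only for failed, scheduled canary runs.

    `event_name` mirrors the value of `github.event_name` in the workflow,
    e.g. "schedule" or "workflow_dispatch".
    """
    return job_failed and event_name == "schedule"


# Cases corresponding to the checklist above.
cases = {
    "scheduled failure": should_run_triage(True, "schedule"),
    "scheduled success": should_run_triage(False, "schedule"),
    "manual failure": should_run_triage(True, "workflow_dispatch"),
}
```

Under this model, only the scheduled failure case evaluates to true; successful runs and manual `workflow_dispatch` runs both skip the step.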
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Helw150 pushed a commit that referenced this pull request Apr 8, 2026