Add Claude triage step to canary ferry workflows by yonromai · Pull Request #4177 · marin-community/marin

yonromai · 2026-03-26T17:08:56Z

Summary

When a scheduled canary ferry fails on main, Claude now triages the failure
before the cluster is torn down. It gathers diagnostics (kubectl/iris logs,
pod state, events), identifies the root cause, files a GitHub issue with
structured context, and writes a slack_message.md that the Slack step
picks up instead of the old static one-liner.

- New skill: `.agents/skills/canary-triage/SKILL.md`, a self-contained triage prompt (diagnosis only, no code changes)
- Both TPU and GPU canary workflows updated with the same pattern
- Claude step runs between "Capture failure diagnostics" and cluster teardown (GPU) / Slack (TPU), so it has live cluster access
- Slack step falls back to the original message if Claude didn't produce one
- 30-minute timeout, scheduled failures only
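
The Slack fallback described above can be sketched in a few lines. This is an illustration only, not the code in the workflow files: the function name `resolve_slack_message`, the `workdir` parameter, and the `DEFAULT_MESSAGE` text are hypothetical, and treating an empty or whitespace-only file as "not produced" is an assumption. Only the `slack_message.md` filename comes from the PR description.

```python
from pathlib import Path

# Hypothetical stand-in for the original static one-liner; the PR does not quote it.
DEFAULT_MESSAGE = "Canary ferry failed on main."


def resolve_slack_message(workdir: Path, default: str = DEFAULT_MESSAGE) -> str:
    """Return the triage-written Slack message, or the default if none was produced."""
    candidate = workdir / "slack_message.md"
    try:
        text = candidate.read_text(encoding="utf-8").strip()
    except (FileNotFoundError, IsADirectoryError):
        # The triage step was skipped, timed out, or failed before writing the file.
        return default
    # Assumption: an empty file is treated the same as a missing one.
    return text or default
```

Keeping the static message as the default means the Slack notification still fires when the triage step is skipped or hits its timeout.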

Test plan

- [ ] Trigger GPU canary via `workflow_dispatch` with a low `target_tokens` to force a metric validation failure; verify Claude files an issue and the Slack message includes root cause
- [ ] Same for TPU canary
- [ ] Verify a successful canary run is unaffected (Claude step is skipped)
- [ ] Verify manual `workflow_dispatch` runs skip the Claude step (`github.event_name != 'schedule'`)
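
The gating behaviour exercised by the last two checklist items can be written as a small predicate. This is a sketch under stated assumptions, not the workflow's actual `if:` expression: the function `should_run_triage` and its parameters are hypothetical, and it encodes only the two conditions named in the description (the job failed, and the run was triggered by `schedule`).

```python
def should_run_triage(job_failed: bool, event_name: str) -> bool:
    """Hypothetical gate: triage runs only for failed, scheduled canary runs."""
    return job_failed and event_name == "schedule"


# Enumerate the combinations the test plan cares about.
cases = [
    (True, "schedule"),            # scheduled failure: triage runs
    (False, "schedule"),           # successful canary: skipped
    (True, "workflow_dispatch"),   # manual run that fails: skipped
    (False, "workflow_dispatch"),  # manual run that succeeds: skipped
]
results = {case: should_run_triage(*case) for case in cases}
```

Under this reading, only the first case yields `True`.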

🤖 Generated with [Claude Code](https://claude.com/claude-code)

On scheduled canary failures (TPU and GPU), Claude now runs before cluster teardown to diagnose the failure, file a GitHub issue, and produce a Slack summary that replaces the static one-liner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

yonromai added the agent-generated label (Created by automation/agent) Mar 26, 2026

yonromai force-pushed the romain/canary-triage branch from 04a1153 to 14ba131 March 26, 2026 17:11

yonromai requested a review from rjpower March 26, 2026 17:18

yonromai force-pushed the romain/canary-triage branch from 14ba131 to b80aeb4 March 26, 2026 17:22

rjpower approved these changes Mar 26, 2026

yonromai merged commit ca1636c into main Mar 26, 2026
42 checks passed

yonromai deleted the romain/canary-triage branch March 26, 2026 17:29

yonromai merged 1 commit into main from romain/canary-triage
