
Add Claude triage step to canary ferry workflows#4177

Merged
yonromai merged 1 commit into main from romain/canary-triage on Mar 26, 2026

@yonromai
Contributor

Summary

When a scheduled canary ferry fails on main, Claude now triages the failure
before the cluster is torn down. It gathers diagnostics (kubectl/iris logs,
pod state, events), identifies the root cause, files a GitHub issue with
structured context, and writes a slack_message.md that the Slack step
picks up instead of the old static one-liner.

  • New skill: .agents/skills/canary-triage/SKILL.md — self-contained
    triage prompt (diagnosis only, no code changes)
  • Both TPU and GPU canary workflows updated with the same pattern
  • The Claude step runs after "Capture failure diagnostics" and before
    cluster teardown (GPU) or the Slack step (TPU), so it still has live
    cluster access
  • Slack step falls back to the original message if Claude didn't produce one
  • 30-minute timeout, scheduled failures only
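
The Slack fallback described above can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual code: the function name `resolve_slack_message` and the `FALLBACK_MESSAGE` text are hypothetical, and treating an empty file the same as a missing one is an assumption.

```python
from pathlib import Path

# Hypothetical fallback text; the PR does not quote the original one-liner.
FALLBACK_MESSAGE = "Canary ferry failed on main."


def resolve_slack_message(path: Path, fallback: str = FALLBACK_MESSAGE) -> str:
    """Return the triage-written message if it exists and is non-empty,
    otherwise return the static fallback."""
    try:
        text = path.read_text(encoding="utf-8").strip()
    except FileNotFoundError:
        # Claude did not produce a message (step skipped, timed out, or failed).
        return fallback
    # Assumption: a blank file counts as "no message produced".
    return text or fallback
```

Calling `resolve_slack_message(Path("slack_message.md"))` in the Slack step would then post the triage summary when one exists and the static message otherwise.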

Test plan

  • Trigger GPU canary via workflow_dispatch with a low target_tokens
    to force a metric validation failure; verify Claude files an issue and
    the Slack message includes root cause
  • Same for TPU canary
  • Verify a successful canary run is unaffected (Claude step is skipped)
  • Verify manual workflow_dispatch runs skip the Claude step
    (github.event_name != 'schedule')
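
The last two checks amount to a small truth table over the trigger and the job outcome. The sketch below assumes the step is gated on both a job failure and `github.event_name == 'schedule'`; the workflow's actual `if:` expression is not shown in this PR, and `should_run_triage` is a hypothetical name.

```python
def should_run_triage(event_name: str, job_failed: bool) -> bool:
    """Assumed gating rule: triage runs only for failed, schedule-triggered runs."""
    return job_failed and event_name == "schedule"


# The four combinations of trigger and outcome.
cases = [
    ("schedule", True),            # scheduled failure: triage runs
    ("schedule", False),           # scheduled success: skipped
    ("workflow_dispatch", True),   # manual failure: skipped
    ("workflow_dispatch", False),  # manual success: skipped
]
results = {case: should_run_triage(*case) for case in cases}
```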

🤖 Generated with Claude Code

yonromai added the agent-generated (Created by automation/agent) label on Mar 26, 2026
yonromai force-pushed the romain/canary-triage branch from 04a1153 to 14ba131 on March 26, 2026 17:11
yonromai requested a review from rjpower on March 26, 2026 17:18
On scheduled canary failures (TPU and GPU), Claude now runs before
cluster teardown to diagnose the failure, file a GitHub issue, and
produce a Slack summary that replaces the static one-liner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yonromai force-pushed the romain/canary-triage branch from 14ba131 to b80aeb4 on March 26, 2026 17:22
yonromai merged commit ca1636c into main on Mar 26, 2026
42 checks passed
yonromai deleted the romain/canary-triage branch on March 26, 2026 17:29
ravwojdyla pushed a commit that referenced this pull request Mar 26, 2026
## Summary

When a scheduled canary ferry fails on main, Claude now triages the failure
before the cluster is torn down. It gathers diagnostics (kubectl/iris logs,
pod state, events), identifies the root cause, files a GitHub issue with
structured context, and writes a `slack_message.md` that the Slack step
picks up instead of the old static one-liner.

- New skill: `.agents/skills/canary-triage/SKILL.md`, a self-contained
  triage prompt (diagnosis only, no code changes)
- Both TPU and GPU canary workflows updated with the same pattern
- Claude step runs between "Capture failure diagnostics" and cluster teardown
  (GPU) / Slack (TPU), so it has live cluster access
- Slack step falls back to the original message if Claude didn't produce one
- 30-minute timeout, scheduled failures only

## Test plan

- [ ] Trigger GPU canary via `workflow_dispatch` with a low `target_tokens`
      to force a metric validation failure; verify Claude files an issue and
      the Slack message includes root cause
- [ ] Same for TPU canary
- [ ] Verify a successful canary run is unaffected (Claude step is skipped)
- [ ] Verify manual `workflow_dispatch` runs skip the Claude step
      (`github.event_name != 'schedule'`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Helw150 pushed a commit that referenced this pull request Apr 8, 2026