Commit 1ef9f96

yoblin and claude committed
Add Slack alerts for canary workflow failures
Notify #oa-marin-eng when the daily GPU or TPU canary fails on schedule.
Uses slackapi/slack-github-action with the SLACK_WEBHOOK_URL secret.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent d7c4111 commit 1ef9f96

2 files changed

Lines changed: 18 additions & 0 deletions

.github/workflows/marin-canary-ferry-cw.yaml

Lines changed: 9 additions & 0 deletions
@@ -185,6 +185,15 @@ jobs:
       # `cluster stop` only deletes Pods; NodePools survive and rely on the
       # CW autoscaler to scale down. Delete them explicitly to avoid lingering
       # H100 costs.
+      - name: Notify Slack on failure
+        if: failure() && github.event_name == 'schedule'
+        uses: slackapi/slack-github-action@v2
+        with:
+          webhook: ${{ secrets.SLACK_WEBHOOK_URL }}
+          webhook-type: incoming-webhook
+          payload: |
+            text: ":red_circle: *GPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
+
       - name: Tear down CoreWeave cluster
         if: always()
         run: |

.github/workflows/marin-canary-ferry.yaml

Lines changed: 9 additions & 0 deletions
@@ -148,3 +148,12 @@ jobs:
           echo "=== Job list ==="
           .venv/bin/iris --config=${{ env.IRIS_CONFIG }} \
             job list --json 2>/dev/null | jq '.[0:5]' || true
+
+      - name: Notify Slack on failure
+        if: failure() && github.event_name == 'schedule'
+        uses: slackapi/slack-github-action@v2
+        with:
+          webhook: ${{ secrets.SLACK_WEBHOOK_URL }}
+          webhook-type: incoming-webhook
+          payload: |
+            text: ":red_circle: *TPU Canary failed*\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
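
For checking the message format outside of CI, the notification step can be approximated in Python. This is a minimal sketch, not part of the commit: the helper names `should_notify`, `build_payload`, and `post_to_slack` are hypothetical, `should_notify` only simplifies the `failure() && github.event_name == 'schedule'` condition, and it assumes the webhook is a standard Slack incoming webhook that accepts a JSON body with a `text` field.

```python
import json
import urllib.request


def should_notify(job_failed: bool, event_name: str) -> bool:
    """Simplified stand-in for `failure() && github.event_name == 'schedule'`."""
    return job_failed and event_name == "schedule"


def build_payload(canary: str, server_url: str, repository: str, run_id: str) -> dict:
    """Build the same text the workflow step sends, with a link to the run."""
    run_url = f"{server_url}/{repository}/actions/runs/{run_id}"
    return {"text": f":red_circle: *{canary} Canary failed*\nRun: {run_url}"}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload as JSON to an incoming webhook; returns the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder values; in the workflow these come from the `github` context.
    payload = build_payload("GPU", "https://github.com", "example-org/example-repo", "123")
    if should_notify(job_failed=True, event_name="schedule"):
        print(json.dumps(payload, indent=2))
```

Running the script only prints the payload. Calling `post_to_slack` with a real webhook URL would send it, so that call is left out of the demo on purpose.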
