Add Slack alerting for scheduled job failures (#10)
## Summary:
This PR adds Slack alerting functionality to the scheduled-job module. The alerting system uses Google Cloud Monitoring's native Slack webhook integration to send notifications when jobs or functions fail.
Key features:
- Slack alerting enabled by default for all scheduled jobs
- Monitors Cloud Function failures and Cloud Run Job failures
- Uses native Google Cloud Monitoring Slack webhook integration (no custom functions needed)
- Configurable alert channel
Issue: INFRA-10729
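As an illustration, a consumer of the module might enable alerting like this. This is a sketch only: the module path and variable names (`enable_slack_alerts`, `slack_alert_channel`) are assumptions for illustration, not the module's actual interface.

```hcl
module "culture_cron" {
  source = "../modules/scheduled-job"

  job_name = "culture-cron"

  # Slack alerting is on by default; these variable names are hypothetical.
  enable_slack_alerts = true
  slack_alert_channel = "#infra-alerts"
}
```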
## Test plan:
- Deploy the culture-cron with alerting enabled
- Verify monitoring policies are created in Google Cloud Console
- Test by simulating a job failure and confirming Slack notifications are received
See the alert in the monitoring dashboard [here](https://console.cloud.google.com/monitoring/alerting/policies/15420096590437615790?project=khan-internal-services). A Slack message sent by this alert can be found [here](https://khanacademy.slack.com/archives/C090KRE5P/p1760116927376419).
Author: jwbron
Reviewers: copilot-pull-request-reviewer[bot], csilvers
Required Reviewers:
Approved By: csilvers
Checks: ✅ 3 checks were successful, ⏭️ 1 check has been skipped
Pull Request URL: #10
The module supports optional Slack alerting for job failures. When enabled, it creates:

- **Monitoring policies**: Cloud Monitoring alert policies for different failure scenarios
- **Slack notification channel**: Direct integration with Slack using the Slack API token from Secret Manager

**Note**: The module automatically fetches the Slack API token from Secret Manager in the `khan-academy` project (secret: `Slack__API_token_for_alertlib`). Ensure your Terraform service account has access to read this secret.
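The token lookup described in the note could be sketched roughly as follows. Only the project and secret name come from the note above; the resource names and the `var.slack_alert_channel` variable are illustrative assumptions.

```hcl
# Read the shared Slack token from the khan-academy project.
data "google_secret_manager_secret_version" "slack_token" {
  project = "khan-academy"
  secret  = "Slack__API_token_for_alertlib"
}

# Native Slack notification channel in Cloud Monitoring.
resource "google_monitoring_notification_channel" "slack" {
  display_name = "${var.job_name} Slack alerts" # hypothetical naming
  type         = "slack"

  labels = {
    channel_name = var.slack_alert_channel # hypothetical variable
  }

  sensitive_labels {
    auth_token = data.google_secret_manager_secret_version.slack_token.secret_data
  }
}
```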
The Cloud Function failure policy looks like this in the diff (non-adjacent hunks; some lines between them are elided):

```hcl
display_name = "${var.job_name} Function Failure Alert"
combiner     = "OR"
enabled      = true

alert_strategy {
  auto_close = "86400s" # Auto-close after 24 hours if condition is no longer met
}

conditions {
  display_name = "${var.job_name} function execution failure"

  condition_threshold {
    filter = "resource.type=\"cloud_function\" AND resource.labels.function_name=\"${var.job_name}\" AND metric.type=\"cloudfunctions.googleapis.com/function/execution_count\" AND metric.labels.status!=\"ok\""
    # … (threshold settings elided in the diff)
  }
}

documentation {
  content   = "The Cloud Function ${var.job_name} has failed to execute. Check the function logs for more details.\n\n[View Function in Console](${local.function_console_url})${local.slack_cc_mention}"
  mime_type = "text/markdown"
}
```

The Cloud Run Job policy uses the same structure with a different filter and message:

```hcl
# Monitoring policy for Cloud Run Job failures (when execution_type is "job")
filter = "resource.type=\"cloud_run_job\" AND resource.labels.job_name=\"${var.job_name}\" AND metric.type=\"run.googleapis.com/job/completed_execution_count\" AND metric.labels.result!=\"succeeded\""

content = "The Cloud Run Job ${var.job_name} has failed to execute or complete successfully. Check the job logs for more details.\n\n[View Job in Console](${local.job_console_url})${local.slack_cc_mention}"
```
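For completeness, each alert policy would attach the Slack channel through `notification_channels`; a minimal sketch, assuming a channel resource named `slack` (the name is not from the diff):

```hcl
notification_channels = [google_monitoring_notification_channel.slack.id]
```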