Commit 509c821

Add Slack alerting for scheduled job failures (#10)
## Summary:

This PR adds Slack alerting functionality to the scheduled-job module. The alerting system uses Google Cloud Monitoring's native Slack webhook integration to send notifications when jobs or functions fail.

Key features:

- Slack alerting enabled by default for all scheduled jobs
- Monitors Cloud Function failures and Cloud Run Job failures
- Uses native Google Cloud Monitoring Slack webhook integration (no custom functions needed)
- Configurable alert channel

Issue: INFRA-10729

## Test plan:

- Deploy the culture-cron with alerting enabled
- Verify monitoring policies are created in Google Cloud Console
- Test by simulating a job failure and confirming Slack notifications are received

See the alert in the monitoring dashboard [here](https://console.cloud.google.com/monitoring/alerting/policies/15420096590437615790?project=khan-internal-services). A slack message sent by this alert can be found [here](https://khanacademy.slack.com/archives/C090KRE5P/p1760116927376419).

Author: jwbron

Reviewers: copilot-pull-request-reviewer[bot], csilvers

Required Reviewers:

Approved By: csilvers

Checks: ✅ 3 checks were successful, ⏭️ 1 check has been skipped

Pull Request URL: #10
1 parent 7da73b7 commit 509c821

8 files changed: +299 -3 lines changed

terraform/modules/github-ci-bootstrap/README.md

Lines changed: 2 additions & 1 deletion
@@ -266,7 +266,8 @@ With dual service accounts, you can also conditionally run different Terraform o
         run: terraform plan

       - name: Terraform Apply (Write-Enabled Branches Only)
-        if: contains(fromJSON('["refs/heads/main", "refs/heads/master"]'), github.ref)
+        if: contains(fromJSON('["refs/heads/main", "refs/heads/master"]'),
+          github.ref)
         run: terraform apply -auto-approve
 ```

terraform/modules/scheduled-job/README.md

Lines changed: 85 additions & 0 deletions
@@ -13,6 +13,7 @@ Creates a complete scheduled setup:
 - Storage bucket with lifecycle management
 - Secret Manager IAM bindings
 - Source code change detection
+- **Slack alerting** for job failures (optional)

 ## Quick Start

@@ -88,6 +89,10 @@ module "my_data_processor" {
       version = "latest"
     }
   ]
+
+  # Enable Slack alerting for job failures (enabled by default)
+  slack_channel       = "#channel-name"
+  slack_mention_users = ["@group-or-user"]
 }
 ```

@@ -253,6 +258,13 @@ module "data_processor" {
 - `job_args` - Command arguments ([])
 - `job_image` - Container image URL (required)

+### Alerting (optional)
+
+- `enable_alerting` - Whether to enable alerting for job failures (true)
+- `slack_channel` - Slack channel to send notifications to (e.g., "#1s-and-0s") (required when alerting enabled)
+- `slack_mention_users` - List of Slack users or groups to mention in alerts (e.g., ["@user", "@group"]) ([])
+- `alert_project_id` - GCP project ID where monitoring and alerting resources will be created (defaults to project_id) (null)
+
 ## Outputs

 - `resource_name` - Name of deployed function or job
@@ -263,6 +275,11 @@ module "data_processor" {
 - `storage_bucket_name` - Storage bucket name
 - `execution_type` - The execution type used

+### Alerting Outputs (when `enable_alerting = true`)
+
+- `monitoring_notification_channel_name` - Name of the monitoring notification channel
+- `alert_policy_names` - Names of the monitoring alert policies
+
 ## Repository Structure

 ```
@@ -404,6 +421,74 @@ Or use Cloud Build directly:
 gcloud builds submit --tag gcr.io/YOUR_PROJECT_ID/YOUR_JOB_NAME:latest ./jobs/your-job
 ```

+## Alerting
+
+The module supports optional Slack alerting for job failures. When enabled, it creates:
+
+- **Monitoring policies**: Cloud Monitoring alert policies for different failure scenarios
+- **Slack notification channel**: Direct integration with Slack using the Slack API token from Secret Manager
+
+**Note**: The module automatically fetches the Slack API token from Secret Manager in the `khan-academy` project (secret: `Slack__API_token_for_alertlib`). Ensure your Terraform service account has access to read this secret.
+
+### Enabling Alerting
+
+```hcl
+module "my_job_with_alerts" {
+  source = "git::https://github.com/Khan/terraform-modules.git//terraform/modules/scheduled-job?ref=v1.0.0"
+
+  # ... other configuration ...
+
+  # Alerting is enabled by default
+  slack_channel       = "#my-team-channel"
+  slack_mention_users = ["@oncall", "@team-leads"] # Optional: users/groups to mention
+
+  # Optional: Use different project for alerting resources
+  alert_project_id = "my-monitoring-project"
+}
+```
+
+### What Gets Monitored
+
+When alerting is enabled, the module creates monitoring policies for:
+
+1. **Cloud Function failures** (when `execution_type = "function"`)
+
+   - Monitors `cloudfunctions.googleapis.com/function/execution_count` with `status="error"`
+   - Alerts immediately when any function execution fails
+   - Includes direct link to function logs in console
+
+2. **Cloud Run Job failures** (when `execution_type = "job"`)
+
+   - Monitors `run.googleapis.com/job/completed_task_attempt_count` and `failed_task_attempt_count`
+   - Alerts when tasks fail or don't complete within expected time
+   - Includes direct link to job logs in console
+
+### Slack Message Format
+
+The Slack notifications include:
+
+- Alert description with job/function name
+- Direct link to GCP Console logs for troubleshooting
+- Optional CC mentions for users/groups (configured via `slack_mention_users`)
+- Markdown-formatted for readability
+
+Example alert message:
+
+```
+The Cloud Function my-function has failed to execute. Check the function logs for more details.
+
+[View Function in Console](https://console.cloud.google.com/...)
+
+CC: @oncall @team-leads
+```
+
+### Security
+
+- Slack API token is fetched from Secret Manager in the `khan-academy` project
+- Token is stored securely in the monitoring notification channel's sensitive labels
+- All alerting resources are created in the specified project (or same project as the job)
+- Requires Secret Manager read permissions on the `Slack__API_token_for_alertlib` secret
+
 ## Common Cron Patterns

 | Schedule | Description |

terraform/modules/scheduled-job/examples/simple-job/README.md

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@ This example demonstrates how to use the scheduled-job module to create a Cloud
99
- Service account with appropriate permissions
1010
- Container image built automatically using Cloud Build
1111
- Secret Manager IAM bindings
12+
- Slack alerting for job failures (enabled by default)
1213

1314
## Key differences from Cloud Functions
1415

@@ -25,6 +26,7 @@ This example demonstrates how to use the scheduled-job module to create a Cloud
2526
```bash
2627
export TF_VAR_project_id="your-gcp-project"
2728
export TF_VAR_secrets_project_id="your-secrets-project"
29+
export TF_VAR_slack_channel="#my-team-channel"
2830
```
2931

3032
2. Initialize and apply:
@@ -71,6 +73,10 @@ module "daily_data_processor" {
7173
job_command = ["python", "processor.py"]
7274
job_args = [] # Additional arguments if needed
7375
76+
# Alerting is enabled by default
77+
slack_channel = var.slack_channel
78+
slack_mention_users = ["@oncall"] # Optional
79+
7480
# ... other configuration
7581
}
7682
```
@@ -92,3 +98,17 @@ The job code in `job-code/processor.py` is a simple Python script that:
9298
- **Branch-based Caching**: Cloud Build caches layers based on branch names for faster builds.
9399
- Jobs are triggered via HTTP calls to the Cloud Run Jobs API, not via PubSub like Cloud Functions.
94100
- Jobs can run for longer periods and have more resources than Cloud Functions.
101+
102+
## Alerting
103+
104+
This example includes Slack alerting for job failures by default. The alerting system:
105+
106+
- Monitors job execution failures and task completion issues
107+
- Sends notifications to your specified Slack channel
108+
- Uses Slack API token from Secret Manager (`khan-academy` project)
109+
- Provides detailed failure information with direct links to logs
110+
- Supports mentioning specific users/groups via `slack_mention_users`
111+
112+
**Note**: Requires read access to the `Slack__API_token_for_alertlib` secret in the `khan-academy` project.
113+
114+
To disable alerting, set `enable_alerting = false` in the module configuration.

terraform/modules/scheduled-job/examples/simple-job/main.tf

Lines changed: 16 additions & 0 deletions
@@ -69,6 +69,13 @@ module "daily_data_processor" {
       version = "latest"
     }
   ]
+
+  # Alerting is enabled by default
+  slack_channel       = var.slack_channel
+  slack_mention_users = ["@oncall"] # Optional: mention specific users/groups
+
+  # Optional: Use different project for alerting resources
+  alert_project_id = var.alert_project_id
 }

 # Output the job details
@@ -91,3 +98,12 @@ output "image_info" {
     image_tag = module.daily_data_processor_image.image_tag
   }
 }
+
+# Output alerting information
+output "alerting_info" {
+  description = "Information about the alerting setup"
+  value = {
+    monitoring_notification_channel_name = module.daily_data_processor.monitoring_notification_channel_name
+    alert_policy_names                   = module.daily_data_processor.alert_policy_names
+  }
+}

terraform/modules/scheduled-job/examples/simple-job/variables.tf

Lines changed: 11 additions & 0 deletions
@@ -13,3 +13,14 @@ variable "region" {
   type        = string
   default     = "us-central1"
 }
+
+variable "slack_channel" {
+  description = "Slack channel to send notifications to (e.g., '#my-team-channel')"
+  type        = string
+}
+
+variable "alert_project_id" {
+  description = "GCP project ID where monitoring and alerting resources will be created (optional, defaults to project_id)"
+  type        = string
+  default     = null
+}

terraform/modules/scheduled-job/main.tf

Lines changed: 126 additions & 1 deletion
@@ -284,4 +284,129 @@ resource "google_cloud_scheduler_job" "job_scheduler" {
       scope = "https://www.googleapis.com/auth/cloud-platform"
     }
   }
-}
+}
+
+# Alerting resources (only created when enable_alerting is true)
+
+# Fetch Slack API token from Secret Manager
+data "google_secret_manager_secret_version" "slack_token" {
+  count = var.enable_alerting ? 1 : 0
+
+  project = "khan-academy"
+  secret  = "Slack__API_token_for_alertlib"
+}
+
+locals {
+  alert_project_id = var.alert_project_id != null ? var.alert_project_id : var.project_id
+  slack_auth_token = var.enable_alerting ? data.google_secret_manager_secret_version.slack_token[0].secret_data : null
+  slack_cc_mention = length(var.slack_mention_users) > 0 ? "\n\nCC: ${join(" ", var.slack_mention_users)}" : ""
+
+  # Console URLs for functions and jobs
+  function_console_url = "https://console.cloud.google.com/run/detail/${var.region}/${var.job_name}/observability/logs?project=${var.project_id}"
+  job_console_url      = "https://console.cloud.google.com/run/jobs/detail/${var.region}/${var.job_name}/observability/logs?project=${var.project_id}"
+}
+
+# Monitoring notification channel for Slack
+resource "google_monitoring_notification_channel" "slack_channel" {
+  count = var.enable_alerting ? 1 : 0
+
+  project      = local.alert_project_id
+  display_name = "${var.job_name} Slack Alerts"
+  type         = "slack"
+
+  labels = {
+    channel_name = var.slack_channel
+  }
+
+  sensitive_labels {
+    auth_token = local.slack_auth_token
+  }
+}
+
+# Monitoring policy for Cloud Function failures (when execution_type is "function")
+resource "google_monitoring_alert_policy" "function_failure" {
+  count = var.enable_alerting && var.execution_type == "function" ? 1 : 0
+
+  project      = local.alert_project_id
+  display_name = "${var.job_name} Function Failure Alert"
+  combiner     = "OR"
+  enabled      = true
+
+  alert_strategy {
+    auto_close = "86400s" # Auto-close after 24 hours if condition is no longer met
+  }
+
+  conditions {
+    display_name = "${var.job_name} function execution failure"
+
+    condition_threshold {
+      filter = "resource.type=\"cloud_function\" AND resource.labels.function_name=\"${var.job_name}\" AND metric.type=\"cloudfunctions.googleapis.com/function/execution_count\" AND metric.labels.status!=\"ok\""
+
+      comparison      = "COMPARISON_GT"
+      threshold_value = 0
+
+      duration = "60s"
+
+      aggregations {
+        alignment_period   = "60s"
+        per_series_aligner = "ALIGN_DELTA"
+        group_by_fields    = ["resource.service_name"]
+      }
+
+      trigger {
+        count = 1
+      }
+    }
+  }
+
+  notification_channels = [google_monitoring_notification_channel.slack_channel[0].name]
+
+  documentation {
+    content   = "The Cloud Function ${var.job_name} has failed to execute. Check the function logs for more details.\n\n[View Function in Console](${local.function_console_url})${local.slack_cc_mention}"
+    mime_type = "text/markdown"
+  }
+}
+
+# Monitoring policy for Cloud Run Job failures (when execution_type is "job")
+resource "google_monitoring_alert_policy" "job_failure" {
+  count = var.enable_alerting && var.execution_type == "job" ? 1 : 0
+
+  project      = local.alert_project_id
+  display_name = "${var.job_name} Job Failure Alert"
+  combiner     = "OR"
+  enabled      = true
+
+  alert_strategy {
+    auto_close = "86400s" # Auto-close after 24 hours if condition is no longer met
+  }
+
+  conditions {
+    display_name = "${var.job_name} job execution failure"
+
+    condition_threshold {
+      filter = "resource.type=\"cloud_run_job\" AND resource.labels.job_name=\"${var.job_name}\" AND metric.type=\"run.googleapis.com/job/completed_execution_count\" AND metric.labels.result!=\"succeeded\""
+
+      comparison      = "COMPARISON_GT"
+      threshold_value = 0
+
+      duration = "60s"
+
+      aggregations {
+        alignment_period   = "60s"
+        per_series_aligner = "ALIGN_DELTA"
+        group_by_fields    = ["resource.service_name"]
+      }
+
+      trigger {
+        count = 1
+      }
+    }
+  }
+
+  notification_channels = [google_monitoring_notification_channel.slack_channel[0].name]
+
+  documentation {
+    content   = "The Cloud Run Job ${var.job_name} has failed to execute or complete successfully. Check the job logs for more details.\n\n[View Job in Console](${local.job_console_url})${local.slack_cc_mention}"
+    mime_type = "text/markdown"
+  }
+}
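
The `locals` block in this file derives two values that shape every alert: the project that holds the monitoring resources and the optional `CC:` suffix appended to the alert documentation. The following is a minimal stand-alone sketch of just those two expressions, not the module itself. The default values (`example-project`, `@oncall`, `@team-leads`) are placeholders chosen for illustration, and the file needs no provider, so it can be evaluated with a plain `terraform apply`.

```hcl
# Stand-alone sketch: variable names mirror the module, defaults are placeholders.
variable "project_id" {
  type    = string
  default = "example-project" # placeholder, not a real project
}

variable "alert_project_id" {
  type    = string
  default = null
}

variable "slack_mention_users" {
  type    = list(string)
  default = ["@oncall", "@team-leads"]
}

locals {
  # Use alert_project_id when it is set, otherwise fall back to project_id.
  alert_project_id = var.alert_project_id != null ? var.alert_project_id : var.project_id

  # An empty list yields an empty suffix; otherwise a blank line followed by
  # "CC: " and the mentions joined with single spaces.
  slack_cc_mention = length(var.slack_mention_users) > 0 ? "\n\nCC: ${join(" ", var.slack_mention_users)}" : ""
}

output "alert_project_id" {
  value = local.alert_project_id
}

output "slack_cc_mention" {
  value = local.slack_cc_mention
}
```

With the defaults above, `alert_project_id` falls back to `example-project` and the suffix ends in `CC: @oncall @team-leads`. Setting `slack_mention_users = []` produces an empty suffix, so the documentation string ends right after the console link.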

terraform/modules/scheduled-job/outputs.tf

Lines changed: 15 additions & 1 deletion
@@ -53,4 +53,18 @@ output "region" {
 output "execution_type" {
   description = "The execution type used (function or job)"
   value       = var.execution_type
-}
+}
+
+# Alerting outputs
+output "monitoring_notification_channel_name" {
+  description = "Name of the monitoring notification channel (when alerting is enabled)"
+  value       = var.enable_alerting ? google_monitoring_notification_channel.slack_channel[0].name : null
+}
+
+output "alert_policy_names" {
+  description = "Names of the monitoring alert policies (when alerting is enabled)"
+  value = var.enable_alerting ? {
+    function_failure = var.execution_type == "function" ? google_monitoring_alert_policy.function_failure[0].display_name : null
+    job_failure      = var.execution_type == "job" ? google_monitoring_alert_policy.job_failure[0].display_name : null
+  } : null
+}
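
Because `alert_policy_names` always carries both keys and sets the one that does not match `execution_type` to `null`, a caller that only wants the policies that exist has to filter the object. The sketch below shows one way to do that, under the assumption that the caller has a job-type deployment. The `alert_policy_names` local is a hand-written stand-in for the module output, and `my-job` is a hypothetical job name.

```hcl
locals {
  # Stand-in for module.<name>.alert_policy_names when execution_type = "job".
  alert_policy_names = {
    function_failure = null
    job_failure      = "my-job Job Failure Alert"
  }

  # Keep only the display names of policies that were actually created.
  active_alert_policies = [
    for key, name in local.alert_policy_names : name if name != null
  ]
}

output "active_alert_policies" {
  value = local.active_alert_policies
}
```

When `enable_alerting = false` the whole output is `null` rather than an object, so a real caller would need to guard against that case before iterating.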

terraform/modules/scheduled-job/variables.tf

Lines changed: 24 additions & 0 deletions
@@ -194,3 +194,27 @@ variable "job_image" {
   type        = string
   default     = null
 }
+
+# Alerting configuration
+variable "enable_alerting" {
+  description = "Whether to enable alerting for job failures"
+  type        = bool
+  default     = true
+}
+
+variable "slack_channel" {
+  description = "Slack channel to send notifications to (e.g., '#1s-and-0s')"
+  type        = string
+}
+
+variable "slack_mention_users" {
+  description = "List of Slack users or groups to mention in alerts (e.g., ['@user', '@group'])"
+  type        = list(string)
+  default     = []
+}
+
+variable "alert_project_id" {
+  description = "GCP project ID where monitoring and alerting resources will be created (defaults to project_id)"
+  type        = string
+  default     = null
+}
