Merged
Changes from 3 commits
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,13 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.3.0] - 2025-10-07

[Compare with previous version](https://github.com/sparkfabrik/terraform-google-services-monitoring/compare/0.2.0...0.3.0)

### Changed

- Add Kyverno error log alerts.
- Update module documentation.

## [0.2.0] - 2024-10-17
4 changes: 3 additions & 1 deletion Makefile
@@ -1,3 +1,5 @@
TERRAFORM_DOCS_VERSION ?= 0.20.0

.PHONY: lint tfscan generate-docs

lint:
@@ -10,4 +12,4 @@ generate-docs: lint
docker run --rm -u $$(id -u) \
--volume "$(PWD):/terraform-docs" \
-w /terraform-docs \
- quay.io/terraform-docs/terraform-docs:0.16.0 markdown table --config .terraform-docs.yml --output-file README.md --output-mode inject .
+ quay.io/terraform-docs/terraform-docs:$(TERRAFORM_DOCS_VERSION) markdown table --config .terraform-docs.yml --output-file README.md --output-mode inject .
18 changes: 13 additions & 5 deletions README.md
@@ -5,10 +5,16 @@ This module creates a set of monitoring alerts for Google Cloud Platform service
Supported services:

- Cloud SQL

- CPU usage
- Storage usage
- Memory usage

- Kyverno

- Error logs for admission-controller, background-controller, cleanup-controller, reports-controller
- Metric threshold (optional)
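
A minimal usage sketch of the services listed above, assuming the module is consumed straight from its GitHub repository. The `source` address and `ref` are assumptions derived from the repository URL in the changelog, and the project ID, instance name, and cluster name are hypothetical placeholders.

```hcl
# Sketch only: source/ref are assumptions, all names below are placeholders.
module "services_monitoring" {
  source = "github.com/sparkfabrik/terraform-google-services-monitoring?ref=0.3.0"

  project_id            = "my-project" # hypothetical project ID
  notification_channels = []           # module-level fallback channels

  # cloud_sql is a required input, but every attribute inside it is optional.
  cloud_sql = {
    instances = {
      "my-sql-instance" = {} # hypothetical instance, default CPU/memory/disk alerts
    }
  }

  # kyverno is a required input; only cluster_name has no default.
  kyverno = {
    cluster_name = "my-cluster" # hypothetical GKE cluster name
  }
}
```

Because `use_metric_threshold` defaults to `false`, this configuration would create the log-match alert for Kyverno rather than the metric-threshold alert.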

<!-- BEGIN_TF_DOCS -->
## Providers

@@ -27,10 +33,10 @@ Supported services:

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_auto_close"></a> [auto\_close](#input\_auto\_close) | n/a | `string` | `"86400s"` | no |
| <a name="input_cloud_sql"></a> [cloud\_sql](#input\_cloud\_sql) | n/a | <pre>object({<br> project = optional(string, null)<br> auto_close = optional(string, null)<br> notification_channels = optional(list(string), [])<br> instances = optional(map(object({<br> cpu_utilization = optional(list(object({<br> severity = optional(string, "WARNING"),<br> threshold = optional(number, 0.90)<br> alignment_period = optional(string, "120s")<br> duration = optional(string, "300s")<br> })), [<br> {<br> threshold = 0.85,<br> duration = "1200s",<br> },<br> {<br> severity = "CRITICAL",<br> threshold = 1,<br> duration = "300s",<br> alignment_period = "60s",<br> }<br> ])<br> memory_utilization = optional(list(object({<br> severity = optional(string, "WARNING"),<br> threshold = optional(number, 0.90)<br> alignment_period = optional(string, "300s")<br> duration = optional(string, "300s")<br> })), [<br> {<br> severity = "WARNING",<br> },<br> {<br> severity = "CRITICAL",<br> threshold = 0.95,<br> }<br> ])<br> disk_utilization = optional(list(object({<br> severity = optional(string, "WARNING"),<br> threshold = optional(number, 0.85)<br> alignment_period = optional(string, "300s")<br> duration = optional(string, "600s")<br> })), [<br> {<br> severity = "WARNING",<br> },<br> {<br> severity = "CRITICAL",<br> threshold = 0.95, <br> }<br> ])<br> })), {})<br> })</pre> | n/a | yes |
| <a name="input_notification_channels"></a> [notification\_channels](#input\_notification\_channels) | n/a | `list(string)` | `[]` | no |
| <a name="input_project"></a> [project](#input\_project) | n/a | `string` | `null` | no |
| <a name="input_cloud_sql"></a> [cloud\_sql](#input\_cloud\_sql) | Configuration for Cloud SQL monitoring alerts. Supports customization of project, auto-close timing, notification channels, and per-instance alert thresholds for CPU, memory, and disk utilization. | <pre>object({<br/> project_id = optional(string, null)<br/> auto_close = optional(string, "86400s") # default 24h<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> instances = optional(map(object({<br/> cpu_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.90)<br/> alignment_period = optional(string, "120s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> threshold = 0.85,<br/> duration = "1200s",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 1,<br/> duration = "300s",<br/> alignment_period = "60s",<br/> }<br/> ])<br/> memory_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.90)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> severity = "WARNING",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.95,<br/> }<br/> ])<br/> disk_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.85)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "600s")<br/> })), [<br/> {<br/> severity = "WARNING",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.95,<br/> }<br/> ])<br/> })), {})<br/> })</pre> | n/a | yes |
| <a name="input_kyverno"></a> [kyverno](#input\_kyverno) | Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, extra filters, and namespace. | <pre>object({<br/> cluster_name = string<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> # Rate limit for notifications, e.g. "300s" for 5 minutes, used only for log match alerts<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> # If true, use a metric threshold alert instead of log match alert otherwise use log match alert<br/> use_metric_threshold = optional(bool, false)<br/> metric_threshold_count = optional(number, 2)<br/> metric_lookback_minutes = optional(number, 1)<br/> auto_close_seconds = optional(number, 3600)<br/> enabled = optional(bool, true)<br/> filter_extra = optional(string, "")<br/> namespace = optional(string, "kyverno")<br/> })</pre> | n/a | yes |
| <a name="input_notification_channels"></a> [notification\_channels](#input\_notification\_channels) | List of notification channel IDs to notify when an alert is triggered | `list(string)` | `[]` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | The Google Cloud project ID where logging exclusions will be created | `string` | n/a | yes |

## Outputs

@@ -44,13 +50,15 @@ Supported services:

| Name | Type |
|------|------|
| [google_logging_metric.kyverno_error_metric](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/logging_metric) | resource |
| [google_monitoring_alert_policy.cloud_sql_cpu_utilization](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.cloud_sql_disk_utilization](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.cloud_sql_memory_utilization](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.kyverno_logmatch_alert](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.kyverno_metric_threshold_alert](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |

## Modules

No modules.


<!-- END_TF_DOCS -->
25 changes: 11 additions & 14 deletions cloud-sql.tf
@@ -3,13 +3,10 @@
# ----------------------
locals {
# Use the cloud_sql project if specified, otherwise use the project.
- cloud_sql_project = var.cloud_sql.project != null ? var.cloud_sql.project : var.project
+ cloud_sql_project = var.cloud_sql.project_id != null ? var.cloud_sql.project_id : var.project_id

# Use the cloud_sql notification channels if specified, otherwise fall back to the module-level notification channels.
- cloud_sql_notification_channels = length(var.cloud_sql.notification_channels) > 0 ? var.cloud_sql.notification_channels : var.notification_channels
-
- # Use the cloud_sql auto_close if specified, otherwise use the auto_close.
- cloud_sql_auto_close = var.cloud_sql.auto_close != null ? var.cloud_sql.auto_close : var.auto_close
+ cloud_sql_notification_channels = var.cloud_sql.notification_enabled ? (length(var.cloud_sql.notification_channels) > 0 ? var.cloud_sql.notification_channels : var.notification_channels) : []

cloud_sql_cpu_utilization = {
for item in flatten(
@@ -22,7 +19,7 @@ locals {
},
cpu_utilization
)
]
]
]
) : "${item.instance}--${item.severity}--${item.threshold}" => item
}
@@ -38,10 +35,10 @@ locals {
},
memory_utilization
)
]
]
]
) : "${item.instance}--${item.severity}--${item.threshold}" => item
}
}

cloud_sql_disk_utilization = {
for item in flatten(
@@ -54,10 +51,10 @@ locals {
},
disk_utilization
)
]
]
]
) : "${item.instance}--${item.severity}--${item.threshold}" => item
}
}
}

# ----------------------
@@ -67,7 +64,7 @@ resource "google_monitoring_alert_policy" "cloud_sql_cpu_utilization" {
for_each = local.cloud_sql_cpu_utilization

display_name = "${local.cloud_sql_project} ${each.value.instance} - CPU utilization ${each.value.severity} ${each.value.threshold * 100}%"
combiner = "OR"
combiner = "OR"
severity = each.value.severity

conditions {
@@ -87,7 +84,7 @@ resource "google_monitoring_alert_policy" "cloud_sql_cpu_utilization" {
display_name = "${local.cloud_sql_project} ${each.value.instance} - CPU utilization ${each.value.severity} ${each.value.threshold * 100}%"
}
alert_strategy {
- auto_close = local.cloud_sql_auto_close
+ auto_close = var.cloud_sql.auto_close
}
notification_channels = local.cloud_sql_notification_channels
}
@@ -117,7 +114,7 @@ resource "google_monitoring_alert_policy" "cloud_sql_memory_utilization" {
}

alert_strategy {
- auto_close = local.cloud_sql_auto_close
+ auto_close = var.cloud_sql.auto_close
}

notification_channels = local.cloud_sql_notification_channels
@@ -149,7 +146,7 @@ resource "google_monitoring_alert_policy" "cloud_sql_disk_utilization" {
}

alert_strategy {
- auto_close = local.cloud_sql_auto_close
+ auto_close = var.cloud_sql.auto_close
}
notification_channels = local.cloud_sql_notification_channels
}
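
The channel-selection expression introduced in this file can be read in isolation. The following is a standalone sketch, not part of the module: the variable names and the channel ID are hypothetical, and only the shape of the conditional mirrors `cloud_sql_notification_channels`.

```hcl
# Standalone sketch of the notification-channel fallback; all names are hypothetical.
variable "notification_enabled" {
  type    = bool
  default = true
}

variable "service_channels" {
  type    = list(string)
  default = [] # per-service channels, empty means "not specified"
}

variable "module_channels" {
  type    = list(string)
  default = ["projects/my-project/notificationChannels/123"] # hypothetical ID
}

locals {
  # 1. Notifications disabled    -> no channels at all.
  # 2. Service channels provided -> use them.
  # 3. Otherwise                 -> fall back to the module-level channels.
  effective_channels = var.notification_enabled ? (
    length(var.service_channels) > 0 ? var.service_channels : var.module_channels
  ) : []
}

output "effective_channels" {
  value = local.effective_channels
}
```

With the defaults above the output is the module-level list; setting `notification_enabled = false` yields an empty list regardless of the other two variables.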
26 changes: 18 additions & 8 deletions examples/main.tf
@@ -4,12 +4,12 @@

locals {
# Enable all Cloud SQL monitoring on selected instances, e.g.
cloud_sql = {
instances = {
(google_sql_database_instance.master.name) = {}
cloud_sql = {
instances = {
(google_sql_database_instance.master.name) = {}
(google_sql_database_instance.stage.name) = {}
}
}
}
}

# Use custom Cloud SQL cpu monitoring on google_sql_database_instance.master.name
# Use all default Cloud SQL monitoring on google_sql_database_instance.stage.name
@@ -35,7 +35,7 @@ locals {
# cloud_sql = {
# instances = {
# (google_sql_database_instance.master.stage) = { cpu_utilization = [] }
# (google_sql_database_instance.master.prod) = {}
# (google_sql_database_instance.master.prod) = {}
# }
# }

@@ -46,6 +46,16 @@ module "example" {
version = ">= 0.1.0"

notification_channels = var.notification_channels
- project = var.project
- cloud_sql = local.cloud_sql
+ project_id = var.project_id
+ cloud_sql = local.cloud_sql
+ kyverno = {
+ cluster_name = "test-cluster"
+ enabled = true
+ use_metric_threshold = true
+ metric_threshold_count = 5
+ notification_channels = []
+ # Optional filter for log entries, exclude known non-actionable messages
+ # e.g., "-textPayload:\"stale GroupVersion discovery: metrics.k8s.io/v1beta1\""
+ filter_extra = "-textPayload:\"stale GroupVersion discovery: metrics.k8s.io/v1beta1\""
+ }
}
3 changes: 1 addition & 2 deletions examples/test.tfvars
@@ -1,5 +1,4 @@
- project = "Simple project"
-
+ project_id = "simple-project"
notification_channels = [
"cloud_support_email",
"slack-channel"
29 changes: 24 additions & 5 deletions examples/variables.tf
@@ -1,10 +1,29 @@

variable "project" {
type = string
default = ""
variable "project_id" {
description = "The Google Cloud project ID where logging exclusions will be created"
type = string
}

variable "notification_channels" {
type = list(string)
default = []
description = "List of notification channel IDs to notify when an alert is triggered"
type = list(string)
default = []
}

variable "kyverno" {
description = "Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, extra filters, and namespace."
type = object({
cluster_name = string
project_id = optional(string, null)
notification_channels = optional(list(string), [])
logmatch_notification_rate_limit = optional(string, "300s")
alert_documentation = optional(string, null)
use_metric_threshold = optional(bool, false)
metric_threshold_count = optional(number, 2)
metric_lookback_minutes = optional(number, 1)
auto_close_seconds = optional(number, 3600)
enabled = optional(bool, true)
filter_extra = optional(string, "")
namespace = optional(string, "kyverno")
})
}
118 changes: 118 additions & 0 deletions kyverno_log_alert.tf
@@ -0,0 +1,118 @@
locals {
kyverno_project_id = var.kyverno.project_id != null ? var.kyverno.project_id : var.project_id
alert_documentation = var.kyverno.alert_documentation != null ? var.kyverno.alert_documentation : "Kyverno controllers produced ERROR logs in namespace ${var.kyverno.namespace}."
kyverno_notification_channels = var.kyverno.notification_enabled ? (length(var.kyverno.notification_channels) > 0 ? var.kyverno.notification_channels : var.notification_channels) : []

kyverno_log_filter = <<-EOT
resource.type="k8s_container"
resource.labels.project_id="${local.kyverno_project_id}"
resource.labels.cluster_name="${var.kyverno.cluster_name}"
resource.labels.namespace_name="${var.kyverno.namespace}"
severity>=ERROR
(
labels."k8s-pod/app_kubernetes_io/component"=~"(admission-controller|background-controller|cleanup-controller|reports-controller)"
OR resource.labels.pod_name=~"kyverno-(admission|background|cleanup|reports)-controller-.*"
)
${trimspace(var.kyverno.filter_extra)}
EOT

kyverno_metric_name = lower(replace(
"kyverno_error_logs_count_${var.kyverno.cluster_name}_${var.kyverno.namespace}",
"/[^a-zA-Z0-9_]/", "_"
))
}

resource "google_monitoring_alert_policy" "kyverno_logmatch_alert" {
count = (
var.kyverno.enabled
&& !var.kyverno.use_metric_threshold
&& trimspace(var.kyverno.cluster_name) != ""
) ? 1 : 0

display_name = "Kyverno controllers ERROR logs (namespace=${var.kyverno.namespace})"
combiner = "OR"
enabled = var.kyverno.enabled

conditions {
display_name = "Kyverno ERROR in logs"
condition_matched_log {
filter = local.kyverno_log_filter
}
}

documentation {
content = local.alert_documentation
mime_type = "text/markdown"
}

notification_channels = local.kyverno_notification_channels

alert_strategy {
auto_close = "${var.kyverno.auto_close_seconds}s"
notification_rate_limit {
period = var.kyverno.logmatch_notification_rate_limit
}
}
}

resource "google_logging_metric" "kyverno_error_metric" {
count = (
var.kyverno.enabled
&& var.kyverno.use_metric_threshold
&& trimspace(var.kyverno.cluster_name) != ""
) ? 1 : 0

name = local.kyverno_metric_name
description = "Count of ERROR+ logs from Kyverno controllers in namespace ${var.kyverno.namespace}"
filter = local.kyverno_log_filter

metric_descriptor {
metric_kind = "DELTA"
value_type = "INT64"
unit = "1"
}
}

resource "google_monitoring_alert_policy" "kyverno_metric_threshold_alert" {
count = (
var.kyverno.enabled
&& var.kyverno.use_metric_threshold
&& trimspace(var.kyverno.cluster_name) != ""
) ? 1 : 0

display_name = "Kyverno ERROR rate alert (namespace=${var.kyverno.namespace})"
combiner = "OR"
enabled = var.kyverno.enabled

conditions {
display_name = "Kyverno ERROR rate alert >= ${var.kyverno.metric_threshold_count} logs in ${var.kyverno.metric_lookback_minutes} min (namespace ${var.kyverno.namespace})"
condition_threshold {
filter = "metric.type=\"logging.googleapis.com/user/${local.kyverno_metric_name}\" resource.type=\"global\""
comparison = "COMPARISON_GE"
threshold_value = var.kyverno.metric_threshold_count
duration = "0s"

aggregations {
alignment_period = "${var.kyverno.metric_lookback_minutes * 60}s"
per_series_aligner = "ALIGN_DELTA"
cross_series_reducer = "REDUCE_SUM"
group_by_fields = []
}

trigger {
count = 1
}
}
}

documentation {
content = local.alert_documentation
mime_type = "text/markdown"
}

notification_channels = local.kyverno_notification_channels

alert_strategy {
auto_close = "${var.kyverno.auto_close_seconds}s"
}
}
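
To make the naming logic above concrete, here is a standalone sketch of how `kyverno_metric_name` and the threshold filter would evaluate for one set of inputs. The cluster name and namespace are hypothetical sample values; the expressions are copied from the locals in this file.

```hcl
# Standalone sketch; input values are hypothetical, expressions mirror the module.
locals {
  cluster_name     = "test-cluster"
  namespace        = "kyverno"
  lookback_minutes = 1

  # Any character outside [a-zA-Z0-9_] is replaced with "_", then lowercased,
  # so the hyphen in "test-cluster" becomes an underscore.
  metric_name = lower(replace(
    "kyverno_error_logs_count_${local.cluster_name}_${local.namespace}",
    "/[^a-zA-Z0-9_]/", "_"
  ))
}

output "metric_name" {
  # Evaluates to "kyverno_error_logs_count_test_cluster_kyverno"
  value = local.metric_name
}

output "threshold_filter" {
  # The alert policy looks the metric up under the user-defined logging prefix.
  value = "metric.type=\"logging.googleapis.com/user/${local.metric_name}\" resource.type=\"global\""
}

output "alignment_period" {
  # Evaluates to "60s" for a one-minute lookback.
  value = "${local.lookback_minutes * 60}s"
}
```

Running `terraform apply` (or evaluating the locals in `terraform console`) on this snippet alone needs no provider, which makes it a cheap way to check the sanitized metric name before deploying the module.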
1 change: 1 addition & 0 deletions main.tf
@@ -0,0 +1 @@
