Commit 6dd6630

refs platform/#3911: add CPU utilization monitoring alerts for Memorystore Redis (#27)
1 parent 2564daf commit 6dd6630

7 files changed: +396 -8 lines changed

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -8,6 +8,14 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

 ## [Unreleased]

+## [0.15.0] - 2026-02-09
+
+[Compare with previous version](https://github.com/sparkfabrik/terraform-google-services-monitoring/compare/0.14.0...0.15.0)
+
+### Added
+
+- refs platform/#3911: add CPU utilization and Memory usage monitoring alerts for Redis instances and clusters
+
 ## [0.14.0] - 2026-02-05

 [Compare with previous version](https://github.com/sparkfabrik/terraform-google-services-monitoring/compare/0.13.0...0.14.0)

README.md

Lines changed: 15 additions & 8 deletions
@@ -5,33 +5,31 @@ This module creates a set of monitoring alerts for Google Cloud Platform service
 Supported services:

 - Cloud SQL
-
   - CPU usage
   - Storage usage
   - Memory usage

-- Kyverno
+- Memorystore
+  - CPU utilization alerts for Redis instances and Redis clusters
+  - Memory (system memory usage ratio) alerts for Redis instances and Redis clusters

+- Kyverno
   - Error logs for admission-controller, background-controller, cleanup-controller, reports-controller

 - cert-manager
-
   - Error logs for cert-manager controller when an Issuer or ClusterIssuer is missing

 - Konnectivity agent
-
   - Alert when no pods are available for the konnectivity-agent deployment
-- SSL certificate expiration

+- SSL certificate expiration
   - SSL certificate expiry alerts for monitored endpoints

 - Typesense
-
   - Uptime checks for HTTP endpoints
   - Pod restart alerts for Typesense containers

 - LiteLLM
-
   - Uptime checks for HTTP endpoints
   - Pod restart alerts for LiteLLM containers

@@ -40,7 +38,7 @@ Supported services:

 | Name | Version |
 |------|---------|
-| <a name="provider_google"></a> [google](#provider\_google) | >= 5.10 |
+| <a name="provider_google"></a> [google](#provider\_google) | 7.18.0 |

 ## Requirements

@@ -58,6 +56,7 @@ Supported services:
 | <a name="input_konnectivity_agent"></a> [konnectivity\_agent](#input\_konnectivity\_agent) | Configuration for Konnectivity agent deployment replica alert in GKE. Triggers when there are no available replicas. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = optional(string, null)<br/> project_id = optional(string, null)<br/> namespace = optional(string, "kube-system")<br/> deployment_name = optional(string, "konnectivity-agent")<br/> duration_seconds = optional(number, 60)<br/> auto_close_seconds = optional(number, 3600)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> notification_prompts = optional(list(string), null)<br/> })</pre> | `{}` | no |
 | <a name="input_kyverno"></a> [kyverno](#input\_kyverno) | Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, message pattern inclusions/exclusions for jsonPayload.message matching, and namespace. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = optional(string, null)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> # Rate limit for notifications, e.g. "300s" for 5 minutes, used only for log match alerts<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> namespace = optional(string, "kyverno")<br/> # List of message patterns to exclude from the default set (matches against jsonPayload.message).<br/> # Default patterns available for exclusion:<br/> # "failed to list resources", "failed to watch resource", "failed to start watcher",<br/> # "failed to sync", "failed to run warmup", "failed to load certificate",<br/> # "failed to update lock", "failed to process request", "failed to check permissions",<br/> # "failed to scan resource", "failed to fetch data", "failed to substitute variables",<br/> # "failed calling webhook", "leader election lost", "dropping request", "panic"<br/> error_patterns_exclude = optional(list(string), [])<br/> # List of additional regex message patterns to include (added to default set)<br/> # e.g. ["failed to authenticate.", "failed to connect."]<br/> error_patterns_include = optional(list(string), [])<br/> })</pre> | `{}` | no |
 | <a name="input_litellm"></a> [litellm](#input\_litellm) | Configuration for LiteLLM monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null)<br/><br/> apps = optional(map(object({<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/health/readiness")<br/> }), null)<br/><br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 180)<br/> auto_close_seconds = optional(number, 3600)<br/> notification_prompts = optional(list(string), null)<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |
+| <a name="input_memorystore"></a> [memorystore](#input\_memorystore) | Configuration for GCP Memorystore (Redis) CPU monitoring alerts. Supports both Redis instances and Redis clusters with multiple threshold levels. Each resource is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> auto_close = optional(string, "86400s") # default 24h<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/><br/> instances = optional(map(object({<br/> cpu_utilization = optional(list(object({<br/> severity = optional(string, "WARNING")<br/> threshold = optional(number, 0.80)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), []<br/> )<br/> memory_utilization = optional(list(object({<br/> severity = optional(string, "WARNING")<br/> threshold = optional(number, 0.80)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.80,<br/> }<br/> ])<br/> })), {})<br/><br/> clusters = optional(map(object({<br/> cpu_utilization = optional(list(object({<br/> severity = optional(string, "WARNING")<br/> threshold = optional(number, 0.80)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), []<br/> )<br/> memory_utilization = optional(list(object({<br/> severity = optional(string, "WARNING")<br/> threshold = optional(number, 0.80)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.80,<br/> }<br/> ])<br/> })), {})<br/> })</pre> | `{}` | no |
 | <a name="input_notification_channels"></a> [notification\_channels](#input\_notification\_channels) | List of notification channel IDs to notify when an alert is triggered | `list(string)` | `[]` | no |
 | <a name="input_project_id"></a> [project\_id](#input\_project\_id) | The Google Cloud project ID where logging exclusions will be created | `string` | n/a | yes |
 | <a name="input_ssl_alert"></a> [ssl\_alert](#input\_ssl\_alert) | Configuration for SSL certificate expiration alerts. Allows customization of project, notification channels, alert thresholds, and user labels. | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> threshold_days = optional(list(number), [15, 7])<br/> user_labels = optional(map(string), {})<br/> })</pre> | `{}` | no |
@@ -70,6 +69,10 @@ Supported services:
 | <a name="output_cloud_sql_cpu_utilization"></a> [cloud\_sql\_cpu\_utilization](#output\_cloud\_sql\_cpu\_utilization) | n/a |
 | <a name="output_cloud_sql_disk_utilization"></a> [cloud\_sql\_disk\_utilization](#output\_cloud\_sql\_disk\_utilization) | n/a |
 | <a name="output_cloud_sql_memory_utilization"></a> [cloud\_sql\_memory\_utilization](#output\_cloud\_sql\_memory\_utilization) | n/a |
+| <a name="output_memorystore_cluster_cpu_utilization"></a> [memorystore\_cluster\_cpu\_utilization](#output\_memorystore\_cluster\_cpu\_utilization) | n/a |
+| <a name="output_memorystore_cluster_memory_utilization"></a> [memorystore\_cluster\_memory\_utilization](#output\_memorystore\_cluster\_memory\_utilization) | n/a |
+| <a name="output_memorystore_instance_cpu_utilization"></a> [memorystore\_instance\_cpu\_utilization](#output\_memorystore\_instance\_cpu\_utilization) | n/a |
+| <a name="output_memorystore_instance_memory_utilization"></a> [memorystore\_instance\_memory\_utilization](#output\_memorystore\_instance\_memory\_utilization) | n/a |
 | <a name="output_ssl_alert_policy_names"></a> [ssl\_alert\_policy\_names](#output\_ssl\_alert\_policy\_names) | n/a |

 ## Resources
@@ -83,6 +86,10 @@ Supported services:
 | [google_monitoring_alert_policy.konnectivity_agent_replicas](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
 | [google_monitoring_alert_policy.kyverno_logmatch_alert](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
 | [google_monitoring_alert_policy.litellm_pod_restart](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
+| [google_monitoring_alert_policy.memorystore_cluster_cpu](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
+| [google_monitoring_alert_policy.memorystore_cluster_memory](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
+| [google_monitoring_alert_policy.memorystore_instance_cpu](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
+| [google_monitoring_alert_policy.memorystore_instance_memory](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
 | [google_monitoring_alert_policy.ssl_expiring_days](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
 | [google_monitoring_alert_policy.typesense_pod_restart](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
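
The resource definitions themselves are not shown in this part of the diff. As a rough illustration only, a memory alert policy of the kind listed in the Resources table might look like the sketch below. The resource name, display names, and the metric filter are assumptions rather than the module's actual code; the threshold, durations, and auto-close value simply reuse the documented defaults.

```hcl
# Hypothetical sketch, not taken from the module source.
variable "project_id" {
  type = string
}

resource "google_monitoring_alert_policy" "memorystore_instance_memory_sketch" {
  project      = var.project_id
  display_name = "Memorystore instance memory usage (CRITICAL)"
  combiner     = "OR"
  severity     = "CRITICAL"

  conditions {
    display_name = "System memory usage ratio above 80%"

    condition_threshold {
      # Assumed filter: the Redis instance system memory usage ratio metric.
      # A real policy would also narrow this to a single instance by label.
      filter = join(" AND ", [
        "resource.type = \"redis_instance\"",
        "metric.type = \"redis.googleapis.com/stats/memory/system_memory_usage_ratio\"",
      ])
      comparison      = "COMPARISON_GT"
      threshold_value = 0.80   # documented default threshold
      duration        = "300s" # documented default duration

      aggregations {
        alignment_period   = "300s" # documented default alignment_period
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  alert_strategy {
    auto_close = "86400s" # documented default (24h)
  }
}
```

The `ALIGN_MEAN` aligner is a guess that suits a gauge-style ratio metric; the module may aggregate differently.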

examples/main.tf

Lines changed: 36 additions & 0 deletions
@@ -104,4 +104,40 @@ module "example" {
       }
     }
   }
+  memorystore = {
+    enabled    = true
+    project_id = "my-gcp-project"
+
+    instances = {
+      "my-redis-instance-1" = {
+        cpu_utilization = [
+          {
+            severity         = "WARNING"
+            threshold        = 0.80
+            alignment_period = "300s"
+            duration         = "300s"
+          },
+          {
+            severity         = "CRITICAL"
+            threshold        = 0.90
+            alignment_period = "300s"
+            duration         = "300s"
+          }
+        ]
+      }
+      # Use default thresholds (memory_utilization CRITICAL at 80%)
+      "my-redis-instance-2" = {}
+    }
+
+    clusters = {
+      "my-redis-cluster-1" = {
+        cpu_utilization = [
+          {
+            threshold = 0.85
+            duration  = "600s"
+          }
+        ]
+      }
+    }
+  }
 }
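
Each instance or cluster in the example carries a list of threshold rules, while a Terraform `for_each` needs a flat map with unique keys. One plausible way to bridge the two is sketched below. The local names and the key format are hypothetical, and the module may do this differently.

```hcl
locals {
  # Hypothetical input mirroring the shape of the `instances` map above.
  instances = {
    "my-redis-instance-1" = {
      cpu_utilization = [
        { severity = "WARNING", threshold = 0.80 },
        { severity = "CRITICAL", threshold = 0.90 },
      ]
    }
    "my-redis-instance-2" = {
      cpu_utilization = []
    }
  }

  # One entry per (instance, rule) pair, keyed by "<instance>-<severity>".
  instance_cpu_rules = {
    for rule in flatten([
      for name, cfg in local.instances : [
        for r in cfg.cpu_utilization : {
          key       = "${name}-${lower(r.severity)}"
          instance  = name
          severity  = r.severity
          threshold = r.threshold
        }
      ]
    ]) : rule.key => rule
  }
}

# Exposes the generated keys so the result can be inspected after `terraform apply`.
output "instance_cpu_rule_keys" {
  value = keys(local.instance_cpu_rules)
}
```

With this key format, two rules of the same severity on one instance would produce a duplicate key and fail at plan time, so a real implementation might add the list index to the key.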

examples/test.tfvars

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@ notification_channels = [
   "cloud_support_email",
   "slack-channel"
 ]
+
