Commit 5d51918

add memory usage alert
1 parent 69d2717 commit 5d51918

6 files changed: +148 -13 lines changed

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

 ### Added

-- refs platform/#3911: add CPU utilization monitoring alerts for Redis instances and clusters
+- refs platform/#3911: add CPU utilization and Memory usage monitoring alerts for Redis instances and clusters

 ## [0.14.0] - 2026-02-05

README.md

Lines changed: 6 additions & 1 deletion
@@ -11,6 +11,7 @@ Supported services:

 - Memorystore
   - CPU utilization alerts for Redis instances and Redis clusters
+  - Memory (system memory usage ratio) alerts for Redis instances and Redis clusters

 - Kyverno
   - Error logs for admission-controller, background-controller, cleanup-controller, reports-controller
@@ -55,7 +56,7 @@ Supported services:
| <a name="input_konnectivity_agent"></a> [konnectivity\_agent](#input\_konnectivity\_agent) | Configuration for Konnectivity agent deployment replica alert in GKE. Triggers when there are no available replicas. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = optional(string, null)<br/> project_id = optional(string, null)<br/> namespace = optional(string, "kube-system")<br/> deployment_name = optional(string, "konnectivity-agent")<br/> duration_seconds = optional(number, 60)<br/> auto_close_seconds = optional(number, 3600)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> notification_prompts = optional(list(string), null)<br/> })</pre> | `{}` | no |
| <a name="input_kyverno"></a> [kyverno](#input\_kyverno) | Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, message pattern inclusions/exclusions for jsonPayload.message matching, and namespace. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = optional(string, null)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> # Rate limit for notifications, e.g. "300s" for 5 minutes, used only for log match alerts<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> namespace = optional(string, "kyverno")<br/> # List of message patterns to exclude from the default set (matches against jsonPayload.message).<br/> # Default patterns available for exclusion:<br/> # "failed to list resources", "failed to watch resource", "failed to start watcher",<br/> # "failed to sync", "failed to run warmup", "failed to load certificate",<br/> # "failed to update lock", "failed to process request", "failed to check permissions",<br/> # "failed to scan resource", "failed to fetch data", "failed to substitute variables",<br/> # "failed calling webhook", "leader election lost", "dropping request", "panic"<br/> error_patterns_exclude = optional(list(string), [])<br/> # List of additional regex message patterns to include (added to default set)<br/> # e.g. ["failed to authenticate.", "failed to connect."]<br/> error_patterns_include = optional(list(string), [])<br/> })</pre> | `{}` | no |
| <a name="input_litellm"></a> [litellm](#input\_litellm) | Configuration for LiteLLM monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null)<br/><br/> apps = optional(map(object({<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/health/readiness")<br/> }), null)<br/><br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 180)<br/> auto_close_seconds = optional(number, 3600)<br/> notification_prompts = optional(list(string), null)<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |
58-
| <a name="input_memorystore"></a> [memorystore](#input\_memorystore) | Configuration for GCP Memorystore (Redis) CPU monitoring alerts. Supports both Redis instances and Redis clusters with multiple threshold levels. Each resource is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> auto_close = optional(string, "86400s") # default 24h<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/><br/> instances = optional(map(object({<br/> cpu_utilization = optional(list(object({<br/> severity = optional(string, "WARNING")<br/> threshold = optional(number, 0.80)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.90,<br/> }<br/> ])<br/> })), {})<br/><br/> clusters = optional(map(object({<br/> cpu_utilization = optional(list(object({<br/> severity = optional(string, "WARNING")<br/> threshold = optional(number, 0.80)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.90,<br/> }<br/> ])<br/> })), {})<br/> })</pre> | `{}` | no |
59+
| <a name="input_memorystore"></a> [memorystore](#input\_memorystore) | Configuration for GCP Memorystore (Redis) CPU monitoring alerts. Supports both Redis instances and Redis clusters with multiple threshold levels. Each resource is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> auto_close = optional(string, "86400s") # default 24h<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/><br/> instances = optional(map(object({<br/> cpu_utilization = optional(list(object({<br/> severity = optional(string, "WARNING")<br/> threshold = optional(number, 0.80)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), []<br/> )<br/> memory_utilization = optional(list(object({<br/> severity = optional(string, "WARNING")<br/> threshold = optional(number, 0.80)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.80,<br/> }<br/> ])<br/> })), {})<br/><br/> clusters = optional(map(object({<br/> cpu_utilization = optional(list(object({<br/> severity = optional(string, "WARNING")<br/> threshold = optional(number, 0.80)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), []<br/> )<br/> memory_utilization = optional(list(object({<br/> severity = optional(string, "WARNING")<br/> threshold = optional(number, 0.80)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.80,<br/> }<br/> ])<br/> })), {})<br/> })</pre> | `{}` | no |
| <a name="input_notification_channels"></a> [notification\_channels](#input\_notification\_channels) | List of notification channel IDs to notify when an alert is triggered | `list(string)` | `[]` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | The Google Cloud project ID where logging exclusions will be created | `string` | n/a | yes |
| <a name="input_ssl_alert"></a> [ssl\_alert](#input\_ssl\_alert) | Configuration for SSL certificate expiration alerts. Allows customization of project, notification channels, alert thresholds, and user labels. | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> threshold_days = optional(list(number), [15, 7])<br/> user_labels = optional(map(string), {})<br/> })</pre> | `{}` | no |
@@ -69,7 +70,9 @@ Supported services:
 | <a name="output_cloud_sql_disk_utilization"></a> [cloud\_sql\_disk\_utilization](#output\_cloud\_sql\_disk\_utilization) | n/a |
 | <a name="output_cloud_sql_memory_utilization"></a> [cloud\_sql\_memory\_utilization](#output\_cloud\_sql\_memory\_utilization) | n/a |
 | <a name="output_memorystore_cluster_cpu_utilization"></a> [memorystore\_cluster\_cpu\_utilization](#output\_memorystore\_cluster\_cpu\_utilization) | n/a |
+| <a name="output_memorystore_cluster_memory_utilization"></a> [memorystore\_cluster\_memory\_utilization](#output\_memorystore\_cluster\_memory\_utilization) | n/a |
 | <a name="output_memorystore_instance_cpu_utilization"></a> [memorystore\_instance\_cpu\_utilization](#output\_memorystore\_instance\_cpu\_utilization) | n/a |
+| <a name="output_memorystore_instance_memory_utilization"></a> [memorystore\_instance\_memory\_utilization](#output\_memorystore\_instance\_memory\_utilization) | n/a |
 | <a name="output_ssl_alert_policy_names"></a> [ssl\_alert\_policy\_names](#output\_ssl\_alert\_policy\_names) | n/a |

 ## Resources
@@ -84,7 +87,9 @@ Supported services:
 | [google_monitoring_alert_policy.kyverno_logmatch_alert](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
 | [google_monitoring_alert_policy.litellm_pod_restart](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
 | [google_monitoring_alert_policy.memorystore_cluster_cpu](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
+| [google_monitoring_alert_policy.memorystore_cluster_memory](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
 | [google_monitoring_alert_policy.memorystore_instance_cpu](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
+| [google_monitoring_alert_policy.memorystore_instance_memory](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
 | [google_monitoring_alert_policy.ssl_expiring_days](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
 | [google_monitoring_alert_policy.typesense_pod_restart](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |

examples/main.tf

Lines changed: 1 addition & 9 deletions
@@ -125,10 +125,7 @@ module "example" {
           }
         ]
       }
-      "my-redis-instance-2" = {
-        # Use default thresholds (WARNING at 80%, CRITICAL at 90%)
-        cpu_utilization = []
-      }
+      "my-redis-instance-2" = {}
     }

     clusters = {
@@ -137,11 +134,6 @@ module "example" {
           {
             threshold = 0.85
             duration = "600s"
-          },
-          {
-            severity = "CRITICAL"
-            threshold = 0.95
-            duration = "300s"
           }
         ]
       }

memorystore.tf

Lines changed: 116 additions & 0 deletions
@@ -37,6 +37,38 @@ locals {
       ]
     ) : "${item.cluster}--${item.severity}--${item.threshold}" => item
   } : {}
+
+  memorystore_instance_memory_utilization = var.memorystore.enabled ? {
+    for item in flatten(
+      [
+        for instance, instance_config in var.memorystore.instances : [
+          for memory_utilization in instance_config.memory_utilization :
+          merge(
+            {
+              "instance" : instance,
+            },
+            memory_utilization
+          )
+        ]
+      ]
+    ) : "${item.instance}--${item.severity}--${item.threshold}" => item
+  } : {}
+
+  memorystore_cluster_memory_utilization = var.memorystore.enabled ? {
+    for item in flatten(
+      [
+        for cluster, cluster_config in var.memorystore.clusters : [
+          for memory_utilization in cluster_config.memory_utilization :
+          merge(
+            {
+              "cluster" : cluster,
+            },
+            memory_utilization
+          )
+        ]
+      ]
+    ) : "${item.cluster}--${item.severity}--${item.threshold}" => item
+  } : {}
 }

 # ----------------------
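To make the map keys built by these locals concrete, here is a minimal standalone sketch of the same flatten-and-merge pattern. The `instances` local and the `memory_alert_keys` output are hypothetical stand-ins for `var.memorystore.instances` and are not part of the commit.

```hcl
locals {
  # Hypothetical input shaped like var.memorystore.instances.
  instances = {
    "my-redis" = {
      memory_utilization = [
        { severity = "WARNING", threshold = 0.7 },
        { severity = "CRITICAL", threshold = 0.9 }
      ]
    }
  }

  # One map entry per (instance, threshold definition) pair.
  memory_alerts = {
    for item in flatten([
      for instance, cfg in local.instances : [
        # Attach the instance name to each threshold definition.
        for m in cfg.memory_utilization : merge({ instance = instance }, m)
      ]
    ]) : "${item.instance}--${item.severity}--${item.threshold}" => item
  }
}

# Lists the generated keys, each of the form <instance>--<severity>--<threshold>.
output "memory_alert_keys" {
  value = keys(local.memory_alerts)
}
```

Because the key combines instance, severity and threshold, two entries for the same instance with identical severity and threshold would yield a duplicate key, which Terraform rejects when it builds the map.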
@@ -86,6 +118,48 @@ resource "google_monitoring_alert_policy" "memorystore_instance_cpu" {
   }
 }

+# ----------------------
+# Memorystore Redis Instance Memory Utilization
+# ----------------------
+resource "google_monitoring_alert_policy" "memorystore_instance_memory" {
+  for_each = local.memorystore_instance_memory_utilization
+
+  project = local.memorystore_project
+  display_name = "Memorystore ${element(reverse(split("/", each.value.instance)), 0)} Memory utilization ${each.value.severity} > ${each.value.threshold * 100}%"
+  combiner = "OR"
+  severity = each.value.severity
+
+  conditions {
+    condition_threshold {
+      filter = <<-EOT
+        resource.type = "redis_instance"
+        AND resource.labels.instance_id = "${each.value.instance}"
+        AND metric.type = "redis.googleapis.com/stats/memory/system_memory_usage_ratio"
+      EOT
+
+      comparison = "COMPARISON_GT"
+      threshold_value = each.value.threshold
+      duration = each.value.duration
+
+      aggregations {
+        alignment_period = each.value.alignment_period
+        per_series_aligner = "ALIGN_MEAN"
+      }
+
+      trigger {
+        count = 1
+      }
+    }
+    display_name = "Memorystore ${element(reverse(split("/", each.value.instance)), 0)} Memory utilization ${each.value.severity} > ${each.value.threshold * 100}%"
+  }
+
+  notification_channels = local.memorystore_notification_channels
+
+  alert_strategy {
+    auto_close = var.memorystore.auto_close
+  }
+}
+
 # ----------------------
 # Memorystore Redis Cluster CPU Utilization
 # ----------------------
@@ -132,3 +206,45 @@ resource "google_monitoring_alert_policy" "memorystore_cluster_cpu" {
     auto_close = var.memorystore.auto_close
   }
 }
+
+# ----------------------
+# Memorystore Redis Cluster Memory Utilization
+# ----------------------
+resource "google_monitoring_alert_policy" "memorystore_cluster_memory" {
+  for_each = local.memorystore_cluster_memory_utilization
+
+  project = local.memorystore_project
+  display_name = "Memorystore ${element(reverse(split("/", each.value.cluster)), 0)} Memory utilization ${each.value.severity} > ${each.value.threshold * 100}%"
+  combiner = "OR"
+  severity = each.value.severity
+
+  conditions {
+    condition_threshold {
+      filter = <<-EOT
+        resource.type = "redis_cluster"
+        AND resource.labels.cluster_id = "${each.value.cluster}"
+        AND metric.type = "redis.googleapis.com/cluster/stats/memory/system_memory_usage_ratio"
+      EOT
+
+      comparison = "COMPARISON_GT"
+      threshold_value = each.value.threshold
+      duration = each.value.duration
+
+      aggregations {
+        alignment_period = each.value.alignment_period
+        per_series_aligner = "ALIGN_MEAN"
+      }
+
+      trigger {
+        count = 1
+      }
+    }
+    display_name = "Memorystore ${element(reverse(split("/", each.value.cluster)), 0)} Memory utilization ${each.value.severity} > ${each.value.threshold * 100}%"
+  }
+
+  notification_channels = local.memorystore_notification_channels
+
+  alert_strategy {
+    auto_close = var.memorystore.auto_close
+  }
+}
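The `display_name` expressions in both memory resources shorten the map key with `element(reverse(split("/", ...)), 0)`. The sketch below isolates that expression; the resource path is a hypothetical value, since the commit does not show whether keys are short names or full resource paths.

```hcl
locals {
  # Hypothetical key; a plain name such as "my-redis" would pass through unchanged.
  redis_name = "projects/my-project/locations/europe-west1/instances/my-redis"

  # split("/") yields the path segments, reverse() puts the last segment first,
  # and element(..., 0) selects it, leaving only the short name.
  redis_short_name = element(reverse(split("/", local.redis_name)), 0)
}

output "redis_short_name" {
  value = local.redis_short_name # "my-redis"
}
```

Note that the filter in each resource still uses the full, unshortened key for `instance_id` or `cluster_id`; only the display name is shortened.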

outputs.tf

Lines changed: 8 additions & 0 deletions
@@ -14,10 +14,18 @@ output "memorystore_instance_cpu_utilization" {
   value = { for k, v in google_monitoring_alert_policy.memorystore_instance_cpu : k => v.name }
 }

+output "memorystore_instance_memory_utilization" {
+  value = { for k, v in google_monitoring_alert_policy.memorystore_instance_memory : k => v.name }
+}
+
 output "memorystore_cluster_cpu_utilization" {
   value = { for k, v in google_monitoring_alert_policy.memorystore_cluster_cpu : k => v.name }
 }

+output "memorystore_cluster_memory_utilization" {
+  value = { for k, v in google_monitoring_alert_policy.memorystore_cluster_memory : k => v.name }
+}
+
 output "ssl_alert_policy_names" {
   value = { for days, alert in google_monitoring_alert_policy.ssl_expiring_days : days => alert.name }
 }
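A caller could combine the two new outputs into a single map of memory alert policy names. This is only a sketch: it assumes the module is instantiated as `module "example"`, as in `examples/main.tf`, and the output name `redis_memory_alert_policy_names` is hypothetical.

```hcl
# Assumes a root module that declares `module "example"` using this module.
output "redis_memory_alert_policy_names" {
  # Both outputs map "<name>--<severity>--<threshold>" keys to policy names.
  value = merge(
    module.example.memorystore_instance_memory_utilization,
    module.example.memorystore_cluster_memory_utilization
  )
}
```

If an instance and a cluster share the same name, severity and threshold, their keys coincide and `merge` keeps the cluster entry, so keep the outputs separate when that can happen.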

variables.tf

Lines changed: 16 additions & 2 deletions
@@ -320,28 +320,42 @@ variable "memorystore" {

     instances = optional(map(object({
       cpu_utilization = optional(list(object({
+        severity = optional(string, "WARNING")
+        threshold = optional(number, 0.80)
+        alignment_period = optional(string, "300s")
+        duration = optional(string, "300s")
+      })), []
+      )
+      memory_utilization = optional(list(object({
         severity = optional(string, "WARNING")
         threshold = optional(number, 0.80)
         alignment_period = optional(string, "300s")
         duration = optional(string, "300s")
       })), [
         {
           severity = "CRITICAL",
-          threshold = 0.90,
+          threshold = 0.80,
         }
       ])
     })), {})

     clusters = optional(map(object({
       cpu_utilization = optional(list(object({
+        severity = optional(string, "WARNING")
+        threshold = optional(number, 0.80)
+        alignment_period = optional(string, "300s")
+        duration = optional(string, "300s")
+      })), []
+      )
+      memory_utilization = optional(list(object({
         severity = optional(string, "WARNING")
         threshold = optional(number, 0.80)
         alignment_period = optional(string, "300s")
         duration = optional(string, "300s")
       })), [
         {
           severity = "CRITICAL",
-          threshold = 0.90,
+          threshold = 0.80,
         }
       ])
     })), {})
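The fragment below sketches how a caller might set the new `memory_utilization` attribute. It is meant to sit inside a module block as the value of the `memorystore` input; the instance names and thresholds are hypothetical.

```hcl
# Value for the module's `memorystore` input; instance names are hypothetical.
memorystore = {
  enabled = true

  instances = {
    # Explicit thresholds: WARNING (the default severity) at 0.70,
    # and CRITICAL at 0.90 sustained for 600 seconds.
    "my-redis-instance" = {
      memory_utilization = [
        { threshold = 0.70 },
        { severity = "CRITICAL", threshold = 0.90, duration = "600s" }
      ]
    }

    # Empty object: falls back to the variable defaults from this diff,
    # i.e. no CPU alerts and one CRITICAL memory alert at 0.80.
    "my-redis-instance-2" = {}
  }
}
```

Going by the hunk above, the default for `cpu_utilization` changes from one CRITICAL entry at 0.90 to an empty list, so callers that relied on the old default would likely need to list their CPU thresholds explicitly.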
