Commit 81ce825

update
1 parent c692102 commit 81ce825

3 files changed: 27 additions and 27 deletions

README.md

Lines changed: 5 additions & 5 deletions
@@ -38,11 +38,11 @@ Supported services:
| <a name="input_cert_manager"></a> [cert\_manager](#input\_cert\_manager) | Configuration for cert-manager missing issuer log alert. Allows customization of project, cluster, namespace, notification channels, alert documentation, enablement, extra filters, auto-close timing, and notification rate limiting. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = string<br/> project_id = optional(string, null)<br/> namespace = optional(string, "cert-manager")<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> filter_extra = optional(string, "")<br/> })</pre> | n/a | yes |
| <a name="input_cloud_sql"></a> [cloud\_sql](#input\_cloud\_sql) | Configuration for Cloud SQL monitoring alerts. Supports customization of project, auto-close timing, notification channels, and per-instance alert thresholds for CPU, memory, and disk utilization. | <pre>object({<br/> project_id = optional(string, null)<br/> auto_close = optional(string, "86400s") # default 24h<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> instances = optional(map(object({<br/> cpu_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.90)<br/> alignment_period = optional(string, "120s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> threshold = 0.85,<br/> duration = "1200s",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 1,<br/> duration = "300s",<br/> alignment_period = "60s",<br/> }<br/> ])<br/> memory_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.90)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> severity = "WARNING",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.95,<br/> }<br/> ])<br/> disk_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.85)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "600s")<br/> })), [<br/> {<br/> severity = "WARNING",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.95,<br/> }<br/> ])<br/> })), {})<br/> })</pre> | n/a | yes |
| <a name="input_kyverno"></a> [kyverno](#input\_kyverno) | Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, extra filters, and namespace. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = string<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> # Rate limit for notifications, e.g. "300s" for 5 minutes, used only for log match alerts<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> filter_extra = optional(string, "")<br/> namespace = optional(string, "kyverno")<br/> })</pre> | n/a | yes |
-| <a name="input_lite_llm"></a> [lite\_llm](#input\_lite\_llm) | Configuration for LiteLLM monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). For container checks, the app name corresponds to the Kubernetes 'app' label; for apps with only uptime checks, this correspondence does not apply. | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null)<br/><br/> apps = optional(map(object({<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/health/readiness")<br/> }), null)<br/><br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 0)<br/> auto_close_seconds = optional(number, 3600)<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |
+| <a name="input_litellm"></a> [litellm](#input\_litellm) | Configuration for LiteLLM monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null)<br/><br/> apps = optional(map(object({<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/health/readiness")<br/> }), null)<br/><br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 0)<br/> auto_close_seconds = optional(number, 3600)<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |
| <a name="input_notification_channels"></a> [notification\_channels](#input\_notification\_channels) | List of notification channel IDs to notify when an alert is triggered | `list(string)` | `[]` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | The Google Cloud project ID where logging exclusions will be created | `string` | n/a | yes |
-| <a name="input_ssl_alert"></a> [ssl\_alert](#input\_ssl\_alert) | Configuration for SSL certificate expiration alerts. Allows customization of project, notification channels, alert thresholds, and user labels. | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> threshold_days = optional(list(number), [15, 7])<br/> user_labels = optional(map(string), {})<br/> })</pre> | `{}` | no |
-| <a name="input_typesense"></a> [typesense](#input\_typesense) | Configuration for Typesense monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). For container checks, the app name corresponds to the Kubernetes 'app' label; for apps with only uptime checks, this correspondence does not apply. | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null) # GKE cluster name for container checks<br/><br/> # Apps configuration - map keyed by app_name<br/> apps = optional(map(object({<br/> # Uptime check configuration (optional)<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/readyz")<br/> }), null)<br/><br/> # Container check configuration for GKE (optional)<br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 0)<br/> auto_close_seconds = optional(number, 3600)<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |
+| <a name="input_ssl_alert"></a> [ssl\_alert](#input\_ssl\_alert) | Configuration for SSL certificate expiration alerts. Allows customization of project, notification channels, alert thresholds, and user labels. | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> threshold_days = optional(list(number), [15, 7])<br/> user_labels = optional(map(string), {})<br/> })</pre> | `{}` | no |
+| <a name="input_typesense"></a> [typesense](#input\_typesense) | Configuration for Typesense monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null) # GKE cluster name for container checks<br/><br/> # Apps configuration - map keyed by app_name<br/> apps = optional(map(object({<br/> # Uptime check configuration (optional)<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/readyz")<br/> }), null)<br/><br/> # Container check configuration for GKE (optional)<br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 0)<br/> auto_close_seconds = optional(number, 3600)<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |

## Outputs

@@ -62,15 +62,15 @@ Supported services:
| [google_monitoring_alert_policy.cloud_sql_disk_utilization](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.cloud_sql_memory_utilization](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.kyverno_logmatch_alert](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
-| [google_monitoring_alert_policy.lite_llm_pod_restart](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
+| [google_monitoring_alert_policy.litellm_pod_restart](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.ssl_expiring_days](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.typesense_pod_restart](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |

## Modules

| Name | Source | Version |
|------|--------|---------|
-| <a name="module_lite_llm_uptime_checks"></a> [lite\_llm\_uptime\_checks](#module\_lite\_llm\_uptime\_checks) | github.com/sparkfabrik/terraform-sparkfabrik-gcp-http-monitoring | 1.0.0 |
+| <a name="module_litellm_uptime_checks"></a> [litellm\_uptime\_checks](#module\_litellm\_uptime\_checks) | github.com/sparkfabrik/terraform-sparkfabrik-gcp-http-monitoring | 1.0.0 |
| <a name="module_typesense_uptime_checks"></a> [typesense\_uptime\_checks](#module\_typesense\_uptime\_checks) | github.com/sparkfabrik/terraform-sparkfabrik-gcp-http-monitoring | 1.0.0 |

<!-- END_TF_DOCS -->

lite_llm.tf

Lines changed: 17 additions & 17 deletions
@@ -1,29 +1,29 @@
 locals {
-  lite_llm_project = var.lite_llm.project_id != null ? var.lite_llm.project_id : var.project_id
+  litellm_project = var.litellm.project_id != null ? var.litellm.project_id : var.project_id

-  lite_llm_notification_channels = var.lite_llm.notification_enabled ? (length(var.lite_llm.notification_channels) > 0 ? var.lite_llm.notification_channels : var.notification_channels) : []
+  litellm_notification_channels = var.litellm.notification_enabled ? (length(var.litellm.notification_channels) > 0 ? var.litellm.notification_channels : var.notification_channels) : []

-  lite_llm_uptime_checks = var.lite_llm.enabled ? {
-    for app_name, config in var.lite_llm.apps :
+  litellm_uptime_checks = var.litellm.enabled ? {
+    for app_name, config in var.litellm.apps :
     app_name => config.uptime_check
     if config.uptime_check != null && try(config.uptime_check.enabled, false)
   } : {}

-  lite_llm_container_checks = var.lite_llm.enabled ? {
-    for app_name, config in var.lite_llm.apps :
+  litellm_container_checks = var.litellm.enabled ? {
+    for app_name, config in var.litellm.apps :
     app_name => config.container_check
     if config.container_check != null && try(config.container_check.enabled, false)
   } : {}
 }

-module "lite_llm_uptime_checks" {
-  for_each = local.lite_llm_uptime_checks
+module "litellm_uptime_checks" {
+  for_each = local.litellm_uptime_checks

   source = "github.com/sparkfabrik/terraform-sparkfabrik-gcp-http-monitoring?ref=1.0.0"
-  gcp_project = local.lite_llm_project
+  gcp_project = local.litellm_project
   uptime_monitoring_host = each.value.host
   uptime_monitoring_path = each.value.path
-  alert_notification_channels = local.lite_llm_notification_channels
+  alert_notification_channels = local.litellm_notification_channels
   alert_threshold_value = 1
   uptime_check_period = "900s"
 }
@@ -32,11 +32,11 @@ module "lite_llm_uptime_checks" {
 # This alert monitors the restart count of LiteLLM containers in GKE.
 # It triggers when the delta of restarts is greater than the threshold
 # within the specified alignment period.
-resource "google_monitoring_alert_policy" "lite_llm_pod_restart" {
-  for_each = local.lite_llm_container_checks
+resource "google_monitoring_alert_policy" "litellm_pod_restart" {
+  for_each = local.litellm_container_checks

-  project = local.lite_llm_project
-  display_name = "LiteLLM Pod Restarts (cluster=${var.lite_llm.cluster_name}, namespace=${each.value.namespace}, app=${each.key})"
+  project = local.litellm_project
+  display_name = "LiteLLM Pod Restarts (cluster=${var.litellm.cluster_name}, namespace=${each.value.namespace}, app=${each.key})"
   combiner = "OR"
   enabled = true

@@ -46,8 +46,8 @@ resource "google_monitoring_alert_policy" "lite_llm_pod_restart" {
     condition_threshold {
       filter = <<-EOT
         resource.type="k8s_container"
-        AND resource.labels.project_id="${local.lite_llm_project}"
-        AND resource.labels.cluster_name="${var.lite_llm.cluster_name}"
+        AND resource.labels.project_id="${local.litellm_project}"
+        AND resource.labels.cluster_name="${var.litellm.cluster_name}"
         AND resource.labels.namespace_name="${each.value.namespace}"
         AND metric.type="kubernetes.io/container/restart_count"
       EOT
@@ -71,7 +71,7 @@ resource "google_monitoring_alert_policy" "lite_llm_pod_restart" {
     }
   }

-  notification_channels = local.lite_llm_notification_channels
+  notification_channels = local.litellm_notification_channels

   alert_strategy {
     auto_close = "${each.value.pod_restart.auto_close_seconds}s"
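
Renaming a `resource` or `module` block changes its address in Terraform state, so a deployment that already holds objects under the old `lite_llm_*` addresses may see the alert policies and uptime checks planned for destruction and re-creation. A minimal sketch of one way to avoid that, assuming Terraform 1.1 or later; these `moved` blocks are an illustration and are not part of this commit:

```hcl
# Hypothetical addition, not shown in the commit: declare that the renamed
# blocks refer to the same objects, so every for_each instance is kept.
moved {
  from = google_monitoring_alert_policy.lite_llm_pod_restart
  to   = google_monitoring_alert_policy.litellm_pod_restart
}

moved {
  from = module.lite_llm_uptime_checks
  to   = module.litellm_uptime_checks
}
```

Whether this is needed depends on whether any state was created with the old names; a fresh deployment would not need it.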

variables.tf

Lines changed: 5 additions & 5 deletions
@@ -152,7 +152,7 @@ variable "typesense" {
   }
 }

-variable "lite_llm" {
+variable "litellm" {
   description = "Configuration for LiteLLM monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key)."
   default = {}
   type = object({
@@ -184,7 +184,7 @@ variable "lite_llm" {

   validation {
     condition = alltrue([
-      for app_name, config in var.lite_llm.apps : (
+      for app_name, config in var.litellm.apps : (
         trimspace(app_name) != "" &&
         (config.uptime_check != null ? try(trimspace(config.uptime_check.host), "") != "" : true) &&
         (config.container_check != null ? try(trimspace(config.container_check.namespace), "") != "" : true)
@@ -195,10 +195,10 @@ variable "lite_llm" {

   validation {
     condition = (
-      length([for app_name, config in var.lite_llm.apps : app_name if config.container_check != null]) == 0 ||
-      try(trimspace(var.lite_llm.cluster_name), "") != ""
+      length([for app_name, config in var.litellm.apps : app_name if config.container_check != null]) == 0 ||
+      try(trimspace(var.litellm.cluster_name), "") != ""
     )
-    error_message = "When any app has container_check configured, 'cluster_name' must be provided at the lite_llm level."
+    error_message = "When any app has container_check configured, 'cluster_name' must be provided at the litellm level."
   }
 }
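
For callers, the rename means the input is now set as `litellm` instead of `lite_llm`. A sketch of what a `.tfvars` value could look like under the type and validations shown above; the app name, host, namespace, and cluster name are hypothetical placeholders:

```hcl
# Illustrative values only; every name below is a placeholder.
litellm = {
  enabled = true

  # Required by the second validation because an app sets container_check.
  cluster_name = "my-gke-cluster"

  apps = {
    # The map key is the app name and must not be blank.
    "litellm-proxy" = {
      uptime_check = {
        host = "litellm.example.com"
        # path is omitted, so it falls back to "/health/readiness".
      }

      container_check = {
        namespace = "litellm"

        pod_restart = {
          threshold = 2 # other fields keep their declared defaults
        }
      }
    }
  }
}
```

An app may set only `uptime_check` or only `container_check`; both default to `null`, and the locals in `lite_llm.tf` skip whichever is absent.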
