Commit 76ba710

refs platform/board#4051: add LiteLLM monitoring (#12)
1 parent 4b7f935 commit 76ba710

5 files changed, +145 -6 lines changed

CHANGELOG.md

Lines changed: 9 additions & 1 deletion
@@ -8,7 +8,15 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

 ## [Unreleased]

-## [0.7.0] - 2025-12-12
+## [0.8.0] - 2025-12-12
+
+[Compare with previous version](https://github.com/sparkfabrik/terraform-google-services-monitoring/compare/0.7.0...0.8.0)
+
+### Added
+
+- refs platform/board#4051: add LiteLLM monitoring
+
+## [0.7.0] - 2025-12-11

 [Compare with previous version](https://github.com/sparkfabrik/terraform-google-services-monitoring/compare/0.6.0...0.7.0)

README.md

Lines changed: 5 additions & 2 deletions
@@ -38,10 +38,11 @@ Supported services:
 | <a name="input_cert_manager"></a> [cert\_manager](#input\_cert\_manager) | Configuration for cert-manager missing issuer log alert. Allows customization of project, cluster, namespace, notification channels, alert documentation, enablement, extra filters, auto-close timing, and notification rate limiting. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = string<br/> project_id = optional(string, null)<br/> namespace = optional(string, "cert-manager")<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> filter_extra = optional(string, "")<br/> })</pre> | n/a | yes |
 | <a name="input_cloud_sql"></a> [cloud\_sql](#input\_cloud\_sql) | Configuration for Cloud SQL monitoring alerts. Supports customization of project, auto-close timing, notification channels, and per-instance alert thresholds for CPU, memory, and disk utilization. | <pre>object({<br/> project_id = optional(string, null)<br/> auto_close = optional(string, "86400s") # default 24h<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> instances = optional(map(object({<br/> cpu_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.90)<br/> alignment_period = optional(string, "120s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> threshold = 0.85,<br/> duration = "1200s",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 1,<br/> duration = "300s",<br/> alignment_period = "60s",<br/> }<br/> ])<br/> memory_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.90)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> severity = "WARNING",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.95,<br/> }<br/> ])<br/> disk_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.85)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "600s")<br/> })), [<br/> {<br/> severity = "WARNING",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.95,<br/> }<br/> ])<br/> })), {})<br/> })</pre> | n/a | yes |
 | <a name="input_kyverno"></a> [kyverno](#input\_kyverno) | Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, extra filters, and namespace. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = string<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> # Rate limit for notifications, e.g. "300s" for 5 minutes, used only for log match alerts<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> filter_extra = optional(string, "")<br/> namespace = optional(string, "kyverno")<br/> })</pre> | n/a | yes |
+| <a name="input_litellm"></a> [litellm](#input\_litellm) | Configuration for LiteLLM monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null)<br/><br/> apps = optional(map(object({<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/health/readiness")<br/> }), null)<br/><br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 0)<br/> auto_close_seconds = optional(number, 3600)<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |
 | <a name="input_notification_channels"></a> [notification\_channels](#input\_notification\_channels) | List of notification channel IDs to notify when an alert is triggered | `list(string)` | `[]` | no |
 | <a name="input_project_id"></a> [project\_id](#input\_project\_id) | The Google Cloud project ID where logging exclusions will be created | `string` | n/a | yes |
-| <a name="input_ssl_alert"></a> [ssl\_alert](#input\_ssl\_alert) | Configuration for SSL certificate expiration alerts. Allows customization of project, notification channels, alert thresholds, and user labels. | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> threshold_days = optional(list(number), [15, 7])<br/> user_labels = optional(map(string), {})<br/> })</pre> | `{}` | no |
-| <a name="input_typesense"></a> [typesense](#input\_typesense) | Configuration for Typesense monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). For container checks, the app name corresponds to the Kubernetes 'app' label; for apps with only uptime checks, this correspondence does not apply. | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null) # GKE cluster name for container checks<br/><br/> # Apps configuration - map keyed by app_name<br/> apps = optional(map(object({<br/> # Uptime check configuration (optional)<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/readyz")<br/> }), null)<br/><br/> # Container check configuration for GKE (optional)<br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 0)<br/> auto_close_seconds = optional(number, 3600)<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |
+| <a name="input_ssl_alert"></a> [ssl\_alert](#input\_ssl\_alert) | Configuration for SSL certificate expiration alerts. Allows customization of project, notification channels, alert thresholds, and user labels. | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> threshold_days = optional(list(number), [15, 7])<br/> user_labels = optional(map(string), {})<br/> })</pre> | `{}` | no |
+| <a name="input_typesense"></a> [typesense](#input\_typesense) | Configuration for Typesense monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null) # GKE cluster name for container checks<br/><br/> # Apps configuration - map keyed by app_name<br/> apps = optional(map(object({<br/> # Uptime check configuration (optional)<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/readyz")<br/> }), null)<br/><br/> # Container check configuration for GKE (optional)<br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 0)<br/> auto_close_seconds = optional(number, 3600)<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |

 ## Outputs

@@ -61,13 +62,15 @@ Supported services:
 | [google_monitoring_alert_policy.cloud_sql_disk_utilization](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
 | [google_monitoring_alert_policy.cloud_sql_memory_utilization](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
 | [google_monitoring_alert_policy.kyverno_logmatch_alert](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
+| [google_monitoring_alert_policy.litellm_pod_restart](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
 | [google_monitoring_alert_policy.ssl_expiring_days](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
 | [google_monitoring_alert_policy.typesense_pod_restart](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |

 ## Modules

 | Name | Source | Version |
 |------|--------|---------|
+| <a name="module_litellm_uptime_checks"></a> [litellm\_uptime\_checks](#module\_litellm\_uptime\_checks) | github.com/sparkfabrik/terraform-sparkfabrik-gcp-http-monitoring | 1.0.0 |
 | <a name="module_typesense_uptime_checks"></a> [typesense\_uptime\_checks](#module\_typesense\_uptime\_checks) | github.com/sparkfabrik/terraform-sparkfabrik-gcp-http-monitoring | 1.0.0 |

 <!-- END_TF_DOCS -->

lite_llm.tf

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+locals {
+  litellm_project = var.litellm.project_id != null ? var.litellm.project_id : var.project_id
+
+  litellm_notification_channels = var.litellm.notification_enabled ? (length(var.litellm.notification_channels) > 0 ? var.litellm.notification_channels : var.notification_channels) : []
+
+  litellm_uptime_checks = var.litellm.enabled ? {
+    for app_name, config in var.litellm.apps :
+    app_name => config.uptime_check
+    if config.uptime_check != null && try(config.uptime_check.enabled, false)
+  } : {}
+
+  litellm_container_checks = var.litellm.enabled ? {
+    for app_name, config in var.litellm.apps :
+    app_name => config.container_check
+    if config.container_check != null && try(config.container_check.enabled, false)
+  } : {}
+}
+
+module "litellm_uptime_checks" {
+  for_each = local.litellm_uptime_checks
+
+  source = "github.com/sparkfabrik/terraform-sparkfabrik-gcp-http-monitoring?ref=1.0.0"
+  gcp_project = local.litellm_project
+  uptime_monitoring_host = each.value.host
+  uptime_monitoring_path = each.value.path
+  alert_notification_channels = local.litellm_notification_channels
+  alert_threshold_value = 1
+  uptime_check_period = "900s"
+}
+
+# Alert: GKE Pod Restarts
+# This alert monitors the restart count of LiteLLM containers in GKE.
+# It triggers when the delta of restarts is greater than the threshold
+# within the specified alignment period.
+resource "google_monitoring_alert_policy" "litellm_pod_restart" {
+  for_each = local.litellm_container_checks
+
+  project = local.litellm_project
+  display_name = "LiteLLM Pod Restarts (cluster=${var.litellm.cluster_name}, namespace=${each.value.namespace}, app=${each.key})"
+  combiner = "OR"
+  enabled = true
+
+  conditions {
+    display_name = "LiteLLM container restart count > ${each.value.pod_restart.threshold}"
+
+    condition_threshold {
+      filter = <<-EOT
+        resource.type="k8s_container"
+        AND resource.labels.project_id="${local.litellm_project}"
+        AND resource.labels.cluster_name="${var.litellm.cluster_name}"
+        AND resource.labels.namespace_name="${each.value.namespace}"
+        AND metric.type="kubernetes.io/container/restart_count"
+      EOT
+
+      comparison = "COMPARISON_GT"
+      threshold_value = each.value.pod_restart.threshold
+      duration = "${each.value.pod_restart.duration}s"
+
+      aggregations {
+        alignment_period = "${each.value.pod_restart.alignment_period}s"
+        per_series_aligner = "ALIGN_DELTA"
+        cross_series_reducer = "REDUCE_SUM"
+        group_by_fields = [
+          "metadata.user_labels.\"app.kubernetes.io/instance\"",
+        ]
+      }
+
+      trigger {
+        count = 1
+      }
+    }
+  }
+
+  notification_channels = local.litellm_notification_channels
+
+  alert_strategy {
+    auto_close = "${each.value.pod_restart.auto_close_seconds}s"
+  }
+}
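
Note on the `locals` block in lite_llm.tf: the two `for` expressions keep an app only when the whole feature is enabled, the check object is present, and the check's own `enabled` flag is true. The sketch below isolates that filter for the uptime case so it can be tried with `terraform plan` in an empty directory. It is an illustration under stated assumptions, not part of the commit: the variable is a trimmed copy of the `litellm` type, and the app names and hosts in `default` are hypothetical.

```hcl
# Standalone sketch of the app filter used in lite_llm.tf.
# Needs Terraform >= 1.3 for optional() attributes with defaults.
terraform {
  required_version = ">= 1.3.0"
}

# Trimmed copy of the litellm variable; the default values are hypothetical.
variable "litellm" {
  type = object({
    enabled = optional(bool, false)
    apps = optional(map(object({
      uptime_check = optional(object({
        enabled = optional(bool, true)
        host    = string
        path    = optional(string, "/health/readiness")
      }), null)
    })), {})
  })

  default = {
    enabled = true
    apps = {
      gateway = { uptime_check = { host = "llm.example.com" } }
      staging = { uptime_check = { host = "llm-staging.example.com", enabled = false } }
      worker  = {} # no uptime_check at all, so it defaults to null
    }
  }
}

locals {
  # Same expression as in the commit: try() guards the attribute access,
  # because both sides of && are evaluated even when uptime_check is null.
  litellm_uptime_checks = var.litellm.enabled ? {
    for app_name, config in var.litellm.apps :
    app_name => config.uptime_check
    if config.uptime_check != null && try(config.uptime_check.enabled, false)
  } : {}
}

output "selected_apps" {
  value = keys(local.litellm_uptime_checks)
}
```

With these sample values the output should contain only `gateway`: `staging` is dropped by its `enabled = false` flag and `worker` by the null check. Setting the top-level `enabled` to `false` should yield an empty list.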

typesense.tf

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
-
 locals {
   typesense_project = var.typesense.project_id != null ? var.typesense.project_id : var.project_id

variables.tf

Lines changed: 52 additions & 2 deletions
@@ -100,7 +100,7 @@ variable "cert_manager" {
 }

 variable "typesense" {
-  description = "Configuration for Typesense monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). For container checks, the app name corresponds to the Kubernetes 'app' label; for apps with only uptime checks, this correspondence does not apply."
+  description = "Configuration for Typesense monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key)."
   default = {}
   type = object({
     enabled = optional(bool, false)
@@ -152,6 +152,56 @@ variable "typesense" {
   }
 }

+variable "litellm" {
+  description = "Configuration for LiteLLM monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key)."
+  default = {}
+  type = object({
+    enabled = optional(bool, false)
+    project_id = optional(string, null)
+    notification_enabled = optional(bool, true)
+    notification_channels = optional(list(string), [])
+    cluster_name = optional(string, null)
+
+    apps = optional(map(object({
+      uptime_check = optional(object({
+        enabled = optional(bool, true)
+        host = string
+        path = optional(string, "/health/readiness")
+      }), null)
+
+      container_check = optional(object({
+        enabled = optional(bool, true)
+        namespace = string
+        pod_restart = optional(object({
+          threshold = optional(number, 0)
+          alignment_period = optional(number, 60)
+          duration = optional(number, 0)
+          auto_close_seconds = optional(number, 3600)
+        }), {})
+      }), null)
+    })), {})
+  })
+
+  validation {
+    condition = alltrue([
+      for app_name, config in var.litellm.apps : (
+        trimspace(app_name) != "" &&
+        (config.uptime_check != null ? try(trimspace(config.uptime_check.host), "") != "" : true) &&
+        (config.container_check != null ? try(trimspace(config.container_check.namespace), "") != "" : true)
+      )
+    ])
+    error_message = "Each app must have a non-empty name (map key). If uptime_check is provided, 'host' must be non-empty. If container_check is provided, 'namespace' must be non-empty."
+  }
+
+  validation {
+    condition = (
+      length([for app_name, config in var.litellm.apps : app_name if config.container_check != null]) == 0 ||
+      try(trimspace(var.litellm.cluster_name), "") != ""
+    )
+    error_message = "When any app has container_check configured, 'cluster_name' must be provided at the litellm level."
+  }
+}
+
 variable "ssl_alert" {
   description = "Configuration for SSL certificate expiration alerts. Allows customization of project, notification channels, alert thresholds, and user labels."
   default = {}
@@ -161,6 +211,6 @@ variable "ssl_alert" {
     notification_enabled = optional(bool, true)
     notification_channels = optional(list(string), [])
     threshold_days = optional(list(number), [15, 7])
-    user_labels = optional(map(string), {})
+    user_labels = optional(map(string), {})
   })
 }
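
Note on the new `litellm` variable: a caller might wire it up roughly as follows. This is a hedged sketch, not taken from the commit. The module label, project ID, channel ID, cluster name, namespace, host, and app key are placeholders, and the `ref=0.8.0` tag is assumed from the CHANGELOG compare link.

```hcl
# Hypothetical root-module usage; every value below is a placeholder.
module "services_monitoring" {
  source = "github.com/sparkfabrik/terraform-google-services-monitoring?ref=0.8.0"

  project_id            = "my-gcp-project"
  notification_channels = ["projects/my-gcp-project/notificationChannels/1234567890"]

  # Other inputs that the README marks as required (cert_manager, cloud_sql,
  # kyverno) are omitted for brevity and must be set in a real configuration.

  litellm = {
    enabled      = true
    cluster_name = "my-gke-cluster" # needed once any app sets container_check

    apps = {
      "litellm-proxy" = {
        uptime_check = {
          host = "llm.example.com" # path falls back to "/health/readiness"
        }
        container_check = {
          namespace = "litellm" # pod_restart falls back to its defaults
        }
      }
    }
  }
}
```

Two behaviours follow from the validation blocks in this diff. Removing `cluster_name` while any app still has a `container_check` should fail with the second validation's error message. That check tests `container_check != null` rather than the check's `enabled` flag, so it applies even to a container check that is present but disabled.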
