Merged
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,18 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.9.0] - 2025-12-15

[Compare with previous version](https://github.com/sparkfabrik/terraform-google-services-monitoring/compare/0.8.0...0.9.0)

### Added

- Add `notification_prompts` param for LiteLLM and Typesense

### Changed

- Modify the default values of the pod restart alerts `duration` and `alignment_period`

## [0.8.0] - 2025-12-12

[Compare with previous version](https://github.com/sparkfabrik/terraform-google-services-monitoring/compare/0.7.0...0.8.0)
4 changes: 2 additions & 2 deletions README.md
@@ -38,11 +38,11 @@ Supported services:
| <a name="input_cert_manager"></a> [cert\_manager](#input\_cert\_manager) | Configuration for cert-manager missing issuer log alert. Allows customization of project, cluster, namespace, notification channels, alert documentation, enablement, extra filters, auto-close timing, and notification rate limiting. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = string<br/> project_id = optional(string, null)<br/> namespace = optional(string, "cert-manager")<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> filter_extra = optional(string, "")<br/> })</pre> | n/a | yes |
| <a name="input_cloud_sql"></a> [cloud\_sql](#input\_cloud\_sql) | Configuration for Cloud SQL monitoring alerts. Supports customization of project, auto-close timing, notification channels, and per-instance alert thresholds for CPU, memory, and disk utilization. | <pre>object({<br/> project_id = optional(string, null)<br/> auto_close = optional(string, "86400s") # default 24h<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> instances = optional(map(object({<br/> cpu_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.90)<br/> alignment_period = optional(string, "120s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> threshold = 0.85,<br/> duration = "1200s",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 1,<br/> duration = "300s",<br/> alignment_period = "60s",<br/> }<br/> ])<br/> memory_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.90)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> severity = "WARNING",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.95,<br/> }<br/> ])<br/> disk_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.85)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "600s")<br/> })), [<br/> {<br/> severity = "WARNING",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.95,<br/> }<br/> ])<br/> })), {})<br/> })</pre> | n/a | yes |
| <a name="input_kyverno"></a> [kyverno](#input\_kyverno) | Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, extra filters, and namespace. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = string<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> # Rate limit for notifications, e.g. "300s" for 5 minutes, used only for log match alerts<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> filter_extra = optional(string, "")<br/> namespace = optional(string, "kyverno")<br/> })</pre> | n/a | yes |
| <a name="input_litellm"></a> [litellm](#input\_litellm) | Configuration for LiteLLM monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null)<br/><br/> apps = optional(map(object({<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/health/readiness")<br/> }), null)<br/><br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 0)<br/> auto_close_seconds = optional(number, 3600)<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |
| <a name="input_litellm"></a> [litellm](#input\_litellm) | Configuration for LiteLLM monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null)<br/><br/> apps = optional(map(object({<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/health/readiness")<br/> }), null)<br/><br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 180)<br/> auto_close_seconds = optional(number, 3600)<br/> notification_prompts = optional(list(string), ["OPENED", "CLOSED"])<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |
| <a name="input_notification_channels"></a> [notification\_channels](#input\_notification\_channels) | List of notification channel IDs to notify when an alert is triggered | `list(string)` | `[]` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | The Google Cloud project ID where logging exclusions will be created | `string` | n/a | yes |
| <a name="input_ssl_alert"></a> [ssl\_alert](#input\_ssl\_alert) | Configuration for SSL certificate expiration alerts. Allows customization of project, notification channels, alert thresholds, and user labels. | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> threshold_days = optional(list(number), [15, 7])<br/> user_labels = optional(map(string), {})<br/> })</pre> | `{}` | no |
| <a name="input_typesense"></a> [typesense](#input\_typesense) | Configuration for Typesense monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null) # GKE cluster name for container checks<br/><br/> # Apps configuration - map keyed by app_name<br/> apps = optional(map(object({<br/> # Uptime check configuration (optional)<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/readyz")<br/> }), null)<br/><br/> # Container check configuration for GKE (optional)<br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 0)<br/> auto_close_seconds = optional(number, 3600)<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |
| <a name="input_typesense"></a> [typesense](#input\_typesense) | Configuration for Typesense monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null)<br/><br/> apps = optional(map(object({<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/readyz")<br/> }), null)<br/><br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 180)<br/> auto_close_seconds = optional(number, 3600)<br/> notification_prompts = optional(list(string), ["OPENED", "CLOSED"])<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |

## Outputs

43 changes: 38 additions & 5 deletions examples/main.tf
@@ -42,15 +42,13 @@ locals {
}

module "example" {
source = "github.com/sparkfabrik/terraform-google-services-monitoring"
version = ">= 0.1.0"
source = "github.com/sparkfabrik/terraform-google-services-monitoring?ref=0.9.0"

notification_channels = var.notification_channels
project_id = var.project_id
cloud_sql = local.cloud_sql
kyverno = {
cluster_name = "test-cluster"
enabled = true
notification_channels = []
# Optional filter for log entries, exclude known non-actionable messages
# e.g., "-textPayload:\"stale GroupVersion discovery: metrics.k8s.io/v1beta1\""
@@ -59,7 +57,42 @@ module "example" {
cert_manager = {
cluster_name = "test-cluster"
namespace = "cert-manager"
enabled = true
notification_channels = []
}

typesense = {
cluster_name = "test-cluster"
apps = {
"typesense-app" = {
uptime_check = {
host = "typesense.example.com"
}
container_check = {
enabled = true
namespace = "typesense"
pod_restart = {
threshold = 1
}
}
}
}
}

litellm = {
cluster_name = "test-cluster"
apps = {
"litellm-app" = {
uptime_check = {
host = "litellm.example.com"
}
container_check = {
namespace = "litellm"
pod_restart = {
threshold = 2
duration = 300
notification_prompts = ["CLOSED"]
}
}
}
}
}
}
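
The new `duration` default (180 instead of 0) changes when pod restart alerts fire for any caller that never set the value. A minimal sketch of a caller that pins the previous value explicitly, assuming the variable shapes shown in the README table; the module label `services_monitoring`, the app key, and the namespace are hypothetical:

```hcl
variable "project_id" {
  type = string
}

# Hypothetical caller: keeps the pre-0.9.0 pod restart timing for one app.
module "services_monitoring" {
  source = "github.com/sparkfabrik/terraform-google-services-monitoring?ref=0.9.0"

  project_id   = var.project_id
  cloud_sql    = {}
  kyverno      = { cluster_name = "test-cluster" }
  cert_manager = { cluster_name = "test-cluster" }

  typesense = {
    enabled      = true # the README lists `false` as the default
    cluster_name = "test-cluster"
    apps = {
      "typesense-app" = {
        container_check = {
          namespace = "typesense"
          pod_restart = {
            duration = 0 # previous default; 0.9.0 uses 180 when omitted
          }
        }
      }
    }
  }
}
```

Pinning the module with `?ref=` and setting `duration` explicitly keeps the alert timing unchanged if a later release adjusts the default again.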
1 change: 1 addition & 0 deletions lite_llm.tf
@@ -75,5 +75,6 @@ resource "google_monitoring_alert_policy" "litellm_pod_restart" {

alert_strategy {
auto_close = "${each.value.pod_restart.auto_close_seconds}s"
notification_prompts = each.value.pod_restart.notification_prompts
}
}
1 change: 1 addition & 0 deletions typesense.tf
@@ -75,5 +75,6 @@ resource "google_monitoring_alert_policy" "typesense_pod_restart" {

alert_strategy {
auto_close = "${each.value.pod_restart.auto_close_seconds}s"
notification_prompts = each.value.pod_restart.notification_prompts
}
}
11 changes: 5 additions & 6 deletions variables.tf
@@ -107,26 +107,24 @@ variable "typesense" {
project_id = optional(string, null)
notification_enabled = optional(bool, true)
notification_channels = optional(list(string), [])
cluster_name = optional(string, null) # GKE cluster name for container checks
cluster_name = optional(string, null)

# Apps configuration - map keyed by app_name
apps = optional(map(object({
# Uptime check configuration (optional)
uptime_check = optional(object({
enabled = optional(bool, true)
host = string
path = optional(string, "/readyz")
}), null)

# Container check configuration for GKE (optional)
container_check = optional(object({
enabled = optional(bool, true)
namespace = string
pod_restart = optional(object({
threshold = optional(number, 0)
alignment_period = optional(number, 60)
duration = optional(number, 0)
duration = optional(number, 180)

Copilot AI, Dec 15, 2025

The PR title and description say the default `duration` changes from 0 to 120 seconds, but the code changes it to 180 seconds. Correct the inconsistency: either update the title and description to say 180 seconds, or change the code to use 120 seconds as stated.

Suggested change
duration = optional(number, 180)
duration = optional(number, 120)

auto_close_seconds = optional(number, 3600)
notification_prompts = optional(list(string), null)
}), {})
}), null)
})), {})
@@ -175,8 +173,9 @@ variable "litellm" {
pod_restart = optional(object({
threshold = optional(number, 0)
alignment_period = optional(number, 60)
duration = optional(number, 0)
duration = optional(number, 180)

Copilot AI, Dec 15, 2025

The PR title and description say the default `duration` changes from 0 to 120 seconds, but the code changes it to 180 seconds. Correct the inconsistency: either update the title and description to say 180 seconds, or change the code to use 120 seconds as stated.

Suggested change
duration = optional(number, 180)
duration = optional(number, 120)

auto_close_seconds = optional(number, 3600)
notification_prompts = optional(list(string), null)

Copilot AI, Dec 15, 2025

The README documents the default for `notification_prompts` as `["OPENED", "CLOSED"]`, but `variables.tf` defines it as `null`. The two should agree: if the default should be `["OPENED", "CLOSED"]`, update the variable definition; if it should be `null`, update the README.

Suggested change
notification_prompts = optional(list(string), null)
notification_prompts = optional(list(string), ["OPENED", "CLOSED"])

}), {})
}), null)
})), {})
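
The review comment above on `notification_prompts` turns on what the default means for callers. With `optional(list(string), null)`, an omitted attribute stays `null`, and Terraform treats a `null` resource argument as if it were not set, so the provider or API default applies rather than the `["OPENED", "CLOSED"]` value documented in the README. A standalone sketch of the difference, not taken from the module; the variable name `pod_restart` and the local `effective_prompts` are hypothetical:

```hcl
variable "pod_restart" {
  type = object({
    duration             = optional(number, 180)
    notification_prompts = optional(list(string), null)
  })
  default = {}
}

locals {
  # Explicit fallback: use the documented list only when the caller set nothing.
  effective_prompts = (
    var.pod_restart.notification_prompts == null
    ? ["OPENED", "CLOSED"]
    : var.pod_restart.notification_prompts
  )
}

output "effective_prompts" {
  value = local.effective_prompts
}
```

Putting the list directly in the `optional(...)` default, as the suggested change does, gives the same result with less code; the conditional form is only useful if `null` must remain distinguishable from an explicit list elsewhere in the configuration.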