Terraform GCP Services Monitoring Module

This module creates Cloud Monitoring alert policies and HTTP uptime checks for Google Cloud Platform services and for workloads running on GKE.

Supported services:

  • Cloud SQL

    • CPU usage
    • Storage usage
    • Memory usage
  • Kyverno

    • Error logs for admission-controller, background-controller, cleanup-controller, reports-controller
  • cert-manager

    • Error logs for cert-manager controller when an Issuer or ClusterIssuer is missing
  • Konnectivity agent

    • Alert when no pods are available for the konnectivity-agent deployment
  • SSL certificate expiration

    • SSL certificate expiry alerts for monitored endpoints
  • Typesense

    • Uptime checks for HTTP endpoints
    • Pod restart alerts for Typesense containers
  • LiteLLM

    • Uptime checks for HTTP endpoints
    • Pod restart alerts for LiteLLM containers
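A minimal sketch of calling the module, assuming it is available at a local path. The `source` path, the module label, the project ID, and the notification channel ID are hypothetical placeholders; only the input names `project_id` and `notification_channels` come from this README.

```hcl
module "gcp_services_monitoring" {
  # Hypothetical path: replace with the real location of this module.
  source = "./modules/gcp-services-monitoring"

  # Placeholder project ID (the only required input).
  project_id = "my-gcp-project"

  # Placeholder channel ID, written in the usual Cloud Monitoring
  # resource-name form.
  notification_channels = [
    "projects/my-gcp-project/notificationChannels/1234567890",
  ]
}
```

With no other inputs set, every service block takes its declared defaults: `cloud_sql`, `kyverno`, `cert_manager`, and `konnectivity_agent` default to `enabled = true`, while `ssl_alert`, `typesense`, and `litellm` default to `enabled = false`.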

Providers

| Name | Version |
| --- | --- |
| google | >= 5.10 |

Requirements

| Name | Version |
| --- | --- |
| terraform | >= 1.5 |
| google | >= 5.10 |
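The version requirements above could be pinned in the calling configuration roughly as follows. This is a sketch that assumes the `google` provider is the official `hashicorp/google` provider from the public registry.

```hcl
terraform {
  # Matches the Terraform version listed under Requirements.
  required_version = ">= 1.5"

  required_providers {
    google = {
      # Assumption: the official Google provider from the public registry.
      source  = "hashicorp/google"
      version = ">= 5.10"
    }
  }
}
```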

Inputs

Name Description Type Default Required
cert_manager Configuration for cert-manager missing issuer log alert. Allows customization of project, cluster, namespace, notification channels, alert documentation, enablement, extra filters, auto-close timing, and notification rate limiting.
object({
  enabled                          = optional(bool, true)
  cluster_name                     = optional(string, null)
  project_id                       = optional(string, null)
  namespace                        = optional(string, "cert-manager")
  notification_enabled             = optional(bool, true)
  notification_channels            = optional(list(string), [])
  logmatch_notification_rate_limit = optional(string, "300s")
  alert_documentation              = optional(string, null)
  auto_close_seconds               = optional(number, 3600)
  filter_extra                     = optional(string, "")
})
{} no
cloud_sql Configuration for Cloud SQL monitoring alerts. Supports customization of project, auto-close timing, notification channels, and per-instance alert thresholds for CPU, memory, and disk utilization.
object({
  enabled               = optional(bool, true)
  project_id            = optional(string, null)
  auto_close            = optional(string, "86400s") # default 24h
  notification_enabled  = optional(bool, true)
  notification_channels = optional(list(string), [])
  instances = optional(map(object({
    cpu_utilization = optional(list(object({
      severity         = optional(string, "WARNING"),
      threshold        = optional(number, 0.90)
      alignment_period = optional(string, "120s")
      duration         = optional(string, "300s")
    })), [
      {
        threshold = 0.85,
        duration  = "1200s",
      },
      {
        severity         = "CRITICAL",
        threshold        = 1,
        duration         = "300s",
        alignment_period = "60s",
      }
    ])
    memory_utilization = optional(list(object({
      severity         = optional(string, "WARNING"),
      threshold        = optional(number, 0.90)
      alignment_period = optional(string, "300s")
      duration         = optional(string, "300s")
    })), [
      {
        severity = "WARNING",
      },
      {
        severity  = "CRITICAL",
        threshold = 0.95,
      }
    ])
    disk_utilization = optional(list(object({
      severity         = optional(string, "WARNING"),
      threshold        = optional(number, 0.85)
      alignment_period = optional(string, "300s")
      duration         = optional(string, "600s")
    })), [
      {
        severity = "WARNING",
      },
      {
        severity  = "CRITICAL",
        threshold = 0.95,
      }
    ])
  })), {})
})
{} no
konnectivity_agent Configuration for Konnectivity agent deployment replica alert in GKE. Triggers when there are no available replicas.
object({
  enabled               = optional(bool, true)
  cluster_name          = optional(string, null)
  project_id            = optional(string, null)
  namespace             = optional(string, "kube-system")
  deployment_name       = optional(string, "konnectivity-agent")
  duration_seconds      = optional(number, 60)
  auto_close_seconds    = optional(number, 3600)
  notification_enabled  = optional(bool, true)
  notification_channels = optional(list(string), [])
  notification_prompts  = optional(list(string), null)
})
{} no
kyverno Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, extra filters, and namespace.
object({
  enabled               = optional(bool, true)
  cluster_name          = optional(string, null)
  project_id            = optional(string, null)
  notification_enabled  = optional(bool, true)
  notification_channels = optional(list(string), [])
  # Rate limit for notifications, e.g. "300s" for 5 minutes, used only for log match alerts
  logmatch_notification_rate_limit = optional(string, "300s")
  alert_documentation              = optional(string, null)
  auto_close_seconds               = optional(number, 3600)
  filter_extra                     = optional(string, "")
  namespace                        = optional(string, "kyverno")
})
{} no
litellm Configuration for LiteLLM monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key).
object({
  enabled               = optional(bool, false)
  project_id            = optional(string, null)
  notification_enabled  = optional(bool, true)
  notification_channels = optional(list(string), [])
  cluster_name          = optional(string, null)

  apps = optional(map(object({
    uptime_check = optional(object({
      enabled = optional(bool, true)
      host    = string
      path    = optional(string, "/health/readiness")
    }), null)

    container_check = optional(object({
      enabled   = optional(bool, true)
      namespace = string
      pod_restart = optional(object({
        threshold            = optional(number, 0)
        alignment_period     = optional(number, 60)
        duration             = optional(number, 180)
        auto_close_seconds   = optional(number, 3600)
        notification_prompts = optional(list(string), null)
      }), {})
    }), null)
  })), {})
})
{} no
notification_channels | List of notification channel IDs to notify when an alert is triggered | list(string) | [] | no
project_id | The Google Cloud project ID where the monitoring alert policies will be created | string | n/a | yes
ssl_alert Configuration for SSL certificate expiration alerts. Allows customization of project, notification channels, alert thresholds, and user labels.
object({
  enabled               = optional(bool, false)
  project_id            = optional(string, null)
  notification_enabled  = optional(bool, true)
  notification_channels = optional(list(string), [])
  threshold_days        = optional(list(number), [15, 7])
  user_labels           = optional(map(string), {})
})
{} no
typesense Configuration for Typesense monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key).
object({
  enabled               = optional(bool, false)
  project_id            = optional(string, null)
  notification_enabled  = optional(bool, true)
  notification_channels = optional(list(string), [])
  cluster_name          = optional(string, null)

  apps = optional(map(object({
    uptime_check = optional(object({
      enabled = optional(bool, true)
      host    = string
      path    = optional(string, "/readyz")
    }), null)

    container_check = optional(object({
      enabled   = optional(bool, true)
      namespace = string
      pod_restart = optional(object({
        threshold            = optional(number, 0)
        alignment_period     = optional(number, 60)
        duration             = optional(number, 180)
        auto_close_seconds   = optional(number, 3600)
        notification_prompts = optional(list(string), null)
      }), {})
    }), null)
  })), {})
})
{} no
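
As an illustration of the `cloud_sql` and `ssl_alert` inputs, the sketch below overrides CPU thresholds for one instance and turns on certificate expiry alerts. The module `source`, project ID, instance names, and label values are hypothetical. It also assumes that each key of `instances` is a Cloud SQL instance name and that `threshold_days` means days remaining before expiry; this README does not state either.

```hcl
module "gcp_services_monitoring" {
  source     = "./modules/gcp-services-monitoring" # hypothetical path
  project_id = "my-gcp-project"                    # placeholder

  cloud_sql = {
    instances = {
      # An empty object keeps the default CPU, memory, and disk alerts.
      "orders-db" = {}

      # Setting cpu_utilization replaces the whole default list, so both
      # severities are listed again here. Omitted attributes fall back to
      # the per-attribute defaults in the type definition.
      "analytics-db" = {
        cpu_utilization = [
          { severity = "WARNING", threshold = 0.75, duration = "600s" },
          { severity = "CRITICAL", threshold = 0.95 },
        ]
      }
    }
  }

  ssl_alert = {
    enabled        = true # defaults to false
    threshold_days = [30, 14, 7]
    user_labels    = { team = "platform" } # hypothetical label
  }
}
```

Because Terraform applies an `optional(...)` default only when the attribute is omitted or null, the memory and disk alerts for `analytics-db` keep their defaults in this sketch.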

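The `typesense` and `litellm` inputs share the same shape, so one sketch covers both. The module `source`, project ID, cluster name, app keys, hosts, and namespaces are hypothetical values chosen for illustration.

```hcl
module "gcp_services_monitoring" {
  source     = "./modules/gcp-services-monitoring" # hypothetical path
  project_id = "my-gcp-project"                    # placeholder

  typesense = {
    enabled      = true             # defaults to false
    cluster_name = "my-gke-cluster" # hypothetical cluster

    apps = {
      # The map key identifies the app.
      "search" = {
        # host is required; path defaults to "/readyz".
        uptime_check = { host = "search.example.com" }

        # namespace is required; pod_restart attributes not set here
        # keep their defaults.
        container_check = {
          namespace   = "typesense"
          pod_restart = { threshold = 2, duration = 300 }
        }
      }
    }
  }

  litellm = {
    enabled      = true
    cluster_name = "my-gke-cluster"

    apps = {
      # Uptime check only: container_check defaults to null.
      "gateway" = {
        uptime_check = { host = "llm.example.com" }
      }
    }
  }
}
```
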
Outputs

| Name | Description |
| --- | --- |
| cloud_sql_cpu_utilization | n/a |
| cloud_sql_disk_utilization | n/a |
| cloud_sql_memory_utilization | n/a |
| ssl_alert_policy_names | n/a |
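
The outputs have no descriptions, so their exact shape is not documented here. A cautious sketch is to re-export them from the calling configuration and inspect the values with `terraform output`. The module label `gcp_services_monitoring` and the output names on the left are hypothetical.

```hcl
# Assumes a module block labeled "gcp_services_monitoring" exists in the
# same configuration.
output "cloud_sql_cpu_alerts" {
  value = module.gcp_services_monitoring.cloud_sql_cpu_utilization
}

output "ssl_alert_policies" {
  value = module.gcp_services_monitoring.ssl_alert_policy_names
}
```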

Resources

| Name | Type |
| --- | --- |
| google_monitoring_alert_policy.cert_manager_logmatch_alert | resource |
| google_monitoring_alert_policy.cloud_sql_cpu_utilization | resource |
| google_monitoring_alert_policy.cloud_sql_disk_utilization | resource |
| google_monitoring_alert_policy.cloud_sql_memory_utilization | resource |
| google_monitoring_alert_policy.konnectivity_agent_replicas | resource |
| google_monitoring_alert_policy.kyverno_logmatch_alert | resource |
| google_monitoring_alert_policy.litellm_pod_restart | resource |
| google_monitoring_alert_policy.ssl_expiring_days | resource |
| google_monitoring_alert_policy.typesense_pod_restart | resource |

Modules

| Name | Source | Version |
| --- | --- | --- |
| litellm_uptime_checks | ./modules/http_monitoring | n/a |
| typesense_uptime_checks | ./modules/http_monitoring | n/a |