14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -17,6 +17,20 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
- Adjust Kyverno log filter to reduce false positives from normal transient errors such as `i/o timeout` and `failed to acquire lease`, including removal of the explicit `failed to acquire lease` condition.
- Rename error pattern `list resources failed` to `failed to list resources` for consistency with other error patterns.

### Added

- Add `error_patterns_exclude` to Kyverno configuration to allow excluding specific error patterns from the default set.
- Add `error_patterns_include` to Kyverno configuration to allow adding custom error patterns to the default set.
- Add validation for `error_patterns_exclude` to ensure only valid default patterns can be excluded.
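
The validation mentioned above could look roughly like the following. This is a minimal sketch, not the module's actual `variables.tf`: the object type is trimmed to the two new attributes, the list of allowed patterns is abbreviated, and the error message wording is an assumption.

```hcl
# Hypothetical, abbreviated sketch. The real variable has more attributes
# and the full list of default patterns.
variable "kyverno" {
  type = object({
    error_patterns_exclude = optional(list(string), [])
    error_patterns_include = optional(list(string), [])
  })
  default = {}

  validation {
    # Every excluded entry must be one of the known default patterns.
    condition = alltrue([
      for pattern in var.kyverno.error_patterns_exclude :
      contains(["internal error", "failed calling webhook", "timeout", "panic"], pattern)
    ])
    error_message = "Each entry in error_patterns_exclude must exactly match one of the default error patterns."
  }
}
```

With a check of this shape, a typo such as `"time out"` in `error_patterns_exclude` would fail at plan time instead of being silently ignored.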

### Breaking change

- The `filter_extra` variable has been removed and replaced with `error_patterns_include` and `error_patterns_exclude`. To migrate:
  - If you were using `filter_extra` to add custom error patterns for `jsonPayload.error` matching, use `error_patterns_include` instead.
  - If you need to exclude specific default error patterns, use `error_patterns_exclude`.
  - **Note:** The new options only support error pattern matching against `jsonPayload.error`. If you were using `filter_extra` for arbitrary log filter conditions (e.g., negative filters like `-textPayload:"..."`), this functionality is no longer available.
  - See [examples/main.tf](examples/main.tf) for usage examples.
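
As a rough migration sketch, a configuration that previously set `filter_extra` might end up looking like the block below. The module label, the `source` value, and the project ID are placeholders, and any other inputs the module needs are omitted.

```hcl
module "services_monitoring" {
  # Placeholder source: point this at however you already consume the module.
  source     = "github.com/sparkfabrik/terraform-google-services-monitoring"
  project_id = "my-project" # hypothetical project ID

  kyverno = {
    cluster_name = "test-cluster"

    # Removed in this release:
    # filter_extra = "..."

    # Extra regex patterns, matched against jsonPayload.error in addition to the defaults.
    error_patterns_include = ["my custom.*error"]

    # Default patterns to drop from the alert filter.
    error_patterns_exclude = ["failed to start watcher"]
  }
}
```

Each remaining pattern is rendered as a case-insensitive clause of the form `jsonPayload.error=~"(?i)my custom.*error"`, and the clauses are joined with `OR` inside the log filter.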

## [0.12.0] - 2026-01-28

[Compare with previous version](https://github.com/sparkfabrik/terraform-google-services-monitoring/compare/0.11.0...0.12.0)
2 changes: 1 addition & 1 deletion README.md
@@ -56,7 +56,7 @@ Supported services:
| <a name="input_cert_manager"></a> [cert\_manager](#input\_cert\_manager) | Configuration for cert-manager missing issuer log alert. Allows customization of project, cluster, namespace, notification channels, alert documentation, enablement, extra filters, auto-close timing, and notification rate limiting. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = optional(string, null)<br/> project_id = optional(string, null)<br/> namespace = optional(string, "cert-manager")<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> filter_extra = optional(string, "")<br/> })</pre> | `{}` | no |
| <a name="input_cloud_sql"></a> [cloud\_sql](#input\_cloud\_sql) | Configuration for Cloud SQL monitoring alerts. Supports customization of project, auto-close timing, notification channels, and per-instance alert thresholds for CPU, memory, and disk utilization. | <pre>object({<br/> enabled = optional(bool, true)<br/> project_id = optional(string, null)<br/> auto_close = optional(string, "86400s") # default 24h<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> instances = optional(map(object({<br/> cpu_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.90)<br/> alignment_period = optional(string, "120s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> threshold = 0.85,<br/> duration = "1200s",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 1,<br/> duration = "300s",<br/> alignment_period = "60s",<br/> }<br/> ])<br/> memory_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.90)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> severity = "WARNING",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.95,<br/> }<br/> ])<br/> disk_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.85)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "600s")<br/> })), [<br/> {<br/> severity = "WARNING",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.95,<br/> }<br/> ])<br/> })), {})<br/> })</pre> | `{}` | no |
| <a name="input_konnectivity_agent"></a> [konnectivity\_agent](#input\_konnectivity\_agent) | Configuration for Konnectivity agent deployment replica alert in GKE. Triggers when there are no available replicas. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = optional(string, null)<br/> project_id = optional(string, null)<br/> namespace = optional(string, "kube-system")<br/> deployment_name = optional(string, "konnectivity-agent")<br/> duration_seconds = optional(number, 60)<br/> auto_close_seconds = optional(number, 3600)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> notification_prompts = optional(list(string), null)<br/> })</pre> | `{}` | no |
| <a name="input_kyverno"></a> [kyverno](#input\_kyverno) | Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, extra filters, and namespace. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = optional(string, null)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> # Rate limit for notifications, e.g. "300s" for 5 minutes, used only for log match alerts<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> filter_extra = optional(string, "")<br/> namespace = optional(string, "kyverno")<br/> })</pre> | `{}` | no |
| <a name="input_kyverno"></a> [kyverno](#input\_kyverno) | Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, error pattern inclusions/exclusions for jsonPayload.error matching, and namespace. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = optional(string, null)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> # Rate limit for notifications, e.g. "300s" for 5 minutes, used only for log match alerts<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> namespace = optional(string, "kyverno")<br/> # List of error patterns to exclude from the default set.<br/> # Default patterns available for exclusion:<br/> # "internal error", "failed calling webhook", "timeout", "client-side throttling",<br/> # "failed to run warmup", "schema not found", "failed to list resources",<br/> # "failed to watch resource", "context deadline exceeded", "is forbidden",<br/> # "cannot list resource", "cannot watch resource", "RBAC.*denied",<br/> # "failed to start watcher", "leader election lost", "unable to update .*WebhookConfiguration",<br/> # "failed to sync", "dropping request", "failed to load certificate",<br/> # "failed to update lock", "the object has been modified", "no matches for kind",<br/> # "the server could not find the requested resource", "Too Many Requests", "x509",<br/> # "is invalid:", "connection refused", "no agent available", "fatal error", "panic"<br/> error_patterns_exclude = optional(list(string), [])<br/> # List of additional regex error patterns to include (added to default set)<br/> # e.g. ["my custom.*error", "failed to connect.*database"]<br/> error_patterns_include = optional(list(string), [])<br/> })</pre> | `{}` | no |
| <a name="input_litellm"></a> [litellm](#input\_litellm) | Configuration for LiteLLM monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null)<br/><br/> apps = optional(map(object({<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/health/readiness")<br/> }), null)<br/><br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 180)<br/> auto_close_seconds = optional(number, 3600)<br/> notification_prompts = optional(list(string), null)<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |
| <a name="input_notification_channels"></a> [notification\_channels](#input\_notification\_channels) | List of notification channel IDs to notify when an alert is triggered | `list(string)` | `[]` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | The Google Cloud project ID where logging exclusions will be created | `string` | n/a | yes |
2 changes: 1 addition & 1 deletion cert_manager.tf
@@ -8,7 +8,7 @@ locals {
EOT
)
cert_manager_notification_channels = var.cert_manager.notification_enabled ? (length(var.cert_manager.notification_channels) > 0 ? var.cert_manager.notification_channels : var.notification_channels) : []
cert_manager_cluster_name = var.cert_manager.cluster_name != null ? trimspace(var.cert_manager.cluster_name) : ""
cert_manager_cluster_name = var.cert_manager.cluster_name != null ? trimspace(var.cert_manager.cluster_name) : ""

cert_manager_log_filter = local.cert_manager_cluster_name != "" ? (<<-EOT
(
15 changes: 12 additions & 3 deletions examples/main.tf
@@ -50,9 +50,18 @@ module "example" {
kyverno = {
cluster_name = "test-cluster"
notification_channels = []
# Optional filter for log entries, exclude known non-actionable messages
# e.g., "-textPayload:\"stale GroupVersion discovery: metrics.k8s.io/v1beta1\""
filter_extra = "-textPayload:\"stale GroupVersion discovery: metrics.k8s.io/v1beta1\""
# Exclude specific error patterns from the default set (only affects jsonPayload.error matching)
error_patterns_exclude = [
"failed to start watcher",
"failed to list resources",
]
# Add custom regex error patterns to the default set (matched against jsonPayload.error)
# Note: These options only support error pattern matching. Arbitrary log filter conditions
# (e.g., negative filters like -textPayload:"...") are not supported.
# error_patterns_include = [
# "my custom.*error",
# "failed to connect.*database",
# ]
}
cert_manager = {
cluster_name = "test-cluster"
86 changes: 54 additions & 32 deletions kyverno.tf
@@ -5,7 +5,58 @@ locals {

kyverno_cluster_name = var.kyverno.cluster_name != null ? trimspace(var.kyverno.cluster_name) : ""

kyverno_log_filter = local.kyverno_cluster_name != "" ? (<<-EOT
# Default error patterns for Kyverno log matching
kyverno_default_error_patterns = [
"internal error",
"failed calling webhook",
"timeout",
"client-side throttling",
"failed to run warmup",
"schema not found",
"failed to list resources",
"failed to watch resource",
"context deadline exceeded",
"is forbidden",
"cannot list resource",
"cannot watch resource",
"RBAC.*denied",
"failed to start watcher",
"leader election lost",
"unable to update .*WebhookConfiguration",
"failed to sync",
"dropping request",
"failed to load certificate",
"failed to update lock",
"the object has been modified",
"no matches for kind",
"the server could not find the requested resource",
"Too Many Requests",
"x509",
"is invalid:",
"connection refused",
"no agent available",
"fatal error",
"panic",
]

# Combine default patterns with included patterns, then filter out excluded ones
kyverno_all_error_patterns = distinct(concat(
local.kyverno_default_error_patterns,
var.kyverno.error_patterns_include
))

kyverno_active_error_patterns = [
for pattern in local.kyverno_all_error_patterns :
pattern if !contains(var.kyverno.error_patterns_exclude, pattern)
]

# Build the error patterns filter string
kyverno_error_patterns_filter = length(local.kyverno_active_error_patterns) > 0 ? join("\n OR ", [
for pattern in local.kyverno_active_error_patterns :
"jsonPayload.error=~\"(?i)${pattern}\""
]) : ""

kyverno_log_filter = local.kyverno_cluster_name != "" && length(local.kyverno_active_error_patterns) > 0 ? (<<-EOT
resource.type="k8s_container"
AND resource.labels.project_id="${local.kyverno_project_id}"
AND resource.labels.cluster_name="${local.kyverno_cluster_name}"
@@ -15,38 +66,8 @@ locals {
OR resource.labels.pod_name=~"kyverno-(admission|background|cleanup|reports)-controller-.*"
)
AND (
jsonPayload.error=~"(?i)internal error"
OR jsonPayload.error=~"(?i)failed calling webhook"
OR jsonPayload.error=~"(?i)timeout"
OR jsonPayload.error=~"(?i)client-side throttling"
OR jsonPayload.error=~"(?i)failed to run warmup"
OR jsonPayload.error=~"(?i)schema not found"
OR jsonPayload.error=~"(?i)failed to list resources"
OR jsonPayload.error=~"(?i)failed to watch resource"
OR jsonPayload.error=~"(?i)context deadline exceeded"
OR jsonPayload.error=~"(?i)is forbidden"
OR jsonPayload.error=~"(?i)cannot list resource"
OR jsonPayload.error=~"(?i)cannot watch resource"
OR jsonPayload.error=~"(?i)RBAC.*denied"
OR jsonPayload.error=~"(?i)failed to start watcher"
OR jsonPayload.error=~"(?i)leader election lost"
OR jsonPayload.error=~"(?i)unable to update .*WebhookConfiguration"
OR jsonPayload.error=~"(?i)failed to sync"
OR jsonPayload.error=~"(?i)dropping request"
OR jsonPayload.error=~"(?i)failed to load certificate"
OR jsonPayload.error=~"(?i)failed to update lock"
OR jsonPayload.error=~"(?i)the object has been modified"
OR jsonPayload.error=~"(?i)no matches for kind"
OR jsonPayload.error=~"(?i)the server could not find the requested resource"
OR jsonPayload.error=~"(?i)Too Many Requests"
OR jsonPayload.error=~"(?i)x509"
OR jsonPayload.error=~"(?i)is invalid:"
OR jsonPayload.error=~"(?i)connection refused"
OR jsonPayload.error=~"(?i)no agent available"
OR jsonPayload.error=~"(?i)fatal error"
OR jsonPayload.error=~"(?i)panic"
${local.kyverno_error_patterns_filter}
)
${trimspace(var.kyverno.filter_extra)}
EOT
) : ""
}
@@ -55,6 +76,7 @@ resource "google_monitoring_alert_policy" "kyverno_logmatch_alert" {
count = (
var.kyverno.enabled
&& local.kyverno_cluster_name != ""
&& length(local.kyverno_active_error_patterns) > 0
) ? 1 : 0

display_name = "Kyverno controllers ERROR logs (namespace=${var.kyverno.namespace})"
6 changes: 3 additions & 3 deletions modules/http_monitoring/main.tf
@@ -1,7 +1,7 @@
locals {
suffix = var.uptime_monitoring_path != "/" ? var.uptime_monitoring_path : ""
uptime_monitoring_display_name = var.uptime_monitoring_display_name != "" ? "${var.uptime_monitoring_display_name} - ${var.uptime_monitoring_host}${local.suffix}" : "${var.uptime_monitoring_host}${local.suffix}"
alert_display_name = var.alert_display_name != "" ? var.alert_display_name : "Failure of uptime check for: ${local.uptime_monitoring_display_name}"
suffix = var.uptime_monitoring_path != "/" ? var.uptime_monitoring_path : ""
uptime_monitoring_display_name = var.uptime_monitoring_display_name != "" ? "${var.uptime_monitoring_display_name} - ${var.uptime_monitoring_host}${local.suffix}" : "${var.uptime_monitoring_host}${local.suffix}"
alert_display_name = var.alert_display_name != "" ? var.alert_display_name : "Failure of uptime check for: ${local.uptime_monitoring_display_name}"
}

resource "google_monitoring_uptime_check_config" "https_uptime" {