Commit d749c60

feat(kyverno): enhance error pattern handling with inclusion/exclusion options
1 parent 7c109c7 commit d749c60

4 files changed, +83 -40 lines changed

README.md

Lines changed: 2 additions & 2 deletions
@@ -40,7 +40,7 @@ Supported services:
 
 | Name | Version |
 |------|---------|
-| <a name="provider_google"></a> [google](#provider\_google) | >= 5.10 |
+| <a name="provider_google"></a> [google](#provider\_google) | 7.15.0 |
 
 ## Requirements
 
@@ -56,7 +56,7 @@ Supported services:
| <a name="input_cert_manager"></a> [cert\_manager](#input\_cert\_manager) | Configuration for cert-manager missing issuer log alert. Allows customization of project, cluster, namespace, notification channels, alert documentation, enablement, extra filters, auto-close timing, and notification rate limiting. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = optional(string, null)<br/> project_id = optional(string, null)<br/> namespace = optional(string, "cert-manager")<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> filter_extra = optional(string, "")<br/> })</pre> | `{}` | no |
| <a name="input_cloud_sql"></a> [cloud\_sql](#input\_cloud\_sql) | Configuration for Cloud SQL monitoring alerts. Supports customization of project, auto-close timing, notification channels, and per-instance alert thresholds for CPU, memory, and disk utilization. | <pre>object({<br/> enabled = optional(bool, true)<br/> project_id = optional(string, null)<br/> auto_close = optional(string, "86400s") # default 24h<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> instances = optional(map(object({<br/> cpu_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.90)<br/> alignment_period = optional(string, "120s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> threshold = 0.85,<br/> duration = "1200s",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 1,<br/> duration = "300s",<br/> alignment_period = "60s",<br/> }<br/> ])<br/> memory_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.90)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "300s")<br/> })), [<br/> {<br/> severity = "WARNING",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.95,<br/> }<br/> ])<br/> disk_utilization = optional(list(object({<br/> severity = optional(string, "WARNING"),<br/> threshold = optional(number, 0.85)<br/> alignment_period = optional(string, "300s")<br/> duration = optional(string, "600s")<br/> })), [<br/> {<br/> severity = "WARNING",<br/> },<br/> {<br/> severity = "CRITICAL",<br/> threshold = 0.95,<br/> }<br/> ])<br/> })), {})<br/> })</pre> | `{}` | no |
| <a name="input_konnectivity_agent"></a> [konnectivity\_agent](#input\_konnectivity\_agent) | Configuration for Konnectivity agent deployment replica alert in GKE. Triggers when there are no available replicas. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = optional(string, null)<br/> project_id = optional(string, null)<br/> namespace = optional(string, "kube-system")<br/> deployment_name = optional(string, "konnectivity-agent")<br/> duration_seconds = optional(number, 60)<br/> auto_close_seconds = optional(number, 3600)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> notification_prompts = optional(list(string), null)<br/> })</pre> | `{}` | no |
-| <a name="input_kyverno"></a> [kyverno](#input\_kyverno) | Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, extra filters, and namespace. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = optional(string, null)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> # Rate limit for notifications, e.g. "300s" for 5 minutes, used only for log match alerts<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> filter_extra = optional(string, "")<br/> namespace = optional(string, "kyverno")<br/> })</pre> | `{}` | no |
+| <a name="input_kyverno"></a> [kyverno](#input\_kyverno) | Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, error pattern inclusions/exclusions, and namespace. | <pre>object({<br/> enabled = optional(bool, true)<br/> cluster_name = optional(string, null)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> # Rate limit for notifications, e.g. "300s" for 5 minutes, used only for log match alerts<br/> logmatch_notification_rate_limit = optional(string, "300s")<br/> alert_documentation = optional(string, null)<br/> auto_close_seconds = optional(number, 3600)<br/> namespace = optional(string, "kyverno")<br/> # List of error patterns to exclude from the default set.<br/> # Default patterns available for exclusion:<br/> # "internal error", "failed calling webhook", "timeout", "client-side throttling",<br/> # "failed to run warmup", "schema not found", "failed to list resources",<br/> # "failed to watch resource", "context deadline exceeded", "is forbidden",<br/> # "cannot list resource", "cannot watch resource", "RBAC.*denied",<br/> # "failed to start watcher", "leader election lost", "unable to update .*WebhookConfiguration",<br/> # "failed to sync", "dropping request", "failed to load certificate",<br/> # "failed to update lock", "the object has been modified", "no matches for kind",<br/> # "the server could not find the requested resource", "Too Many Requests", "x509",<br/> # "is invalid:", "connection refused", "no agent available", "fatal error", "panic"<br/> error_patterns_exclude = optional(list(string), [])<br/> # List of additional error patterns to include (added to default set)<br/> # e.g. ["my custom error", "another pattern"]<br/> error_patterns_include = optional(list(string), [])<br/> })</pre> | `{}` | no |
| <a name="input_litellm"></a> [litellm](#input\_litellm) | Configuration for LiteLLM monitoring alerts. Supports uptime checks for HTTP endpoints and container-level alerts (pod restarts) in GKE. Each app is identified by its name (map key). | <pre>object({<br/> enabled = optional(bool, false)<br/> project_id = optional(string, null)<br/> notification_enabled = optional(bool, true)<br/> notification_channels = optional(list(string), [])<br/> cluster_name = optional(string, null)<br/><br/> apps = optional(map(object({<br/> uptime_check = optional(object({<br/> enabled = optional(bool, true)<br/> host = string<br/> path = optional(string, "/health/readiness")<br/> }), null)<br/><br/> container_check = optional(object({<br/> enabled = optional(bool, true)<br/> namespace = string<br/> pod_restart = optional(object({<br/> threshold = optional(number, 0)<br/> alignment_period = optional(number, 60)<br/> duration = optional(number, 180)<br/> auto_close_seconds = optional(number, 3600)<br/> notification_prompts = optional(list(string), null)<br/> }), {})<br/> }), null)<br/> })), {})<br/> })</pre> | `{}` | no |
| <a name="input_notification_channels"></a> [notification\_channels](#input\_notification\_channels) | List of notification channel IDs to notify when an alert is triggered | `list(string)` | `[]` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | The Google Cloud project ID where logging exclusions will be created | `string` | n/a | yes |

examples/main.tf

Lines changed: 10 additions & 3 deletions
@@ -50,9 +50,16 @@ module "example" {
   kyverno = {
     cluster_name = "test-cluster"
     notification_channels = []
-    # Optional filter for log entries, exclude known non-actionable messages
-    # e.g., "-textPayload:\"stale GroupVersion discovery: metrics.k8s.io/v1beta1\""
-    filter_extra = "-textPayload:\"stale GroupVersion discovery: metrics.k8s.io/v1beta1\""
+    # Exclude specific error patterns from the default set
+    error_patterns_exclude = [
+      "failed to start watcher",
+      "failed to list resources",
+    ]
+    # Add custom error patterns to the default set
+    # error_patterns_include = [
+    #   "my custom error",
+    #   "another pattern to match",
+    # ]
   }
   cert_manager = {
     cluster_name = "test-cluster"
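
As a usage sketch for the new options: `kyverno.tf` drops patterns with `contains()`, so an entry in `error_patterns_exclude` only takes effect when it equals one of the default pattern strings exactly (same case, same regex text). The module label, `source` path, and project ID below are placeholders chosen for illustration, not values from this repository.

```hcl
# Hypothetical module call; label, source path, and project ID are assumptions.
module "monitoring_demo" {
  source     = "../"
  project_id = "my-project"

  kyverno = {
    cluster_name = "test-cluster"

    # Each entry must match a default pattern string verbatim.
    # "RBAC.*denied" is the literal default entry, regex characters included.
    error_patterns_exclude = [
      "RBAC.*denied",
      "timeout",
    ]

    # Extra patterns are appended to the default set and matched
    # case-insensitively against jsonPayload.error.
    error_patterns_include = [
      "my custom error",
    ]
  }
}
```

An entry such as `"Timeout"` or `"rbac denied"` would be ignored silently, since no default pattern is spelled that way.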

kyverno.tf

Lines changed: 54 additions & 32 deletions
@@ -5,7 +5,58 @@ locals {
 
   kyverno_cluster_name = var.kyverno.cluster_name != null ? trimspace(var.kyverno.cluster_name) : ""
 
-  kyverno_log_filter = local.kyverno_cluster_name != "" ? (<<-EOT
+  # Default error patterns for Kyverno log matching
+  kyverno_default_error_patterns = [
+    "internal error",
+    "failed calling webhook",
+    "timeout",
+    "client-side throttling",
+    "failed to run warmup",
+    "schema not found",
+    "failed to list resources",
+    "failed to watch resource",
+    "context deadline exceeded",
+    "is forbidden",
+    "cannot list resource",
+    "cannot watch resource",
+    "RBAC.*denied",
+    "failed to start watcher",
+    "leader election lost",
+    "unable to update .*WebhookConfiguration",
+    "failed to sync",
+    "dropping request",
+    "failed to load certificate",
+    "failed to update lock",
+    "the object has been modified",
+    "no matches for kind",
+    "the server could not find the requested resource",
+    "Too Many Requests",
+    "x509",
+    "is invalid:",
+    "connection refused",
+    "no agent available",
+    "fatal error",
+    "panic",
+  ]
+
+  # Combine default patterns with included patterns, then filter out excluded ones
+  kyverno_all_error_patterns = concat(
+    local.kyverno_default_error_patterns,
+    var.kyverno.error_patterns_include
+  )
+
+  kyverno_active_error_patterns = [
+    for pattern in local.kyverno_all_error_patterns :
+    pattern if !contains(var.kyverno.error_patterns_exclude, pattern)
+  ]
+
+  # Build the error patterns filter string
+  kyverno_error_patterns_filter = length(local.kyverno_active_error_patterns) > 0 ? join("\n OR ", [
+    for pattern in local.kyverno_active_error_patterns :
+    "jsonPayload.error=~\"(?i)${pattern}\""
+  ]) : ""
+
+  kyverno_log_filter = local.kyverno_cluster_name != "" && length(local.kyverno_active_error_patterns) > 0 ? (<<-EOT
     resource.type="k8s_container"
     AND resource.labels.project_id="${local.kyverno_project_id}"
     AND resource.labels.cluster_name="${local.kyverno_cluster_name}"
@@ -15,38 +66,8 @@ locals {
       OR resource.labels.pod_name=~"kyverno-(admission|background|cleanup|reports)-controller-.*"
     )
     AND (
-      jsonPayload.error=~"(?i)internal error"
-      OR jsonPayload.error=~"(?i)failed calling webhook"
-      OR jsonPayload.error=~"(?i)timeout"
-      OR jsonPayload.error=~"(?i)client-side throttling"
-      OR jsonPayload.error=~"(?i)failed to run warmup"
-      OR jsonPayload.error=~"(?i)schema not found"
-      OR jsonPayload.error=~"(?i)failed to list resources"
-      OR jsonPayload.error=~"(?i)failed to watch resource"
-      OR jsonPayload.error=~"(?i)context deadline exceeded"
-      OR jsonPayload.error=~"(?i)is forbidden"
-      OR jsonPayload.error=~"(?i)cannot list resource"
-      OR jsonPayload.error=~"(?i)cannot watch resource"
-      OR jsonPayload.error=~"(?i)RBAC.*denied"
-      OR jsonPayload.error=~"(?i)failed to start watcher"
-      OR jsonPayload.error=~"(?i)leader election lost"
-      OR jsonPayload.error=~"(?i)unable to update .*WebhookConfiguration"
-      OR jsonPayload.error=~"(?i)failed to sync"
-      OR jsonPayload.error=~"(?i)dropping request"
-      OR jsonPayload.error=~"(?i)failed to load certificate"
-      OR jsonPayload.error=~"(?i)failed to update lock"
-      OR jsonPayload.error=~"(?i)the object has been modified"
-      OR jsonPayload.error=~"(?i)no matches for kind"
-      OR jsonPayload.error=~"(?i)the server could not find the requested resource"
-      OR jsonPayload.error=~"(?i)Too Many Requests"
-      OR jsonPayload.error=~"(?i)x509"
-      OR jsonPayload.error=~"(?i)is invalid:"
-      OR jsonPayload.error=~"(?i)connection refused"
-      OR jsonPayload.error=~"(?i)no agent available"
-      OR jsonPayload.error=~"(?i)fatal error"
-      OR jsonPayload.error=~"(?i)panic"
+      ${local.kyverno_error_patterns_filter}
     )
-    ${trimspace(var.kyverno.filter_extra)}
   EOT
   ) : ""
 }
@@ -55,6 +76,7 @@ resource "google_monitoring_alert_policy" "kyverno_logmatch_alert" {
   count = (
     var.kyverno.enabled
     && local.kyverno_cluster_name != ""
+    && length(local.kyverno_active_error_patterns) > 0
   ) ? 1 : 0
 
   display_name = "Kyverno controllers ERROR logs (namespace=${var.kyverno.namespace})"
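
The list handling in the `locals` block above can be tried in isolation. The following is a minimal sketch, assuming a shortened stand-in for the default list; every `demo_*` name and both outputs are hypothetical and exist only for this illustration. It can be applied in an empty directory, since it declares no providers or resources.

```hcl
locals {
  # Shortened stand-in for kyverno_default_error_patterns (assumption).
  demo_default_patterns = ["timeout", "x509", "panic"]

  # Stand-ins for var.kyverno.error_patterns_include / error_patterns_exclude.
  demo_include = ["my custom error"]
  demo_exclude = ["x509", "Timeout"] # "Timeout" removes nothing: contains() compares exact strings

  # Same two steps as in kyverno.tf: concatenate, then drop excluded entries.
  demo_active = [
    for pattern in concat(local.demo_default_patterns, local.demo_include) :
    pattern if !contains(local.demo_exclude, pattern)
  ]

  # One case-insensitive regex clause per remaining pattern, joined with OR.
  demo_filter = length(local.demo_active) > 0 ? join("\n OR ", [
    for pattern in local.demo_active :
    "jsonPayload.error=~\"(?i)${pattern}\""
  ]) : ""
}

output "demo_active" {
  # Evaluates to ["timeout", "panic", "my custom error"]
  value = local.demo_active
}

output "demo_filter" {
  value = local.demo_filter
}
```

Two consequences follow from this design. Exclusion is an exact, case-sensitive comparison on the pattern text, even though the generated filter itself matches case-insensitively via `(?i)`. And if every pattern is excluded, the active list is empty, so the diff above sets the log filter to `""` and the alert policy `count` to 0 rather than emitting an empty `AND ( )` clause.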

variables.tf

Lines changed: 17 additions & 3 deletions
@@ -69,7 +69,7 @@ variable "cloud_sql" {
 }
 
 variable "kyverno" {
-  description = "Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, extra filters, and namespace."
+  description = "Configuration for Kyverno monitoring alerts. Allows customization of cluster name, project, notification channels, alert documentation, metric thresholds, auto-close timing, enablement, error pattern inclusions/exclusions, and namespace."
   default = {}
   type = object({
     enabled = optional(bool, true)
@@ -81,8 +81,22 @@ variable "kyverno" {
     logmatch_notification_rate_limit = optional(string, "300s")
     alert_documentation = optional(string, null)
     auto_close_seconds = optional(number, 3600)
-    filter_extra = optional(string, "")
-    namespace = optional(string, "kyverno")
+    namespace = optional(string, "kyverno")
+    # List of error patterns to exclude from the default set.
+    # Default patterns available for exclusion:
+    # "internal error", "failed calling webhook", "timeout", "client-side throttling",
+    # "failed to run warmup", "schema not found", "failed to list resources",
+    # "failed to watch resource", "context deadline exceeded", "is forbidden",
+    # "cannot list resource", "cannot watch resource", "RBAC.*denied",
+    # "failed to start watcher", "leader election lost", "unable to update .*WebhookConfiguration",
+    # "failed to sync", "dropping request", "failed to load certificate",
+    # "failed to update lock", "the object has been modified", "no matches for kind",
+    # "the server could not find the requested resource", "Too Many Requests", "x509",
+    # "is invalid:", "connection refused", "no agent available", "fatal error", "panic"
+    error_patterns_exclude = optional(list(string), [])
+    # List of additional error patterns to include (added to default set)
+    # e.g. ["my custom error", "another pattern"]
+    error_patterns_include = optional(list(string), [])
   })
 
   validation {
