Skip to content

feat: add Kyverno monitoring alerts and update documentation#6

Merged
filippolmt merged 4 commits intomainfrom
feat/add_kyverno_monitoring
Oct 8, 2025
Merged

feat: add Kyverno monitoring alerts and update documentation#6
filippolmt merged 4 commits intomainfrom
feat/add_kyverno_monitoring

Conversation

@filippolmt
Copy link
Copy Markdown
Contributor

@filippolmt filippolmt commented Oct 7, 2025

PR Type

Enhancement


Description

  • Add Kyverno monitoring alerts for controller error logs

  • Refactor variable names from project to project_id

  • Add notification enablement controls for both services

  • Update documentation and improve code formatting


Changes walkthrough 📝

Relevant files
Enhancement
5 files
cloud-sql.tf
Refactor project variables and notification logic               
+11/-14 
main.tf
Update example with Kyverno configuration                               
+18/-8   
variables.tf
Add Kyverno variable and rename project variable                 
+23/-5   
kyverno_log_alert.tf
Add Kyverno error log monitoring alerts                                   
+50/-0   
variables.tf
Rename project variable and add Kyverno configuration       
+33/-18 
Miscellaneous
1 files
main.tf
Create empty main configuration file                                         
+1/-0     
Documentation
2 files
CHANGELOG.md
Document version 0.3.0 with Kyverno alerts                             
+7/-0     
README.md
Update documentation with Kyverno monitoring details         
+11/-5   
Configuration changes
1 files
Makefile
Add version variables for Docker tools                                     
+6/-4     
Tests
1 files
test.tfvars
Update test variables to use project_id                                   
+1/-2     

Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
  • @sparkfabrik-ai-bot
    Copy link
    Copy Markdown

    sparkfabrik-ai-bot bot commented Oct 7, 2025

    PR Reviewer Guide 🔍

    (Review updated until commit 87f2420)

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Missing Variable

    The code references var.kyverno.logmatch_notification_rate_limit in line 53, but this variable is not defined in the kyverno variable object in variables.tf. This will cause a Terraform error when the log match alert is created.

    period = var.kyverno.logmatch_notification_rate_limit
    Inconsistent Schema

    The kyverno variable definition in examples/variables.tf has different structure and field names compared to the main variables.tf file. This inconsistency could lead to confusion and errors when using the module.

    variable "kyverno" {
      description = "Configurazione completa del monitoraggio Kyverno"
      type = object({
        cluster_name            = string
        project_id              = optional(string, null)
        notification_channels   = optional(list(string), [])
        alert_documentation     = optional(string, "Kyverno controllers produced ERROR logs in namespace kyverno.")
        use_metric_threshold    = optional(bool, true)
        metric_threshold_count  = optional(number, 2)
        metric_lookback_minutes = optional(number, 1)
        auto_close_seconds      = optional(number, 3600)
        enabled                 = optional(bool, true)
        filter_extra            = optional(string, "")
        namespace               = optional(string, "kyverno")
      })
    }
    Breaking Change

    The variable name change from project to project_id is a breaking change that will require users to update their existing configurations. This should be clearly documented and potentially handled with backward compatibility.

    variable "project_id" {
      description = "The Google Cloud project ID where logging exclusions will be created"
      type        = string
    }

    @filippolmt filippolmt force-pushed the feat/add_kyverno_monitoring branch 2 times, most recently from 47433e6 to 00f0c52 Compare October 7, 2025 13:36
    @filippolmt
    Copy link
    Copy Markdown
    Contributor Author

    /description

    @filippolmt
    Copy link
    Copy Markdown
    Contributor Author

    /describe

    @filippolmt
    Copy link
    Copy Markdown
    Contributor Author

    /improve

    @sparkfabrik-ai-bot
    Copy link
    Copy Markdown

    PR Description updated to latest commit (00f0c52)

    @filippolmt filippolmt force-pushed the feat/add_kyverno_monitoring branch from 00f0c52 to f14cf79 Compare October 7, 2025 13:53
    @filippolmt
    Copy link
    Copy Markdown
    Contributor Author

    /describe

    @filippolmt
    Copy link
    Copy Markdown
    Contributor Author

    /improve

    @sparkfabrik-ai-bot
    Copy link
    Copy Markdown

    PR Description updated to latest commit (f14cf79)

    @filippolmt filippolmt force-pushed the feat/add_kyverno_monitoring branch 5 times, most recently from 1ee4b6b to 977f574 Compare October 7, 2025 15:45
    @filippolmt
    Copy link
    Copy Markdown
    Contributor Author

    /describe

    @sparkfabrik-ai-bot
    Copy link
    Copy Markdown

    PR Description updated to latest commit (977f574)

    @filippolmt filippolmt force-pushed the feat/add_kyverno_monitoring branch from 977f574 to 87f2420 Compare October 8, 2025 06:28
    @filippolmt
    Copy link
    Copy Markdown
    Contributor Author

    /describe

    @sparkfabrik-ai-bot
    Copy link
    Copy Markdown

    PR Description updated to latest commit (87f2420)

    @filippolmt
    Copy link
    Copy Markdown
    Contributor Author

    /review

    @filippolmt
    Copy link
    Copy Markdown
    Contributor Author

    /improve

    @sparkfabrik-ai-bot
    Copy link
    Copy Markdown

    Persistent review updated to latest commit 87f2420

    @filippolmt filippolmt self-assigned this Oct 8, 2025
    @filippolmt filippolmt force-pushed the feat/add_kyverno_monitoring branch 2 times, most recently from ee9d163 to 201158f Compare October 8, 2025 07:44
    @filippolmt filippolmt force-pushed the feat/add_kyverno_monitoring branch from 201158f to 583ba11 Compare October 8, 2025 07:45
    @filippolmt filippolmt force-pushed the feat/add_kyverno_monitoring branch from 1b1af58 to 40dad8f Compare October 8, 2025 12:10
    @filippolmt
    Copy link
    Copy Markdown
    Contributor Author

    /describe

    @filippolmt
    Copy link
    Copy Markdown
    Contributor Author

    /improve

    @sparkfabrik-ai-bot
    Copy link
    Copy Markdown

    sparkfabrik-ai-bot bot commented Oct 8, 2025

    PR Code Suggestions ✨

    Latest suggestions up to 40dad8f

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Fix undefined variable reference

    The variable var.kyverno.notification_enabled is referenced but not defined in the
    kyverno variable object. This will cause a Terraform error when the configuration is
    applied.

    kyverno_log_alert.tf [4]

    -kyverno_notification_channels = var.kyverno.notification_enabled ? (length(var.kyverno.notification_channels) > 0 ? var.kyverno.notification_channels : var.notification_channels) : []
    +kyverno_notification_channels = length(var.kyverno.notification_channels) > 0 ? var.kyverno.notification_channels : var.notification_channels
    Suggestion importance[1-10]: 10

    __

    Why: The variable var.kyverno.notification_enabled is referenced but not defined in the kyverno variable object in variables.tf, which would cause a Terraform error when applied.

    High
    Fix regex pattern syntax

    The regex pattern uses forward slashes as delimiters but Terraform's replace
    function expects the pattern as a string without delimiters. This will cause the
    replacement to fail and potentially create invalid metric names.

    kyverno_log_alert.tf [19-22]

     kyverno_metric_name = lower(replace(
       "kyverno_error_logs_count_${var.kyverno.cluster_name}_${var.kyverno.namespace}",
    -  "/[^a-zA-Z0-9_]/", "_"
    +  "[^a-zA-Z0-9_]", "_"
     ))
    Suggestion importance[1-10]: 10

    __

    Why: The regex pattern incorrectly uses forward slash delimiters which Terraform's replace function doesn't support, causing the replacement to fail and potentially creating invalid metric names.

    High
    General
    Simplify cluster name validation

    The count condition checks if cluster_name is not empty after trimming, but an empty
    string would still pass validation. Consider adding a validation rule or using a
    more robust check to prevent resource creation with invalid cluster names.

    kyverno_log_alert.tf [26-30]

     resource "google_monitoring_alert_policy" "kyverno_logmatch_alert" {
       count = (
         var.kyverno.enabled
         && !var.kyverno.use_metric_threshold
    -    && trimspace(var.kyverno.cluster_name) != ""
    +    && var.kyverno.cluster_name != ""
       ) ? 1 : 0
    Suggestion importance[1-10]: 3

    __

    Why: The suggestion simplifies the validation but trimspace() is actually more robust as it handles whitespace-only strings, so the original code is better.

    Low

    Previous suggestions

    Suggestions up to commit 87f2420
    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Fix threshold comparison logic

    Using threshold_value = var.kyverno.metric_threshold_count - 1 with COMPARISON_GT
    creates an off-by-one error. To trigger when count reaches the threshold, use the
    threshold value directly with COMPARISON_GE.

    kyverno_log_alert.tf [87-93]

     resource "google_monitoring_alert_policy" "kyverno_metric_threshold_alert" {
       count = (
         var.kyverno.enabled
         && var.kyverno.use_metric_threshold
         && trimspace(var.kyverno.cluster_name) != ""
       ) ? 1 : 0
       ...
       conditions {
         display_name = "Kyverno ERROR rate alert >= ${var.kyverno.metric_threshold_count} logs in ${var.kyverno.metric_lookback_minutes} min (namespace ${var.kyverno.namespace})"
         condition_threshold {
           ...
    -      threshold_value = var.kyverno.metric_threshold_count - 1
    +      comparison      = "COMPARISON_GE"
    +      threshold_value = var.kyverno.metric_threshold_count
         }
       }
     }
    Suggestion importance[1-10]: 8

    __

    Why: This is a critical logic error. Using threshold_value = var.kyverno.metric_threshold_count - 1 with COMPARISON_GT creates an off-by-one error that would trigger alerts incorrectly.

    Medium
    General
    Fix filter concatenation syntax

    The filter concatenation with ${trimspace(var.kyverno.filter_extra)} could create
    invalid log filter syntax if filter_extra is not empty and doesn't start with a
    logical operator. Add proper spacing and conditional logic.

    kyverno_log_alert.tf [6-17]

     locals {
       kyverno_log_filter = <<-EOT
         resource.type="k8s_container"
         resource.labels.project_id="${local.kyverno_project_id}"
         resource.labels.cluster_name="${var.kyverno.cluster_name}"
         resource.labels.namespace_name="${var.kyverno.namespace}"
         severity>=ERROR
         (
           labels."k8s-pod/app_kubernetes_io/component"=~"(admission-controller|background-controller|cleanup-controller|reports-controller)"
           OR resource.labels.pod_name=~"kyverno-(admission|background|cleanup|reports)-controller-.*"
         )
    -    ${trimspace(var.kyverno.filter_extra)}
    +    ${var.kyverno.filter_extra != "" ? "\n${trimspace(var.kyverno.filter_extra)}" : ""}
       EOT
     }
    Suggestion importance[1-10]: 6

    __

    Why: Valid concern about potential syntax issues when concatenating filter_extra, though the current implementation with trimspace() may work in most cases. The suggested improvement adds better conditional handling.

    Low
    Suggestions up to commit f14cf79
    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Remove undefined notification rate limit

    The notification_rate_limit block references var.kyverno.notification_rate_limit but
    this field is not defined in the variable schema. This will cause a Terraform error
    when the configuration is applied.

    kyverno_log_alert.tf [50-55]

     alert_strategy {
       auto_close = "${var.kyverno.auto_close_seconds}s"
    -  notification_rate_limit {
    -    period = var.kyverno.notification_rate_limit
    -  }
     }
    Suggestion importance[1-10]: 9

    __

    Why: This is a critical error that would cause Terraform to fail. The notification_rate_limit field is referenced but not defined in the kyverno variable schema, making this configuration invalid.

    High
    Fix variable name mismatch

    The variable name kyverno_log_alert_settings doesn't match the expected variable
    name kyverno defined in the module's variables.tf. This mismatch will cause
    Terraform to fail during plan/apply.

    examples/main.tf [51-60]

    -kyverno_log_alert_settings = {
    +kyverno = {
       cluster_name           = "test-cluster"
       enabled                = true
       use_metric_threshold   = true
       metric_threshold_count = 5
       notification_channels  = []
       # Optional filter for log entries, exclude known non-actionable messages
       # e.g., "-textPayload:\"stale GroupVersion discovery: metrics.k8s.io/v1beta1\""
       filter_extra = "-textPayload:\"stale GroupVersion discovery: metrics.k8s.io/v1beta1\""
     }
    Suggestion importance[1-10]: 9

    __

    Why: This is a critical error where kyverno_log_alert_settings is used instead of kyverno, which would cause Terraform to fail with an unknown argument error.

    High
    Update parameter name to project_id

    The module call uses project = var.project but the module now expects project_id
    instead of project based on the variable changes. This will cause Terraform to fail
    with an unknown argument error.

    examples/main.tf [49]

     module "example" {
       source  = "github.com/sparkfabrik/terraform-google-services-monitoring"
       version = ">= 0.1.0"
     
       notification_channels = var.notification_channels
    -  project               = var.project
    +  project_id            = var.project
       cloud_sql             = local.cloud_sql
       kyverno = {
         cluster_name           = "test-cluster"
         enabled                = true
         use_metric_threshold   = true
         metric_threshold_count = 5
         notification_channels  = []
         # Optional filter for log entries, exclude known non-actionable messages
         # e.g., "-textPayload:\"stale GroupVersion discovery: metrics.k8s.io/v1beta1\""
         filter_extra = "-textPayload:\"stale GroupVersion discovery: metrics.k8s.io/v1beta1\""
       }
     }
    Suggestion importance[1-10]: 9

    __

    Why: This is a critical error where the module parameter project should be project_id based on the variable changes in the PR, which would cause Terraform to fail.

    High
    Suggestions up to commit 00f0c52
    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Use consistent notification channel reference

    The alert policy uses var.kyverno.notification_channels directly instead of the
    local variable local.kyverno_notification_channels which handles fallback to default
    channels. This inconsistency could result in missing notifications when no specific
    channels are configured.

    kyverno_log_alert.tf [46]

     resource "google_monitoring_alert_policy" "kyverno_logmatch_alert" {
       count = (
         var.kyverno.enabled
         && !var.kyverno.use_metric_threshold
         && trimspace(var.kyverno.cluster_name) != ""
       ) ? 1 : 0
       ...
    -  notification_channels = var.kyverno.notification_channels
    +  notification_channels = local.kyverno_notification_channels
       ...
     }
    Suggestion importance[1-10]: 7

    __

    Why: The suggestion correctly identifies an inconsistency where var.kyverno.notification_channels is used instead of local.kyverno_notification_channels, which could result in missing fallback to default channels when no specific channels are configured.

    Medium
    General
    Fix threshold comparison logic

    The threshold calculation subtracts a small value (0.000001) from the configured
    count, which could cause unexpected behavior when the count is exactly at the
    threshold. Use COMPARISON_GTE instead of COMPARISON_GT for clearer threshold
    semantics.

    kyverno_log_alert.tf [90-91]

     resource "google_monitoring_alert_policy" "kyverno_metric_threshold_alert" {
       ...
       conditions {
         display_name = "Kyverno ERROR logs >= ${var.kyverno.metric_threshold_count} in ${var.kyverno.metric_lookback_minutes}m"
         condition_threshold {
           filter = format(
             "metric.type=\"logging.googleapis.com/user/%s\" resource.type=\"global\"",
             google_logging_metric.kyverno_error_metric[0].name
           )
           ...
    -      threshold_value = var.kyverno.metric_threshold_count - 0.000001
    +      comparison      = "COMPARISON_GTE"
    +      threshold_value = var.kyverno.metric_threshold_count
           ...
         }
       }
       ...
     }
    Suggestion importance[1-10]: 6

    __

    Why: The suggestion addresses a potential clarity issue with the threshold logic by recommending COMPARISON_GTE instead of COMPARISON_GT with a small subtraction, making the threshold behavior more explicit and predictable.

    Low
    Suggestions up to commit 75c592f
    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Fix threshold comparison precision issue

    The threshold calculation var.kyverno.metric_threshold_count - 0.000001 is fragile
    and could cause precision issues. Use integer comparison with COMPARISON_GTE instead
    of subtracting a small decimal from the threshold.

    kyverno_log_alert.tf [103-104]

     resource "google_monitoring_alert_policy" "kyverno_metric_threshold_alert" {
       count = (
         var.kyverno.enabled
         && var.kyverno.use_metric_threshold
         && trimspace(var.kyverno.cluster_name) != ""
       ) ? 1 : 0
       ...
       conditions {
         display_name = "Kyverno ERROR logs >= ${var.kyverno.metric_threshold_count} in ${var.kyverno.metric_lookback_minutes}m"
         condition_threshold {
           ...
    -      threshold_value = var.kyverno.metric_threshold_count - 0.000001
    +      comparison      = "COMPARISON_GTE"
    +      threshold_value = var.kyverno.metric_threshold_count
           ...
         }
       }
       ...
     }
    Suggestion importance[1-10]: 6

    __

    Why: The suggestion correctly identifies a potential precision issue with floating-point arithmetic. Using COMPARISON_GTE with the exact integer value is cleaner and more reliable than subtracting a small decimal.

    Low
    Add null check for cluster_name

    The condition checks if cluster_name is not empty after trimming, but doesn't
    validate if the variable itself is null. Add null check to prevent potential runtime
    errors.

    kyverno_log_alert.tf [30-33]

     count = (
       var.kyverno.enabled
       && var.kyverno.use_metric_threshold
    +  && var.kyverno.cluster_name != null
       && trimspace(var.kyverno.cluster_name) != ""
     ) ? 1 : 0
    Suggestion importance[1-10]: 5

    __

    Why: The suggestion identifies a potential null reference issue when calling trimspace() on a null value. However, the kyverno variable definition shows cluster_name is required (not optional), making null values less likely but still worth checking defensively.

    Low
    Add null safety for auto_close

    Direct access to var.cloud_sql.auto_close may cause errors if the variable is null.
    The removed local variable cloud_sql_auto_close provided proper null handling with
    fallback logic.

    cloud-sql.tf [87]

     alert_strategy {
    -  auto_close = var.cloud_sql.auto_close
    +  auto_close = var.cloud_sql.auto_close != null ? var.cloud_sql.auto_close : "86400s"
     }
    Suggestion importance[1-10]: 4

    __

    Why: The suggestion addresses a potential null reference issue, but the cloud_sql variable definition shows auto_close has a default value of "86400s", making null values unlikely. The concern is valid but less critical given the default.

    Low

    @sparkfabrik-ai-bot
    Copy link
    Copy Markdown

    PR Description updated to latest commit (40dad8f)

    Copy link
    Copy Markdown
    Contributor

    @andypanix andypanix left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Rifai un check, ti ho lasciato dei commenti. Cmq sai che non mi è del tutto chiara la questione use_metric_threshold? perchè dovremmo volerla attivare? mi pare più un'esercizio di stile che altro 🤔

    @filippolmt
    Copy link
    Copy Markdown
    Contributor Author

    /describe

    @sparkfabrik-ai-bot
    Copy link
    Copy Markdown

    PR Description updated to latest commit (717a669)

    @filippolmt filippolmt merged commit c294e54 into main Oct 8, 2025
    1 check passed
    @filippolmt filippolmt deleted the feat/add_kyverno_monitoring branch October 8, 2025 14:47
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    2 participants