Prevent disruption of core services using safety guardrails#129

Open
Ayushdk wants to merge 3 commits into krkn-chaos:main from Ayushdk:fix/prevent-core-service-disruption

@Ayushdk Ayushdk commented Jan 28, 2026

User description

Fixes #25

This PR introduces safety guardrails to prevent Krkn-AI from disrupting
critical services it depends on (e.g. Prometheus, monitoring stack).

Key changes:

  • A default SafetyConfig is applied even when the user does not specify a safety section
  • Protected cluster components are marked disabled instead of being removed
  • The genetic algorithm operates only on safe, active components
  • Added a unit test validating namespace exclusion

This ensures the observation layer remains available during chaos experiments.


PR Type

Bug fix, Enhancement


Description

  • Implements safety guardrails to prevent disruption of critical Kubernetes services

  • Adds SafetyConfig model with namespace, pod, and node exclusion rules

  • Applies safety configuration automatically during config initialization

  • Filters out protected components before genetic algorithm operates

  • Includes unit test validating namespace exclusion functionality


Diagram Walkthrough

flowchart LR
  A["Config File"] -->|"read_config_from_file"| B["Parsed Config"]
  B -->|"apply_safety"| C["Mark Protected Components Disabled"]
  C -->|"get_active_components"| D["Active Components Only"]
  D -->|"Genetic Algorithm"| E["Safe Chaos Experiments"]
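
The flow in the diagram can be sketched in a few lines. This is a minimal, stdlib-only illustration and not the PR's code: the real models are pydantic classes, `apply_safety()` takes a `SafetyConfig` rather than a plain list, and the `Namespace` class here is a simplified stand-in.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from typing import List


@dataclass
class Namespace:
    # Simplified stand-in for the PR's namespace model (assumption).
    name: str
    disabled: bool = False


@dataclass
class ClusterComponents:
    namespaces: List[Namespace] = field(default_factory=list)

    def apply_safety(self, excluded_namespaces: List[str]) -> None:
        # Mark protected namespaces as disabled instead of removing them.
        for ns in self.namespaces:
            if any(fnmatch(ns.name, pat) for pat in excluded_namespaces):
                ns.disabled = True

    def get_active_components(self) -> "ClusterComponents":
        # Return a new object holding only the namespaces still enabled.
        return ClusterComponents([ns for ns in self.namespaces if not ns.disabled])


components = ClusterComponents([Namespace("kube-system"), Namespace("shop")])
components.apply_safety(["kube-system", "kube-public", "kube-node-lease"])
active = components.get_active_components()
print([ns.name for ns in active.namespaces])  # prints ['shop']
```

Only the filtered object would be handed to the genetic algorithm, so protected namespaces never become candidates for disruption.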

File Walkthrough

Relevant files
Enhancement
safety.py
New SafetyConfig model for protection rules                           

krkn_ai/models/safety.py

  • New file defining SafetyConfig model with safety guardrails
  • Default excluded namespaces: kube-system, kube-public, kube-node-lease
  • Support for excluding pods by labels and name patterns
  • Support for excluding nodes by labels
+21/-0   
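
A rough sketch of what such a model might look like. The field names and default namespaces come from this PR; the use of a dataclass instead of a pydantic `BaseModel`, and the empty defaults for the pod and node fields, are assumptions made to keep the example dependency-free.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SafetyConfig:
    # Default namespaces listed in the PR description.
    excluded_namespaces: List[str] = field(
        default_factory=lambda: ["kube-system", "kube-public", "kube-node-lease"]
    )
    # Empty defaults below are assumed, not confirmed by the PR.
    excluded_pod_labels: List[str] = field(default_factory=list)
    excluded_pod_name_patterns: List[str] = field(default_factory=list)
    excluded_node_labels: List[str] = field(default_factory=list)


defaults = SafetyConfig()
custom = SafetyConfig(excluded_pod_labels=["app=prometheus"])
```

Using a factory for each list keeps instances from sharing one mutable default.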
cluster_components.py
Add apply_safety method to ClusterComponents                         

krkn_ai/models/cluster_components.py

  • Added apply_safety() method to mark protected components as disabled
  • Implements namespace pattern matching using fnmatch
  • Implements pod exclusion by labels and regex name patterns
  • Implements node exclusion by labels with debug logging
+57/-0   
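
Because namespace exclusion uses `fnmatch`, the patterns are shell-style globs, not regular expressions. The sketch below illustrates the matching semantics; the helper name `is_protected_namespace` is hypothetical, and it uses `fnmatchcase` (the PR uses `fnmatch.fnmatch`) so that case handling does not depend on the operating system.

```python
from fnmatch import fnmatchcase
from typing import List


def is_protected_namespace(name: str, patterns: List[str]) -> bool:
    # A namespace is protected if any glob pattern matches its full name.
    return any(fnmatchcase(name, pattern) for pattern in patterns)


patterns = ["kube-system", "openshift-*"]

exact = is_protected_namespace("kube-system", patterns)
wildcard = is_protected_namespace("openshift-monitoring", patterns)
longer_name = is_protected_namespace("kube-system-extra", patterns)
unrelated = is_protected_namespace("shop", patterns)
```

A literal pattern must match the whole name, so `kube-system` does not protect `kube-system-extra`; a trailing `*` is needed for prefix matching.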
config.py
Integrate SafetyConfig into ConfigFile model                         

krkn_ai/models/config.py

  • Added import for SafetyConfig model
  • Added safety field to ConfigFile with default factory
  • Safety configuration now part of main configuration structure
+3/-1     
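
The effect of a default factory on the `safety` field can be illustrated as follows. This is a sketch under assumptions: dataclasses stand in for the pydantic models, and `load_config` is a hypothetical loader, not the project's `read_config_from_file`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SafetyConfig:
    excluded_namespaces: List[str] = field(
        default_factory=lambda: ["kube-system", "kube-public", "kube-node-lease"]
    )


@dataclass
class ConfigFile:
    # The factory builds a fresh default SafetyConfig when none is supplied.
    safety: SafetyConfig = field(default_factory=SafetyConfig)


def load_config(raw: Dict[str, Any]) -> ConfigFile:
    # Hypothetical loader: use the user's safety section if present.
    if "safety" in raw:
        return ConfigFile(safety=SafetyConfig(**raw["safety"]))
    return ConfigFile()


no_section = load_config({})
with_section = load_config({"safety": {"excluded_namespaces": ["monitoring"]}})
```

A config without a safety section still ends up with the default exclusions, which is the behaviour the PR description calls out.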
cmd.py
Apply safety guardrails during config initialization         

krkn_ai/cli/cmd.py

  • Added safety configuration application after config parsing
  • Calls apply_safety() on cluster components with parsed safety config
  • Filters to active components before algorithm execution
  • Added debug logging to verify protected namespaces
+10/-0   
Tests
test_safety.py
Unit test for safety configuration exclusion                         

tests/test_safety.py

  • New test file validating safety guardrail functionality
  • Tests that default SafetyConfig excludes kube-system namespace
  • Verifies get_active_components() filters protected namespaces
    correctly
+39/-0   

Signed-off-by: Ayush Nagar <ayushnagar2310@gmail.com>

qodo-code-review bot commented Jan 28, 2026

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Regex DoS

Description: User-configurable regexes in safety.excluded_pod_name_patterns are applied via
re.match(pattern, pod.name) and could cause catastrophic backtracking (ReDoS) and CPU
exhaustion if untrusted or poorly-bounded patterns are supplied.
cluster_components.py [94-145]

Referred Code
def apply_safety(self, safety: SafetyConfig) -> None:
    """
    Apply safety rules by marking protected components as disabled.

    This ensures Krkn-AI never disrupts its own dependencies
    (Prometheus, monitoring stack, control-plane, etc).
    """

    # namespaces & their children
    for namespace in self.namespaces:
        # namespace exclusion
        for pattern in safety.excluded_namespaces:
            if fnmatch.fnmatch(namespace.name, pattern):
                namespace.disabled = True
                logger.info(f"🛡️  Protected namespace: {namespace.name}")
                break

        # skip pod-level checks if namespace disabled
        if namespace.disabled:
            continue



 ... (clipped 31 lines)
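
One possible partial mitigation is sketched below. Python's `re` module has no built-in timeout, so this example only bounds pattern and input length and compiles patterns once; it does not eliminate catastrophic backtracking. The function names and `MAX_PATTERN_LENGTH` are hypothetical; 253 is the Kubernetes limit for DNS-subdomain names such as pod names.

```python
import re
from typing import List

MAX_PATTERN_LENGTH = 128  # hypothetical limit, tune as needed
MAX_NAME_LENGTH = 253     # Kubernetes DNS-subdomain name limit


def compile_bounded(patterns: List[str]) -> List[re.Pattern]:
    # Reject over-long patterns and compile the rest once, up front.
    compiled = []
    for pattern in patterns:
        if len(pattern) > MAX_PATTERN_LENGTH:
            raise ValueError(
                f"pattern too long ({len(pattern)} > {MAX_PATTERN_LENGTH})"
            )
        compiled.append(re.compile(pattern))
    return compiled


def name_is_excluded(name: str, compiled: List[re.Pattern]) -> bool:
    # Cap the input length so the regex engine never sees unbounded text.
    candidate = name[:MAX_NAME_LENGTH]
    return any(rx.match(candidate) for rx in compiled)


compiled = compile_bounded(["^prometheus-"])
```

If the patterns come only from a trusted operator-owned config file, the practical risk is lower, but bounding inputs is cheap.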
Information disclosure

Description: The CLI logs all remaining namespace names after safety filtering, which may expose
cluster topology and resource names in logs accessible to a broader audience than intended.
cmd.py [81-89]

Referred Code
logger.info("Applying safety configuration...")
parsed_config.cluster_components.apply_safety(parsed_config.safety)
parsed_config.cluster_components = (
    parsed_config.cluster_components.get_active_components()
)
# 🔍 DEBUG: verify safety worked
logger.info("Remaining namespaces after safety:")
for ns in parsed_config.cluster_components.namespaces:
    logger.info(" - %s", ns.name)
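
One way to address this concern, sketched under assumptions, is to log only a count at INFO level and keep individual names at DEBUG. The logger name and the helper `log_remaining_namespaces` are hypothetical.

```python
import logging
from typing import List

logger = logging.getLogger("krkn_ai.safety_sketch")  # hypothetical logger name


def log_remaining_namespaces(names: List[str]) -> None:
    # INFO carries only an aggregate; names appear only when DEBUG is enabled.
    logger.info("Safety applied: %d namespaces remain eligible", len(names))
    if logger.isEnabledFor(logging.DEBUG):
        for name in names:
            logger.debug("Eligible namespace: %s", name)
```

With the logger at INFO, a call such as `log_remaining_namespaces(["shop", "payments"])` emits a single summary line and no namespace names.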
Ticket Compliance
🟡
🎫 #25
🟢 Prevent disruption of core/dependency services (e.g., Prometheus/monitoring/control-plane)
by implementing an exclusion mechanism so they are not selected for disruption.
Alternatively, implement a retry mechanism when calling external services to capture
results reliably.
Codebase Duplication Compliance
Codebase context is not defined

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Unhandled regex errors: apply_safety() uses re.match(pattern, pod.name) on externally supplied patterns without
validating/compiling them or catching re.error, which can crash execution on invalid
patterns.

Referred Code
# name pattern exclusion
for pattern in safety.excluded_pod_name_patterns:
    if re.match(pattern, pod.name):
        pod.disabled = True
        logger.debug(
            f"🛡️  Protected pod by pattern: {pod.name}"
        )
        break
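
An alternative to catching errors at match time is to validate every pattern once when the config is loaded. The sketch below is an assumption about how that could look; `compile_patterns` is a hypothetical helper.

```python
import re
from typing import List, Tuple


def compile_patterns(patterns: List[str]) -> Tuple[List[re.Pattern], List[str]]:
    # Compile each pattern once; collect readable errors for invalid ones.
    compiled, invalid = [], []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error as exc:
            invalid.append(f"{pattern!r}: {exc}")
    return compiled, invalid


compiled, invalid = compile_patterns(["^prometheus-", "([unclosed"])
```

The caller can then decide whether invalid patterns should abort startup or merely be reported and skipped.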

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unvalidated external patterns: Safety exclusion inputs (e.g., excluded_pod_name_patterns regex strings and label
expressions) are accepted and executed without validation/sanitization, enabling malformed
patterns or potentially expensive regex evaluation from config input.

Referred Code
# pods
for pod in namespace.pods:
    # label-based exclusion
    for label in safety.excluded_pod_labels:
        if "=" in label:
            key, value = label.split("=", 1)
            if pod.labels.get(key) == value:
                pod.disabled = True
                logger.debug(
                    f"🛡️  Protected pod by label: {pod.name}"
                )
                break

    # name pattern exclusion
    for pattern in safety.excluded_pod_name_patterns:
        if re.match(pattern, pod.name):
            pod.disabled = True
            logger.debug(
                f"🛡️  Protected pod by pattern: {pod.name}"
            )
            break
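
Label expressions could be validated before use along these lines. This is a minimal sketch: `parse_label_selector` is a hypothetical helper, and it only rejects empty keys, since Kubernetes permits empty label values.

```python
from typing import Optional, Tuple


def parse_label_selector(raw: str) -> Tuple[str, Optional[str]]:
    """Split 'key=value' or a bare 'key'; raise ValueError on an empty key."""
    key, sep, value = raw.strip().partition("=")
    key = key.strip()
    if not key:
        raise ValueError(f"label selector has an empty key: {raw!r}")
    # A bare key (no '=') means "label is present, any value".
    return (key, value.strip()) if sep else (key, None)


pair = parse_label_selector("app=prometheus")
key_only = parse_label_selector("node-role.kubernetes.io/master")
```

Parsing once at load time gives malformed entries a clear error message instead of a silent non-match.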

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing audit context: New safety-related actions are logged without audit-trail context such as actor/user
identity and explicit outcome fields, making it unclear whether audit requirements are
met.

Referred Code
def apply_safety(self, safety: SafetyConfig) -> None:
    """
    Apply safety rules by marking protected components as disabled.

    This ensures Krkn-AI never disrupts its own dependencies
    (Prometheus, monitoring stack, control-plane, etc).
    """

    # namespaces & their children
    for namespace in self.namespaces:
        # namespace exclusion
        for pattern in safety.excluded_namespaces:
            if fnmatch.fnmatch(namespace.name, pattern):
                namespace.disabled = True
                logger.info(f"🛡️  Protected namespace: {namespace.name}")
                break

        # skip pod-level checks if namespace disabled
        if namespace.disabled:
            continue



 ... (clipped 29 lines)

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Unstructured debug logging: New logs print potentially sensitive environment details (namespace names) using free-form
strings (including emojis) rather than structured fields, so it is unclear whether logging
remains structured and policy-compliant.

Referred Code
logger.info("Applying safety configuration...")
parsed_config.cluster_components.apply_safety(parsed_config.safety)
parsed_config.cluster_components = (
    parsed_config.cluster_components.get_active_components()
)
# 🔍 DEBUG: verify safety worked
logger.info("Remaining namespaces after safety:")
for ns in parsed_config.cluster_components.namespaces:
    logger.info(" - %s", ns.name)

Compliance status legend
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label


qodo-code-review bot commented Jan 28, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
High-level
Expand default safety rules for monitoring

Expand the default excluded_namespaces in SafetyConfig to include common
monitoring namespaces like monitoring and openshift-monitoring. This change
provides better default protection for the observation layer.

Examples:

krkn_ai/models/safety.py [11-15]
    excluded_namespaces: List[str] = [
        "kube-system",
        "kube-public",
        "kube-node-lease",
    ]

Solution Walkthrough:

Before:

# In krkn_ai/models/safety.py
class SafetyConfig(BaseModel):
    """
    Safety guardrails to prevent Krkn-AI from disrupting
    services it depends on...
    """

    excluded_namespaces: List[str] = [
        "kube-system",
        "kube-public",
        "kube-node-lease",
    ]
    # ...

After:

# In krkn_ai/models/safety.py
class SafetyConfig(BaseModel):
    """
    Safety guardrails to prevent Krkn-AI from disrupting
    services it depends on...
    """

    excluded_namespaces: List[str] = [
        "kube-system",
        "kube-public",
        "kube-node-lease",
        "monitoring",
        "openshift-monitoring",
    ]
    # ...
Suggestion importance[1-10]: 8

Why: This suggestion directly enhances the PR's primary goal of protecting the observation layer by making the default safety configuration more robust for common monitoring setups, significantly improving its out-of-the-box effectiveness.

Medium
Possible issue
Handle invalid regex patterns
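Suggestion Impact: Introduced a helper `safe_regex_match()` that wraps regex evaluation in a try/except `re.error` block, logs and skips invalid patterns, and replaced the direct `re.match()` call in pod name exclusion with this helper. The helper calls `re.search()` rather than `re.match()`, so unanchored patterns now match anywhere in the pod name instead of only at its start.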

code diff:

+def safe_regex_match(pattern: str, value: str) -> bool:
+    """
+    Safely evaluate regex patterns from user config.
+    Prevents crashes and avoids invalid regex execution.
+    """
+    try:
+        return re.search(pattern, value) is not None
+    except re.error as exc:
+        logger.warning(
+            "Invalid regex pattern '%s' skipped: %s",
+            pattern,
+            exc,
+        )
+        return False
+
 
 class Container(BaseModel):
     name: str
@@ -105,7 +121,7 @@
             for pattern in safety.excluded_namespaces:
                 if fnmatch.fnmatch(namespace.name, pattern):
                     namespace.disabled = True
-                    logger.info(f"🛡️  Protected namespace: {namespace.name}")
+                    logger.info("Protected namespace excluded from chaos: %s", namespace.name)
                     break
 
             # skip pod-level checks if namespace disabled
@@ -121,16 +137,16 @@
                         if pod.labels.get(key) == value:
                             pod.disabled = True
                             logger.debug(
-                                f"🛡️  Protected pod by label: {pod.name}"
+                                "Protected pod by label: %s", pod.name
                             )
                             break
 
                 # name pattern exclusion
                 for pattern in safety.excluded_pod_name_patterns:
-                    if re.match(pattern, pod.name):
+                    if safe_regex_match(pattern, pod.name):
                         pod.disabled = True
                         logger.debug(
-                            f"🛡️  Protected pod by pattern: {pod.name}"
+                            "Protected pod by pattern: %s", pod.name
                         )
                         break

Add error handling for regex matching in pod name exclusion. Wrap the re.match
call in a try/except re.error block to prevent invalid patterns in the
configuration from crashing the program.

krkn_ai/models/cluster_components.py [129-135]

 for pattern in safety.excluded_pod_name_patterns:
-    if re.match(pattern, pod.name):
-        pod.disabled = True
-        logger.debug(
-            f"🛡️  Protected pod by pattern: {pod.name}"
-        )
-        break
+    try:
+        if re.match(pattern, pod.name):
+            pod.disabled = True
+            logger.debug(f"🛡️  Protected pod by pattern: {pod.name}")
+            break
+    except re.error:
+        logger.warning(f"Ignoring invalid pod name regex: {pattern}")
+        continue

[Suggestion processed]

Suggestion importance[1-10]: 7

Why: This suggestion correctly points out that an invalid regex pattern in the configuration could crash the application. Adding a try-except block makes the code more robust and prevents a misconfiguration from causing a runtime error.
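
The applied diff uses `re.search` inside `safe_regex_match()`, while the original code and this suggestion use `re.match`. The two differ for unanchored patterns, as this small illustration shows (the pod name is made up for the example):

```python
import re

pattern = "prometheus"
name = "kube-prometheus-stack-0"  # illustrative pod name

# re.match only tries to match at the start of the string.
matched_at_start = re.match(pattern, name) is not None

# re.search scans the whole string for the first match.
matched_anywhere = re.search(pattern, name) is not None

# An explicit anchor restores start-only behaviour under re.search.
anchored_search = re.search("^" + pattern, name) is not None
```

Users who relied on start-of-name matching would need to prefix their patterns with `^` after the switch.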

Medium
Skip name-pattern checks for disabled pods

Optimize pod exclusion by adding a check to skip name-pattern matching if a pod
has already been disabled by its labels. This prevents redundant processing.

krkn_ai/models/cluster_components.py [116-135]

 for pod in namespace.pods:
     # label-based exclusion
     for label in safety.excluded_pod_labels:
         if "=" in label:
             key, value = label.split("=", 1)
             if pod.labels.get(key) == value:
                 pod.disabled = True
-                logger.debug(
-                    f"🛡️  Protected pod by label: {pod.name}"
-                )
+                logger.debug(f"🛡️  Protected pod by label: {pod.name}")
                 break
+    if pod.disabled:
+        continue  # skip name-pattern exclusion for already-disabled pods
 
     # name pattern exclusion
     for pattern in safety.excluded_pod_name_patterns:
         if re.match(pattern, pod.name):
             pod.disabled = True
-            logger.debug(
-                f"🛡️  Protected pod by pattern: {pod.name}"
-            )
+            logger.debug(f"🛡️  Protected pod by pattern: {pod.name}")
             break
Suggestion importance[1-10]: 4

Why: This is a valid optimization that avoids unnecessary work by skipping name-pattern checks for pods already disabled by label matching. It improves code efficiency, although the performance impact is likely minor.
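
The same short-circuit can be expressed by pulling the decision into a function that returns as soon as one rule matches. This is a sketch only; `pod_is_protected` is a hypothetical helper and takes plain values rather than the PR's pod model.

```python
import re
from typing import Dict, List


def pod_is_protected(
    name: str,
    labels: Dict[str, str],
    excluded_labels: List[str],
    excluded_name_patterns: List[str],
) -> bool:
    # Label rules first: return immediately on the first key=value match.
    for selector in excluded_labels:
        key, sep, value = selector.partition("=")
        if sep and labels.get(key) == value:
            return True
    # Name patterns are evaluated only if no label rule matched.
    return any(re.match(p, name) for p in excluded_name_patterns)


by_label = pod_is_protected("db-0", {"app": "prometheus"}, ["app=prometheus"], [])
by_name = pod_is_protected("prometheus-0", {}, [], ["^prometheus-"])
neither = pod_is_protected("shop", {}, ["app=prometheus"], ["^prometheus-"])
```

The caller would then set `pod.disabled` from the return value, keeping the matching logic free of mutation.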

Low
General
Enhance pod label exclusion logic

Enhance pod label exclusion to support both key=value matching and key-only
presence checks. This aligns it with Kubernetes label selectors and improves
consistency with node label exclusion logic.

krkn_ai/models/cluster_components.py [117-126]

 # label-based exclusion
 for label in safety.excluded_pod_labels:
     if "=" in label:
         key, value = label.split("=", 1)
         if pod.labels.get(key) == value:
             pod.disabled = True
             logger.debug(
                 f"🛡️  Protected pod by label: {pod.name}"
             )
             break
+    else:
+        if label in pod.labels:
+            pod.disabled = True
+            logger.debug(
+                f"🛡️  Protected pod by label key: {pod.name}"
+            )
+            break
Suggestion importance[1-10]: 6

Why: This suggestion correctly identifies an inconsistency between pod and node label exclusion logic. Implementing this change would enhance the feature's flexibility and provide a more intuitive and consistent user experience.

Low
Enhance node label exclusion logic

Enhance node label exclusion to support both key=value matching and key-only
presence checks. This change increases control and aligns the logic with the pod
exclusion mechanism for better consistency.

krkn_ai/models/cluster_components.py [137-143]

 # nodes
 for node in self.nodes:
     for label in safety.excluded_node_labels:
-        if label in node.labels:
-            node.disabled = True
-            logger.info(f"🛡️  Protected node: {node.name}")
-            break
+        if "=" in label:
+            key, value = label.split("=", 1)
+            if node.labels.get(key) == value:
+                node.disabled = True
+                logger.info(f"🛡️  Protected node by label: {node.name}")
+                break
+        else:
+            if label in node.labels:
+                node.disabled = True
+                logger.info(f"🛡️  Protected node by label key: {node.name}")
+                break
Suggestion importance[1-10]: 6

Why: This suggestion correctly identifies an inconsistency between node and pod label exclusion logic. Implementing this change would improve the feature by providing more granular control and creating a more consistent and intuitive API for users.
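
Both label suggestions converge on the same rule, which could live in one shared helper used for pods and nodes alike. The name `label_matches` is hypothetical, and this is a sketch rather than the PR's implementation.

```python
from typing import Dict


def label_matches(selector: str, labels: Dict[str, str]) -> bool:
    # 'key=value' requires an exact value; a bare 'key' only checks presence.
    key, sep, value = selector.partition("=")
    if sep:
        return labels.get(key) == value
    return key in labels


pod_labels = {"app": "prometheus"}
node_labels = {"node-role.kubernetes.io/control-plane": ""}

value_match = label_matches("app=prometheus", pod_labels)
value_mismatch = label_matches("app=grafana", pod_labels)
key_present = label_matches("node-role.kubernetes.io/control-plane", node_labels)
key_absent = label_matches("missing", node_labels)
```

Sharing one helper would keep pod and node exclusion consistent, which is the inconsistency both suggestions point at.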

Low

Signed-off-by: Ayush Nagar <ayushnagar2310@gmail.com>
Signed-off-by: Ayush Nagar <ayushnagar2310@gmail.com>
Development

Successfully merging this pull request may close these issues.

Disrupting Krkn-AI dependent services during a Krkn-AI test
