Prevent disruption of core services using safety guardrails by Ayushdk · Pull Request #129 · krkn-chaos/krkn-ai

Ayushdk · 2026-01-28T14:49:24Z

User description

Fixes #25

This PR introduces safety guardrails to prevent Krkn-AI from disrupting
critical services it depends on (e.g. Prometheus, monitoring stack).

Key changes:

A default SafetyConfig is applied even when the user does not specify a safety section
Protected cluster components are marked disabled instead of being removed
The genetic algorithm operates only on safe, active components
Added a unit test validating namespace exclusion

This ensures the observation layer remains available during chaos experiments.

PR Type

Bug fix, Enhancement

Description

Implements safety guardrails to prevent disruption of critical Kubernetes services
Adds SafetyConfig model with namespace, pod, and node exclusion rules
Applies safety configuration automatically during config initialization
Filters out protected components before genetic algorithm operates
Includes unit test validating namespace exclusion functionality

Diagram Walkthrough

flowchart LR
  A["Config File"] -->|"read_config_from_file"| B["Parsed Config"]
  B -->|"apply_safety"| C["Mark Protected Components Disabled"]
  C -->|"get_active_components"| D["Active Components Only"]
  D -->|"Genetic Algorithm"| E["Safe Chaos Experiments"]
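
The flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Namespace` and `ClusterComponents` classes here are simplified stand-ins, `apply_safety()` takes a plain list of patterns instead of a `SafetyConfig`, and the body of `get_active_components()` is an assumption, since the PR page does not show it.

```python
import fnmatch
from dataclasses import dataclass, field
from typing import List


@dataclass
class Namespace:
    """Simplified stand-in for the real namespace model."""

    name: str
    disabled: bool = False


@dataclass
class ClusterComponents:
    """Simplified stand-in holding only namespaces."""

    namespaces: List[Namespace] = field(default_factory=list)

    def apply_safety(self, excluded_namespaces: List[str]) -> None:
        # Step 1: mark protected namespaces as disabled; nothing is deleted here.
        for ns in self.namespaces:
            if any(fnmatch.fnmatch(ns.name, p) for p in excluded_namespaces):
                ns.disabled = True

    def get_active_components(self) -> "ClusterComponents":
        # Step 2 (assumed behaviour): return a view containing only enabled items.
        return ClusterComponents([ns for ns in self.namespaces if not ns.disabled])


components = ClusterComponents(
    [Namespace("kube-system"), Namespace("robot-shop"), Namespace("openshift-monitoring")]
)
components.apply_safety(["kube-system", "openshift-*"])
active = components.get_active_components()
active_names = [ns.name for ns in active.namespaces]
```

Keeping the two steps separate means the original component list still records what was protected, while the genetic algorithm only ever sees the filtered result.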

File Walkthrough

Relevant files

Enhancement

safety.py: `New SafetyConfig model for protection rules`
krkn_ai/models/safety.py (+21/-0)
New file defining `SafetyConfig` model with safety guardrails
Default excluded namespaces: `kube-system`, `kube-public`, `kube-node-lease`
Support for excluding pods by labels and name patterns
Support for excluding nodes by labels

cluster_components.py: `Add apply_safety method to ClusterComponents`
krkn_ai/models/cluster_components.py (+57/-0)
Added `apply_safety()` method to mark protected components as disabled
Implements namespace pattern matching using `fnmatch`
Implements pod exclusion by labels and regex name patterns
Implements node exclusion by labels with debug logging

config.py: `Integrate SafetyConfig into ConfigFile model`
krkn_ai/models/config.py (+3/-1)
Added import for `SafetyConfig` model
Added `safety` field to `ConfigFile` with default factory
Safety configuration now part of main configuration structure

cmd.py: `Apply safety guardrails during config initialization`
krkn_ai/cli/cmd.py (+10/-0)
Added safety configuration application after config parsing
Calls `apply_safety()` on cluster components with parsed safety config
Filters to active components before algorithm execution
Added debug logging to verify protected namespaces
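
To make the "default factory" behaviour concrete, here is a hedged sketch of how such a model could look. It uses stdlib dataclasses so it runs without dependencies, whereas the PR's `SafetyConfig` is a pydantic `BaseModel`. The field names come from the review comments on this page; the empty defaults for the pod and node rules, and the reduced `ConfigFile`, are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SafetyConfig:
    """Sketch of the exclusion rules (the real model is a pydantic BaseModel)."""

    # Default namespaces as listed in the file walkthrough.
    excluded_namespaces: List[str] = field(
        default_factory=lambda: ["kube-system", "kube-public", "kube-node-lease"]
    )
    # Assumption: the remaining rule lists default to empty.
    excluded_pod_labels: List[str] = field(default_factory=list)
    excluded_pod_name_patterns: List[str] = field(default_factory=list)
    excluded_node_labels: List[str] = field(default_factory=list)


@dataclass
class ConfigFile:
    """Hypothetical stand-in for the real ConfigFile, reduced to the safety field."""

    # A default factory means a config without a safety section still gets guardrails.
    safety: SafetyConfig = field(default_factory=SafetyConfig)


config = ConfigFile()
```

Using a factory instead of a shared default instance also keeps one config's exclusion lists from leaking into another if a caller mutates them.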

Tests

test_safety.py `Unit test for safety configuration exclusion` tests/test_safety.py New test file validating safety guardrail functionality Tests that default `SafetyConfig` excludes `kube-system` namespace Verifies `get_active_components()` filters protected namespaces correctly	+39/-0

Signed-off-by: Ayush Nagar <ayushnagar2310@gmail.com>

qodo-code-review · 2026-01-28T14:50:33Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance

⚪

Regex DoS

Description: User-configurable regexes in safety.excluded_pod_name_patterns are applied via
re.match(pattern, pod.name) and could cause catastrophic backtracking (ReDoS) and CPU
exhaustion if untrusted or poorly-bounded patterns are supplied.
cluster_components.py [94-145]

Referred Code

def apply_safety(self, safety: SafetyConfig) -> None:
    """
    Apply safety rules by marking protected components as disabled.

    This ensures Krkn-AI never disrupts its own dependencies
    (Prometheus, monitoring stack, control-plane, etc).
    """

    # namespaces & their children
    for namespace in self.namespaces:
        # namespace exclusion
        for pattern in safety.excluded_namespaces:
            if fnmatch.fnmatch(namespace.name, pattern):
                namespace.disabled = True
                logger.info(f"🛡️  Protected namespace: {namespace.name}")
                break

        # skip pod-level checks if namespace disabled
        if namespace.disabled:
            continue



 ... (clipped 31 lines)
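
One possible mitigation for the Regex DoS concern above, sketched under assumptions: match pod names with shell-style globs (as the namespace rules already do via `fnmatch`) and cap the pattern length. Globs are a much smaller pattern language than full regex and cannot express nested quantifiers. The helper name `pod_name_is_protected` and the `MAX_PATTERN_LENGTH` value are hypothetical, and this is not what the PR implements.

```python
import fnmatch
from typing import Iterable

# Hypothetical cap; an oversized pattern is more likely a mistake than a real rule.
MAX_PATTERN_LENGTH = 128


def pod_name_is_protected(pod_name: str, patterns: Iterable[str]) -> bool:
    """Return True if pod_name matches any glob pattern such as 'prometheus-*'."""
    for pattern in patterns:
        if len(pattern) > MAX_PATTERN_LENGTH:
            # Fail loudly instead of silently dropping a protection rule.
            raise ValueError(f"pod name pattern too long: {len(pattern)} characters")
        # fnmatchcase is case-sensitive on every platform, unlike fnmatch.fnmatch.
        if fnmatch.fnmatchcase(pod_name, pattern):
            return True
    return False
```

The trade-off is expressiveness: users who rely on regex features such as alternation would need to list several glob patterns instead.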

Information disclosure

Description: The CLI logs all remaining namespace names after safety filtering, which may expose
cluster topology/resource names into logs accessible to broader audiences than intended.
cmd.py [81-89]

Referred Code

logger.info("Applying safety configuration...")
parsed_config.cluster_components.apply_safety(parsed_config.safety)
parsed_config.cluster_components = (
    parsed_config.cluster_components.get_active_components()
)
# 🔍 DEBUG: verify safety worked
logger.info("Remaining namespaces after safety:")
for ns in parsed_config.cluster_components.namespaces:
    logger.info(" - %s", ns.name)
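
A hedged sketch of one way to address the disclosure concern: log only a count at INFO and keep individual namespace names at DEBUG. The function name `log_safety_summary` and the logger name are hypothetical and not part of the PR.

```python
import logging
from typing import List

# Hypothetical logger name for this sketch.
logger = logging.getLogger("krkn_ai.safety_sketch")


def log_safety_summary(remaining: List[str], total: int) -> None:
    """Log how many namespaces stay eligible, without naming them at INFO."""
    logger.info(
        "Safety applied: %d of %d namespaces remain eligible", len(remaining), total
    )
    # Names are emitted only when DEBUG is explicitly enabled.
    if logger.isEnabledFor(logging.DEBUG):
        for name in remaining:
            logger.debug("Eligible namespace: %s", name)
```

With this split, default log output still confirms that the guardrails ran, while cluster topology only appears when an operator opts into debug logging.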

Ticket Compliance

🟡

🎫 #25

🟢	Prevent disruption of core/dependency services (e.g., Prometheus/monitoring/control-plane) by implementing an exclusion mechanism so they are not selected for disruption.
⚪	Alternatively, implement a retry mechanism when calling external services to capture results reliably.

Codebase Duplication Compliance

⚪

Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance

🟢

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

🔴

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Unhandled regex errors: apply_safety() uses re.match(pattern, pod.name) on externally supplied patterns without
validating/compiling them or catching re.error, which can crash execution on invalid
patterns.

Referred Code

# name pattern exclusion
for pattern in safety.excluded_pod_name_patterns:
    if re.match(pattern, pod.name):
        pod.disabled = True
        logger.debug(
            f"🛡️  Protected pod by pattern: {pod.name}"
        )
        break
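
One way to handle the unhandled `re.error` case is to compile every pattern once, up front, and collect the invalid ones instead of crashing mid-run. The sketch below is illustrative only; `compile_name_patterns` is a hypothetical helper, not the function the PR adds.

```python
from __future__ import annotations

import re


def compile_name_patterns(
    patterns: list[str],
) -> tuple[list[re.Pattern[str]], list[str]]:
    """Compile user-supplied regexes, separating valid from invalid ones."""
    compiled: list[re.Pattern[str]] = []
    rejected: list[str] = []
    for raw in patterns:
        try:
            compiled.append(re.compile(raw))
        except re.error:
            # Invalid syntax: report it to the caller instead of raising later.
            rejected.append(raw)
    return compiled, rejected


compiled, rejected = compile_name_patterns(["^prometheus-", "(unclosed"])
```

Note that `re.match` only matches at the start of the string, so the pattern `prometheus` does not match the pod name `kube-prometheus-0`, whereas `re.search` would. Whichever function is used, the choice changes which pods a given pattern protects.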

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unvalidated external patterns: Safety exclusion inputs (e.g., excluded_pod_name_patterns regex strings and label
expressions) are accepted and executed without validation/sanitization, enabling malformed
patterns or potentially expensive regex evaluation from config input.

Referred Code

# pods
for pod in namespace.pods:
    # label-based exclusion
    for label in safety.excluded_pod_labels:
        if "=" in label:
            key, value = label.split("=", 1)
            if pod.labels.get(key) == value:
                pod.disabled = True
                logger.debug(
                    f"🛡️  Protected pod by label: {pod.name}"
                )
                break

    # name pattern exclusion
    for pattern in safety.excluded_pod_name_patterns:
        if re.match(pattern, pod.name):
            pod.disabled = True
            logger.debug(
                f"🛡️  Protected pod by pattern: {pod.name}"
            )
            break

⚪

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing audit context: New safety-related actions are logged without audit-trail context such as actor/user
identity and explicit outcome fields, making it unclear whether audit requirements are
met.

Referred Code

def apply_safety(self, safety: SafetyConfig) -> None:
    """
    Apply safety rules by marking protected components as disabled.

    This ensures Krkn-AI never disrupts its own dependencies
    (Prometheus, monitoring stack, control-plane, etc).
    """

    # namespaces & their children
    for namespace in self.namespaces:
        # namespace exclusion
        for pattern in safety.excluded_namespaces:
            if fnmatch.fnmatch(namespace.name, pattern):
                namespace.disabled = True
                logger.info(f"🛡️  Protected namespace: {namespace.name}")
                break

        # skip pod-level checks if namespace disabled
        if namespace.disabled:
            continue



 ... (clipped 29 lines)

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Unstructured debug logging: New logs print potentially sensitive environment details (namespace names) using free-form
strings (including emojis) rather than structured fields, so it is unclear whether logging
remains structured and policy-compliant.

Referred Code

logger.info("Applying safety configuration...")
parsed_config.cluster_components.apply_safety(parsed_config.safety)
parsed_config.cluster_components = (
    parsed_config.cluster_components.get_active_components()
)
# 🔍 DEBUG: verify safety worked
logger.info("Remaining namespaces after safety:")
for ns in parsed_config.cluster_components.namespaces:
    logger.info(" - %s", ns.name)

Compliance status legend

🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

qodo-code-review · 2026-01-28T14:51:56Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: High-level
Impact: Medium
Suggestion: Expand default safety rules for monitoring

Expand the default `excluded_namespaces` in `SafetyConfig` to include common monitoring namespaces like `monitoring` and `openshift-monitoring`. This change provides better default protection for the observation layer.

Examples:

krkn_ai/models/safety.py [11-15]

excluded_namespaces: List[str] = [
    "kube-system",
    "kube-public",
    "kube-node-lease",
]

Solution Walkthrough:

Before:

# In krkn_ai/models/safety.py
class SafetyConfig(BaseModel):
    """
    Safety guardrails to prevent Krkn-AI from disrupting services it depends on...
    """
    excluded_namespaces: List[str] = [
        "kube-system",
        "kube-public",
        "kube-node-lease",
    ]
    # ...

After:

# In krkn_ai/models/safety.py
class SafetyConfig(BaseModel):
    """
    Safety guardrails to prevent Krkn-AI from disrupting services it depends on...
    """
    excluded_namespaces: List[str] = [
        "kube-system",
        "kube-public",
        "kube-node-lease",
        "monitoring",
        "openshift-monitoring",
    ]
    # ...

Suggestion importance[1-10]: 8
Why: This suggestion directly enhances the PR's primary goal of protecting the observation layer by making the default safety configuration more robust for common monitoring setups, significantly improving its out-of-the-box effectiveness.

Category: Possible issue
Impact: Medium
Suggestion: ✅ ~~Handle invalid regex patterns~~

Suggestion Impact: Introduced a helper `safe_regex_match()` that wraps regex evaluation in a try/except `re.error` block, logs and skips invalid patterns, and replaced the direct `re.match()` call in pod name exclusion with this safe helper.

code diff:

+def safe_regex_match(pattern: str, value: str) -> bool:
+    """
+    Safely evaluate regex patterns from user config.
+    Prevents crashes and avoids invalid regex execution.
+    """
+    try:
+        return re.search(pattern, value) is not None
+    except re.error as exc:
+        logger.warning(
+            "Invalid regex pattern '%s' skipped: %s",
+            pattern,
+            exc,
+        )
+        return False
+
 class Container(BaseModel):
     name: str
@@ -105,7 +121,7 @@
     for pattern in safety.excluded_namespaces:
         if fnmatch.fnmatch(namespace.name, pattern):
             namespace.disabled = True
-            logger.info(f"🛡️ Protected namespace: {namespace.name}")
+            logger.info("Protected namespace excluded from chaos: %s", namespace.name)
             break

     # skip pod-level checks if namespace disabled
@@ -121,16 +137,16 @@
             if pod.labels.get(key) == value:
                 pod.disabled = True
                 logger.debug(
-                    f"🛡️ Protected pod by label: {pod.name}"
+                    "Protected pod by label: %s", pod.name
                 )
                 break

     # name pattern exclusion
     for pattern in safety.excluded_pod_name_patterns:
-        if re.match(pattern, pod.name):
+        if safe_regex_match(pattern, pod.name):
             pod.disabled = True
             logger.debug(
-                f"🛡️ Protected pod by pattern: {pod.name}"
+                "Protected pod by pattern: %s", pod.name
             )
             break

Add error handling for regex matching in pod name exclusion. Wrap the `re.match` call in a `try/except re.error` block to prevent invalid patterns in the configuration from crashing the program.

krkn_ai/models/cluster_components.py [129-135]

 for pattern in safety.excluded_pod_name_patterns:
-    if re.match(pattern, pod.name):
-        pod.disabled = True
-        logger.debug(
-            f"🛡️ Protected pod by pattern: {pod.name}"
-        )
-        break
+    try:
+        if re.match(pattern, pod.name):
+            pod.disabled = True
+            logger.debug(f"🛡️ Protected pod by pattern: {pod.name}")
+            break
+    except re.error:
+        logger.warning(f"Ignoring invalid pod name regex: {pattern}")
+        continue

`[Suggestion processed]`

Suggestion importance[1-10]: 7
Why: This suggestion correctly points out that an invalid regex pattern in the configuration could crash the application. Adding a `try-except` block makes the code more robust and prevents a misconfiguration from causing a runtime error.

Category: Possible issue
Impact: Low
Suggestion: Skip name-pattern checks for disabled pods

Optimize pod exclusion by adding a check to skip name-pattern matching if a pod has already been disabled by its labels. This prevents redundant processing.

krkn_ai/models/cluster_components.py [116-135]

 for pod in namespace.pods:
     # label-based exclusion
     for label in safety.excluded_pod_labels:
         if "=" in label:
             key, value = label.split("=", 1)
             if pod.labels.get(key) == value:
                 pod.disabled = True
-                logger.debug(
-                    f"🛡️ Protected pod by label: {pod.name}"
-                )
+                logger.debug(f"🛡️ Protected pod by label: {pod.name}")
                 break
+    if pod.disabled:
+        continue # skip name-pattern exclusion for already-disabled pods

     # name pattern exclusion
     for pattern in safety.excluded_pod_name_patterns:
         if re.match(pattern, pod.name):
             pod.disabled = True
-            logger.debug(
-                f"🛡️ Protected pod by pattern: {pod.name}"
-            )
+            logger.debug(f"🛡️ Protected pod by pattern: {pod.name}")
             break

Suggestion importance[1-10]: 4
Why: This is a valid optimization that avoids unnecessary work by skipping name-pattern checks for pods already disabled by label matching. It improves code efficiency, although the performance impact is likely minor.

Category: General
Impact: Low
Suggestion: Enhance pod label exclusion logic

Enhance pod label exclusion to support both `key=value` matching and key-only presence checks. This aligns it with Kubernetes label selectors and improves consistency with node label exclusion logic.

krkn_ai/models/cluster_components.py [117-126]

 # label-based exclusion
 for label in safety.excluded_pod_labels:
     if "=" in label:
         key, value = label.split("=", 1)
         if pod.labels.get(key) == value:
             pod.disabled = True
             logger.debug(
                 f"🛡️ Protected pod by label: {pod.name}"
             )
             break
+    else:
+        if label in pod.labels:
+            pod.disabled = True
+            logger.debug(
+                f"🛡️ Protected pod by label key: {pod.name}"
+            )
+            break

Suggestion importance[1-10]: 6
Why: This suggestion correctly identifies an inconsistency between pod and node label exclusion logic. Implementing this change would enhance the feature's flexibility and provide a more intuitive and consistent user experience.

Category: General
Impact: Low
Suggestion: Enhance node label exclusion logic

Enhance node label exclusion to support both `key=value` matching and key-only presence checks. This change increases control and aligns the logic with the pod exclusion mechanism for better consistency.

krkn_ai/models/cluster_components.py [137-143]

 # nodes
 for node in self.nodes:
     for label in safety.excluded_node_labels:
-        if label in node.labels:
-            node.disabled = True
-            logger.info(f"🛡️ Protected node: {node.name}")
-            break
+        if "=" in label:
+            key, value = label.split("=", 1)
+            if node.labels.get(key) == value:
+                node.disabled = True
+                logger.info(f"🛡️ Protected node by label: {node.name}")
+                break
+        else:
+            if label in node.labels:
+                node.disabled = True
+                logger.info(f"🛡️ Protected node by label key: {node.name}")
+                break

Suggestion importance[1-10]: 6
Why: This suggestion correctly identifies an inconsistency between node and pod label exclusion logic. Implementing this change would improve the feature by providing more granular control and creating a more consistent and intuitive API for users.
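
The two label suggestions above ask for the same rule on pods and nodes, so a single shared matcher is one way to keep them consistent. The following is a minimal sketch under that assumption; `labels_match_any` is a hypothetical helper that does not appear in the PR.

```python
from typing import Iterable, Mapping


def labels_match_any(labels: Mapping[str, str], rules: Iterable[str]) -> bool:
    """Return True if any rule matches.

    A rule is either 'key=value' (exact value match) or 'key' (presence check).
    """
    for rule in rules:
        key, sep, value = rule.partition("=")
        if sep:
            # 'key=value' form: the label must exist with exactly this value.
            if labels.get(key) == value:
                return True
        elif key in labels:
            # Key-only form: any value counts as a match.
            return True
    return False


pod_labels = {"app": "prometheus", "tier": "monitoring"}
node_labels = {"node-role.kubernetes.io/master": ""}
```

Both the pod loop and the node loop could then call the same function, for example `labels_match_any(pod.labels, safety.excluded_pod_labels)`, which removes the asymmetry the two suggestions describe.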
Signed-off-by: Ayush Nagar <ayushnagar2310@gmail.com>

Prevent disruption of core services using safety guardrails

0bd9d1a

Signed-off-by: Ayush Nagar <ayushnagar2310@gmail.com>

Ayushdk requested a review from rh-rahulshetty as a code owner January 28, 2026 14:49

qodo-code-review bot added the Review effort 3/5 label Jan 28, 2026

Ayushdk added 2 commits January 28, 2026 17:43

Prevent disruption of core services via safety exclusions

6c96ba6

Signed-off-by: Ayush Nagar <ayushnagar2310@gmail.com>

style:format code using ruff

7ef4799

Signed-off-by: Ayush Nagar <ayushnagar2310@gmail.com>

Ayushdk mentioned this pull request Jan 30, 2026

added retry with backoff #133

Open

Ayushdk wants to merge 3 commits into krkn-chaos:main from Ayushdk:fix/prevent-core-service-disruption

1 participant
