Skip to content

Prevent indefinite hang during chaos execution when Kubernetes API is unreachable#124

Open
WHOIM1205 wants to merge 1 commit intokrkn-chaos:mainfrom
WHOIM1205:fix/run-shell-timeout
Open

Prevent indefinite hang during chaos execution when Kubernetes API is unreachable#124
WHOIM1205 wants to merge 1 commit intokrkn-chaos:mainfrom
WHOIM1205:fix/run-shell-timeout

Conversation

@WHOIM1205
Copy link

@WHOIM1205 WHOIM1205 commented Jan 26, 2026

User description

Fix: Prevent indefinite hang during chaos execution when Kubernetes API is unreachable

Summary

This PR fixes a critical issue where krkn-ai could hang indefinitely while executing chaos scenarios if the Kubernetes API becomes slow or unreachable.

The root cause was an unbounded subprocess execution in run_shell() that waited forever for command output or process exit. Under real Kubernetes failure conditions, krknctl may never return, causing krkn-ai to block permanently with no recovery.

This change introduces a bounded timeout for chaos execution and ensures failures are handled gracefully.


Problem

run_shell() executed krknctl / podman using subprocess.Popen() with:

  • No timeout
  • Blocking reads on stdout
  • A blocking process.wait()

If the Kubernetes API is unreachable or the target resources are deleted mid-execution, krknctl can hang indefinitely, causing krkn-ai to stall forever.

This occurs during exactly the kinds of instability chaos testing is designed to introduce.


Why This Is Critical

Impact before fix

  • Chaos runs hang indefinitely
  • CI/CD pipelines stall until externally killed
  • No cleanup of chaos artifacts
  • No fitness results returned (silent failure)

Affected users

  • SREs running chaos in CI pipelines
  • Reliability engineers using continuous chaos testing
  • Any user running chaos during cluster instability

Root Cause

run_shell() assumed subprocesses would always complete in bounded time.

This assumption breaks when:

  • Kubernetes API is slow or unreachable
  • Network partitions occur
  • Target nodes or namespaces are deleted mid-execution

krknctl --wait-duration does not cover the initial API connection phase, so it does not prevent this hang.


Fix

1. Add timeout support to run_shell()

  • Introduced an optional timeout parameter
  • Subprocess is terminated if it exceeds the timeout
  • Returns -1 on timeout with partial logs preserved

2. Bound chaos execution time

  • krkn_runner now passes:

PR Type

Bug fix, Tests


Description

  • Add timeout parameter to run_shell() to prevent indefinite hangs

  • Implement threading-based timeout with graceful termination and SIGKILL fallback

  • Return exit code -1 on timeout for proper error handling

  • Pass execution timeout from krkn_runner with 120-second buffer

  • Add comprehensive unit tests for timeout and edge cases


Diagram Walkthrough

flowchart LR
  A["run_shell function"] -->|"add timeout parameter"| B["Threading Timer"]
  B -->|"timeout expires"| C["Terminate process"]
  C -->|"graceful fail"| D["Return -1"]
  C -->|"force kill"| D
  E["krkn_runner"] -->|"calculate timeout"| F["wait_duration + 120s"]
  F -->|"pass to run_shell"| A
  G["Unit tests"] -->|"verify timeout behavior"| H["Test suite"]
Loading

File Walkthrough

Relevant files
Bug fix
krkn_runner.py
Add timeout parameter to chaos execution                                 

krkn_ai/chaos_engines/krkn_runner.py

  • Calculate execution timeout as wait_duration + 120 seconds buffer
  • Pass timeout parameter to run_shell() call
  • Handle return code -1 separately for timeout vs misconfiguration
  • Log timeout errors distinctly from misconfiguration failures
+19/-7   
__init__.py
Implement timeout mechanism in run_shell function               

krkn_ai/utils/init.py

  • Add optional timeout parameter to run_shell() function
  • Implement threading.Timer to enforce timeout on subprocess execution
  • Gracefully terminate process on timeout, then force kill if needed
  • Return exit code -1 when timeout occurs
  • Preserve partial logs even when timeout is triggered
  • Add comprehensive docstring with parameter descriptions
+52/-10 
Tests
test_run_shell.py
Add comprehensive unit tests for run_shell timeout             

tests/unit/utils/test_run_shell.py

  • Add 118 lines of comprehensive unit tests for run_shell() function
  • Test basic functionality: successful commands, failures, multiline
    output
  • Test timeout behavior: fast completion, timeout expiration, partial
    output
  • Test logging suppression with do_not_log parameter
  • Test edge cases: empty output, special characters, large output
+118/-0 

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@qodo-code-review
Copy link

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Missing input validation: run_shell() can raise an unhandled exception on empty/whitespace command because it
unconditionally indexes command_list[0] after shlex.split().

Referred Code
def run_shell(command, do_not_log=False, timeout: Optional[int] = None):
    """
    Run shell command and get logs and statuscode in output.

    Args:
        command: Shell command to execute
        do_not_log: If True, suppress debug logging of command output
        timeout: Maximum execution time in seconds. If None, no timeout is applied.
                 If the timeout expires, the process is terminated and return code -1 is returned.

    Returns:
        Tuple of (logs, returncode). On timeout, returncode is -1.
    """
    logs = ""
    command_list = shlex.split(command)
    # Let's show the command name being executed
    logger.debug("Running command: %s", command_list[0])

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Potential sensitive logs: The new timeout/error logs and existing per-line stdout debug logging could capture
sensitive command output depending on what subprocesses emit, which cannot be validated
from the diff alone.

Referred Code
# Use a timer to enforce timeout since we're reading stdout line-by-line
timed_out = threading.Event()

def kill_on_timeout():
    timed_out.set()
    logger.error(
        "Command timed out after %d seconds, terminating: %s", timeout, command_list[0]
    )
    process.terminate()
    # Give process time to terminate gracefully, then force kill
    try:
        process.wait(timeout=5)
    except subprocess.TimeoutExpired:
        logger.warning("Process did not terminate gracefully, sending SIGKILL")
        process.kill()

timer = None
if timeout is not None:
    timer = threading.Timer(timeout, kill_on_timeout)
    timer.start()



 ... (clipped 15 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Timeout value validation: execution_timeout is derived directly from self.config.wait_duration without visible
validation (e.g., None/negative/non-int), which could lead to unexpected timeout behavior
at runtime.

Referred Code
# Timeout = wait_duration + 120 seconds buffer for initialization/teardown
execution_timeout = self.config.wait_duration + 120
log, returncode = run_shell(
    self.process_es_env_string(command, True),
    do_not_log=not is_verbose(),
    timeout=execution_timeout,
)

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review
Copy link

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Kill entire process group on timeout

Modify the subprocess call to start it in a new process group and update the
timeout logic to terminate the entire group, ensuring all child processes are
also killed.

krkn_ai/utils/init.py [36-54]

+import os, signal
+
 process = subprocess.Popen(
-    command_list, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
+    command_list,
+    stdout=subprocess.PIPE,
+    stderr=subprocess.STDOUT,
+    text=True,
+    preexec_fn=os.setsid
 )
 ...
 def kill_on_timeout():
     timed_out.set()
     logger.error(
         "Command timed out after %d seconds, terminating: %s", timeout, command_list[0]
     )
-    process.terminate()
-    # Give process time to terminate gracefully, then force kill
+    # Terminate the entire process group
+    os.killpg(os.getpgid(process.pid), signal.SIGTERM)
     try:
         process.wait(timeout=5)
     except subprocess.TimeoutExpired:
         logger.warning("Process did not terminate gracefully, sending SIGKILL")
-        process.kill()
+        os.killpg(os.getpgid(process.pid), signal.SIGKILL)

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 8

__

Why: This is a robust improvement that prevents orphaned child processes by terminating the entire process group on timeout, which is a common issue when killing shell commands that may spawn their own children.

Medium
General
Avoid immediate kill for zero timeout

Change the condition for starting the timeout timer to timeout > 0 to prevent
immediate process termination when the timeout value is zero.

krkn_ai/utils/init.py [57-59]

-if timeout is not None:
+if timeout is not None and timeout > 0:
     timer = threading.Timer(timeout, kill_on_timeout)
     timer.start()
  • Apply / Chat
Suggestion importance[1-10]: 5

__

Why: This suggestion correctly identifies that a timeout of 0 would cause immediate process termination and proposes a sensible fix, improving the function's robustness against edge-case input values.

Low
  • More

Copy link
Collaborator

@rh-rahulshetty rh-rahulshetty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a small observation with how we enable this feature. Implementation wise looks great!


# Run command (show logs when verbose mode is enabled)
# Timeout = wait_duration + 120 seconds buffer for initialization/teardown
execution_timeout = self.config.wait_duration + 120
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we should disable it by default and let end-user configure it per their use-case. Krkn Scenarios can take over 240 seconds duration depending on the scenario itself. If we start restricting this here without knowing the user application setup or the cluster details and type of scenarios they plan to run, then we might end up killing Krkn-AI tests half way through the scenario execution itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants