Prevent indefinite hang during chaos execution when Kubernetes API is unreachable by WHOIM1205 · Pull Request #124 · krkn-chaos/krkn-ai

WHOIM1205 · 2026-01-26T19:51:49Z

User description

Fix: Prevent indefinite hang during chaos execution when Kubernetes API is unreachable

Summary

This PR fixes a critical issue where krkn-ai could hang indefinitely while executing chaos scenarios if the Kubernetes API becomes slow or unreachable.

The root cause was an unbounded subprocess execution in run_shell() that waited forever for command output or process exit. Under real Kubernetes failure conditions, krknctl may never return, causing krkn-ai to block permanently with no recovery.

This change introduces a bounded timeout for chaos execution and ensures failures are handled gracefully.

Problem

run_shell() executed krknctl / podman using subprocess.Popen() with:

No timeout
Blocking reads on stdout
A blocking process.wait()

If the Kubernetes API is unreachable or the target resources are deleted mid-execution, krknctl can hang indefinitely, causing krkn-ai to stall forever.

This occurs during exactly the kinds of instability chaos testing is designed to introduce.

Why This Is Critical

Impact before fix

Chaos runs hang indefinitely
CI/CD pipelines stall until externally killed
No cleanup of chaos artifacts
No fitness results returned (silent failure)

Affected users

SREs running chaos in CI pipelines
Reliability engineers using continuous chaos testing
Any user running chaos during cluster instability

Root Cause

run_shell() assumed subprocesses would always complete in bounded time.

This assumption breaks when:

Kubernetes API is slow or unreachable
Network partitions occur
Target nodes or namespaces are deleted mid-execution

krknctl --wait-duration does not cover the initial API connection phase, so it does not prevent this hang.

Fix

1. Add timeout support to `run_shell()`

Introduced an optional timeout parameter
Subprocess is terminated if it exceeds the timeout
Returns -1 on timeout with partial logs preserved

2. Bound chaos execution time

krkn_runner now passes:

PR Type

Bug fix, Tests

Description

Add timeout parameter to run_shell() to prevent indefinite hangs
Implement threading-based timeout with graceful termination and SIGKILL fallback
Return exit code -1 on timeout for proper error handling
Pass execution timeout from krkn_runner with 120-second buffer
Add comprehensive unit tests for timeout and edge cases

Diagram Walkthrough

flowchart LR
  A["run_shell function"] -->|"add timeout parameter"| B["Threading Timer"]
  B -->|"timeout expires"| C["Terminate process"]
  C -->|"graceful fail"| D["Return -1"]
  C -->|"force kill"| D
  E["krkn_runner"] -->|"calculate timeout"| F["wait_duration + 120s"]
  F -->|"pass to run_shell"| A
  G["Unit tests"] -->|"verify timeout behavior"| H["Test suite"]

File Walkthrough

Relevant files

Bug fix

krkn_runner.py `Add timeout parameter to chaos execution` krkn_ai/chaos_engines/krkn_runner.py Calculate execution timeout as `wait_duration + 120` seconds buffer Pass `timeout` parameter to `run_shell()` call Handle return code -1 separately for timeout vs misconfiguration Log timeout errors distinctly from misconfiguration failures	+19/-7
__init__.py `Implement timeout mechanism in run_shell function` krkn_ai/utils/init.py Add optional `timeout` parameter to `run_shell()` function Implement threading.Timer to enforce timeout on subprocess execution Gracefully terminate process on timeout, then force kill if needed Return exit code -1 when timeout occurs Preserve partial logs even when timeout is triggered Add comprehensive docstring with parameter descriptions	+52/-10

Tests

test_run_shell.py `Add comprehensive unit tests for run_shell timeout` tests/unit/utils/test_run_shell.py Add 118 lines of comprehensive unit tests for `run_shell()` function Test basic functionality: successful commands, failures, multiline output Test timeout behavior: fast completion, timeout expiration, partial output Test logging suppression with `do_not_log` parameter Test edge cases: empty output, special characters, large output	+118/-0

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>

qodo-code-review · 2026-01-26T19:52:26Z

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢	No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
⚪	🎫 No ticket provided Create ticket/issue
Codebase Duplication Compliance
⚪	Codebase context is not defined Follow the guide to enable codebase context checks.
Custom Compliance
🟢	Generic: Comprehensive Audit Trails Objective: To create a detailed and reliable record of critical system actions for security analysis and compliance. Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Meaningful Naming and Self-Documenting Code Objective: Ensure all identifiers clearly express their purpose and intent, making code self-documenting Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Secure Error Handling Objective: To prevent the leakage of sensitive system information through error messages while providing sufficient detail for internal debugging. Status: Passed Learn more about managing compliance generic rules or creating your own custom rules
🔴	Generic: Robust Error Handling and Edge Case Management Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation Status: Missing input validation: `run_shell()` can raise an unhandled exception on empty/whitespace `command` because it unconditionally indexes `command_list[0]` after `shlex.split()`. Referred Code def run_shell(command, do_not_log=False, timeout: Optional[int] = None): """ Run shell command and get logs and statuscode in output. Args: command: Shell command to execute do_not_log: If True, suppress debug logging of command output timeout: Maximum execution time in seconds. If None, no timeout is applied. If the timeout expires, the process is terminated and return code -1 is returned. Returns: Tuple of (logs, returncode). On timeout, returncode is -1. """ logs = "" command_list = shlex.split(command) # Let's show the command name being executed logger.debug("Running command: %s", command_list[0]) Learn more about managing compliance generic rules or creating your own custom rules
⚪	Generic: Secure Logging Practices Objective: To ensure logs are useful for debugging and auditing without exposing sensitive information like PII, PHI, or cardholder data. Status: Potential sensitive logs: The new timeout/error logs and existing per-line stdout debug logging could capture sensitive command output depending on what subprocesses emit, which cannot be validated from the diff alone. Referred Code # Use a timer to enforce timeout since we're reading stdout line-by-line timed_out = threading.Event() def kill_on_timeout(): timed_out.set() logger.error( "Command timed out after %d seconds, terminating: %s", timeout, command_list[0] ) process.terminate() # Give process time to terminate gracefully, then force kill try: process.wait(timeout=5) except subprocess.TimeoutExpired: logger.warning("Process did not terminate gracefully, sending SIGKILL") process.kill() timer = None if timeout is not None: timer = threading.Timer(timeout, kill_on_timeout) timer.start() ... (clipped 15 lines) Learn more about managing compliance generic rules or creating your own custom rules
	Generic: Security-First Input Validation and Data Handling Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent vulnerabilities Status: Timeout value validation: `execution_timeout` is derived directly from `self.config.wait_duration` without visible validation (e.g., None/negative/non-int), which could lead to unexpected timeout behavior at runtime. Referred Code # Timeout = wait_duration + 120 seconds buffer for initialization/teardown execution_timeout = self.config.wait_duration + 120 log, returncode = run_shell( self.process_es_env_string(command, True), do_not_log=not is_verbose(), timeout=execution_timeout, ) Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend

🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

qodo-code-review · 2026-01-26T19:53:25Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Kill entire process group on timeout Modify the subprocess call to start it in a new process group and update the timeout logic to terminate the entire group, ensuring all child processes are also killed. krkn_ai/utils/init.py [36-54] +import os, signal + process = subprocess.Popen( - command_list, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True + command_list, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + text=True, + preexec_fn=os.setsid ) ... def kill_on_timeout(): timed_out.set() logger.error( "Command timed out after %d seconds, terminating: %s", timeout, command_list[0] ) - process.terminate() - # Give process time to terminate gracefully, then force kill + # Terminate the entire process group + os.killpg(os.getpgid(process.pid), signal.SIGTERM) try: process.wait(timeout=5) except subprocess.TimeoutExpired: logger.warning("Process did not terminate gracefully, sending SIGKILL") - process.kill() + os.killpg(os.getpgid(process.pid), signal.SIGKILL) `[To ensure code accuracy, apply this suggestion manually]` Suggestion importance[1-10]: 8 __ Why: This is a robust improvement that prevents orphaned child processes by terminating the entire process group on timeout, which is a common issue when killing shell commands that may spawn their own children.	Medium
General	Avoid immediate kill for zero timeout Change the condition for starting the timeout timer to `timeout > 0` to prevent immediate process termination when the timeout value is zero. krkn_ai/utils/init.py [57-59] -if timeout is not None: +if timeout is not None and timeout > 0: timer = threading.Timer(timeout, kill_on_timeout) timer.start() Apply / Chat Suggestion importance[1-10]: 5 __ Why: This suggestion correctly identifies that a timeout of 0 would cause immediate process termination and proposes a sensible fix, improving the function's robustness against edge-case input values.	Low
More

rh-rahulshetty

Added a small observation with how we enable this feature. Implementation wise looks great!

rh-rahulshetty · 2026-01-31T11:45:35Z

krkn_ai/chaos_engines/krkn_runner.py


            # Run command (show logs when verbose mode is enabled)
+            # Timeout = wait_duration + 120 seconds buffer for initialization/teardown
+            execution_timeout = self.config.wait_duration + 120


I feel we should disable it by default and let end-user configure it per their use-case. Krkn Scenarios can take over 240 seconds duration depending on the scenario itself. If we start restricting this here without knowing the user application setup or the cluster details and type of scenarios they plan to run, then we might end up killing Krkn-AI tests half way through the scenario execution itself.

Add timeout to run_shell to prevent indefinite hangs

1173ffd

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>

WHOIM1205 requested a review from rh-rahulshetty as a code owner January 26, 2026 19:51

qodo-code-review bot added the Review effort 3/5 label Jan 26, 2026

rh-rahulshetty requested changes Jan 31, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent indefinite hang during chaos execution when Kubernetes API is unreachable#124

Prevent indefinite hang during chaos execution when Kubernetes API is unreachable#124
WHOIM1205 wants to merge 1 commit intokrkn-chaos:mainfrom
WHOIM1205:fix/run-shell-timeout

WHOIM1205 commented Jan 26, 2026 •

edited by qodo-code-review bot

Loading

Uh oh!

qodo-code-review bot commented Jan 26, 2026

Uh oh!

qodo-code-review bot commented Jan 26, 2026

Uh oh!

rh-rahulshetty left a comment

Uh oh!

rh-rahulshetty Jan 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

WHOIM1205 commented Jan 26, 2026 • edited by qodo-code-review bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Fix: Prevent indefinite hang during chaos execution when Kubernetes API is unreachable

Summary

Problem

Why This Is Critical

Root Cause

Fix

1. Add timeout support to run_shell()

2. Bound chaos execution time

PR Type

Description

Diagram Walkthrough

File Walkthrough

Uh oh!

qodo-code-review bot commented Jan 26, 2026

PR Compliance Guide 🔍

Uh oh!

qodo-code-review bot commented Jan 26, 2026

PR Code Suggestions ✨

Uh oh!

rh-rahulshetty left a comment

Choose a reason for hiding this comment

Uh oh!

rh-rahulshetty Jan 31, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

WHOIM1205 commented Jan 26, 2026 •

edited by qodo-code-review bot

Loading

1. Add timeout support to `run_shell()`