Skip to content

Fix zombie health check watcher threads on krkn runner failures#112

Open
WHOIM1205 wants to merge 1 commit intokrkn-chaos:mainfrom
WHOIM1205:fix/zombie-threads-runner
Open

Fix zombie health check watcher threads on krkn runner failures#112
WHOIM1205 wants to merge 1 commit intokrkn-chaos:mainfrom
WHOIM1205:fix/zombie-threads-runner

Conversation

@WHOIM1205
Copy link

@WHOIM1205 WHOIM1205 commented Jan 23, 2026

User description

Fix: Zombie Background Thread Leak in krkn_runner

Summary

This PR fixes a critical background thread leak in krkn_runner.py where the health_check_watcher thread could remain alive if the krkn subprocess crashes, hangs, or raises an exception.
The issue causes zombie threads, resource leaks, and non-terminating krkn-ai run executions in real production environments.

The fix ensures that all background threads are always stopped, even when failures occur.


Problem Description

File: krkn_ai/runner/krkn_runner.py
Component: health_check_watcher lifecycle management

The runner starts a background health check watcher thread before launching the krkn subprocess.
However, if the subprocess crashes or exits unexpectedly, the watcher thread is never stopped.

Root Cause

The watcher lifecycle was not guarded by try...finally, so any exception or early exit skips the cleanup path.

This leads to:

  • Orphaned threads
  • Hung CLI execution
  • Resource exhaustion over multiple runs
  • CI pipelines never completing

Why This Is Critical

  • Affects real users running krkn-ai run in staging and production clusters
  • Causes silent hangs instead of clean failures
  • Makes chaos experiments non-deterministic
  • Breaks automation, CI jobs, and long-running discovery runs
  • Accumulates leaked threads over repeated executions

This is not theoretical — subprocess crashes and SIGINT/SIGTERM are common in chaos workflows.


How to Reproduce (Realistic)

  1. Run krkn-ai run with a valid scenario.
  2. Force the krkn subprocess to crash or exit early:
    • Kill the process
    • Inject an invalid scenario
    • Trigger a runtime exception
  3. Observe:
    • Main process exits
    • Background health_check_watcher thread remains alive
    • CLI hangs or never fully terminates

Fix Applied

  • Wrapped the runner execution logic in a try...finally block
  • Ensured health_check_watcher.stop() is always called
  • Guarantees cleanup on:
    • Exceptions
    • Subprocess crashes
    • User interrupts (SIGINT / SIGTERM)

This preserves existing behavior while making cleanup deterministic.



Impact After Fix

  • No more zombie background threads
  • Clean shutdowns on failures
  • Stable CI and automation
  • Predictable chaos experiment behavior
  • Improved reliability for long-running discovery runs

Files Changed

  • krkn_ai/runner/krkn_runner.py
  • tests/test_krkn_runner_leak.py
  • walkthrough.md (documentation)

Regression Test Added

New test: tests/test_krkn_runner_leak.py

The test:

  • Simulates a krkn subprocess failure
  • Verifies the watcher thread is stopped
  • Fails on current main, passes with this fix

This prevents future regressions.
image

Severity

High — Production impacting



PR Type

Bug fix, Tests


Description

  • Wrap runner execution in try-finally block to ensure cleanup

  • Prevent zombie health_check_watcher threads on subprocess failures

  • Add regression test simulating subprocess crash scenarios

  • Guarantee thread cleanup on exceptions, crashes, and interrupts


Diagram Walkthrough

flowchart LR
  A["health_check_watcher.run()"] --> B["try block"]
  B --> C["run_shell command"]
  C --> D["Extract returncode"]
  D --> E["finally block"]
  C -->|Exception| E
  E --> F["health_check_watcher.stop()"]
  F --> G["Cleanup guaranteed"]
Loading

File Walkthrough

Relevant files
Bug fix
krkn_runner.py
Add try-finally for guaranteed watcher cleanup                     

krkn_ai/chaos_engines/krkn_runner.py

  • Wrapped runner execution logic in try-finally block
  • Moved health_check_watcher.stop() to finally clause
  • Ensures cleanup on subprocess crashes, exceptions, and interrupts
  • Preserves existing behavior while making cleanup deterministic
+11/-9   
Tests
test_krkn_runner_leak.py
Add test for health check watcher cleanup                               

tests/test_krkn_runner_leak.py

  • New regression test for thread leak prevention
  • Mocks HealthCheckWatcher and run_shell to simulate subprocess failure
  • Verifies stop() is called even when run_shell raises exception
  • Prevents future regressions of zombie thread issues
+58/-0   

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@qodo-code-review
Copy link

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Broad exception catch: The new test catches a broad Exception and only prints it, which can mask unexpected
failure modes and does not assert the expected exception behavior explicitly.

Referred Code
print("\n[TEST] Executing runner.run() with failing run_shell...")
try:
    runner.run(scenario, 1)
except Exception as e:
    print(f"[INFO] Caught expected exception: {e}")

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Exception printed to stdout: The test prints the caught exception message to stdout, which could expose internal error
details into CI logs depending on what exceptions include in real runs.

Referred Code
print("\n[TEST] Executing runner.run() with failing run_shell...")
try:
    runner.run(scenario, 1)
except Exception as e:
    print(f"[INFO] Caught expected exception: {e}")

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Unstructured console output: The test introduces multiple print(...) statements which produce unstructured log output
and may inadvertently include sensitive details if exception messages change.

Referred Code
print("\n[TEST] Executing runner.run() with failing run_shell...")
try:
    runner.run(scenario, 1)
except Exception as e:
    print(f"[INFO] Caught expected exception: {e}")

# Verification
print("[VERIFY] Checking if HealthCheckWatcher.stop() was called...")

# Assert that stop was called
mock_watcher.stop.assert_called_once()
print("[SUCCESS] stop() WAS called. Thread leak prevented.")

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review
Copy link

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Ensure cleanup on watcher startup failure

Move the health_check_watcher.run() call inside the try block. This ensures
health_check_watcher.stop() is always called in the finally block, even if the
watcher fails to start, preventing potential resource leaks.

krkn_ai/chaos_engines/krkn_runner.py [115-130]

-# Start watching application urls for health checks
-health_check_watcher.run()
+try:
+    # Start watching application urls for health checks
+    health_check_watcher.run()
 
-try:
     # Run command (show logs when verbose mode is enabled)
     log, returncode = run_shell(
         self.process_es_env_string(command, True), do_not_log=not is_verbose()
     )
 
     # Extract return code from run log which is part of telemetry data present in the log
     returncode, run_uuid = self.__extract_returncode_from_run(log, returncode)
     logger.info("Krkn scenario return code: %d", returncode)
 
 finally:
     # Stop watching application urls for health checks
     health_check_watcher.stop()
  • Apply / Chat
Suggestion importance[1-10]: 8

__

Why: This suggestion correctly identifies a resource leak edge case where health_check_watcher.run() could fail, which is not covered by the PR. Applying this change makes the cleanup logic more robust, fully addressing the class of issue the PR intends to fix.

Medium
General
Use idiomatic exception testing pattern

In test_run_shell_exception_cleanup, replace the try...except block with the
with self.assertRaises(Exception): context manager. This ensures the test
correctly fails if the expected exception is not raised, making the test more
robust.

tests/test_krkn_runner_leak.py [41-51]

 print("\n[TEST] Executing runner.run() with failing run_shell...")
-try:
+with self.assertRaises(Exception) as cm:
     runner.run(scenario, 1)
-except Exception as e:
-    print(f"[INFO] Caught expected exception: {e}")
+print(f"[INFO] Caught expected exception: {cm.exception}")
     
 # Verification
 print("[VERIFY] Checking if HealthCheckWatcher.stop() was called...")
 
 # Assert that stop was called
 mock_watcher.stop.assert_called_once()
  • Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly points out a flaw in the new test case and proposes using the idiomatic self.assertRaises context manager. This improves the test's reliability by ensuring it fails if the expected exception is not raised.

Medium
  • More

Copy link
Collaborator

@rh-rahulshetty rh-rahulshetty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants