Fix zombie health check watcher threads on krkn runner failures #112
WHOIM1205 wants to merge 1 commit into krkn-chaos:main
Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
User description
Fix: Zombie Background Thread Leak in `krkn_runner`
Summary
This PR fixes a critical background thread leak in `krkn_runner.py` where the `health_check_watcher` thread could remain alive if the krkn subprocess crashes, hangs, or raises an exception.
The issue causes zombie threads, resource leaks, and non-terminating `krkn-ai run` executions in real production environments.
The fix ensures that all background threads are always stopped, even when failures occur.
Problem Description
File: `krkn_ai/runner/krkn_runner.py`
Component: `health_check_watcher` lifecycle management
The runner starts a background health check watcher thread before launching the krkn subprocess.
However, if the subprocess crashes or exits unexpectedly, the watcher thread is never stopped.
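The failure mode described above can be illustrated with a minimal sketch. The `HealthCheckWatcher` class, `run_unguarded`, and `crashing_launch` are hypothetical stand-ins written for this illustration; they are not taken from the krkn-ai code base, and the real watcher may be structured differently.

```python
import threading


class HealthCheckWatcher:
    """Hypothetical stand-in for the real watcher: polls until stopped."""

    def __init__(self, interval=0.05):
        self._interval = interval
        self._stop_event = threading.Event()
        # daemon=True only so this demo cannot block interpreter exit
        self._thread = threading.Thread(target=self._poll, daemon=True)

    def _poll(self):
        # Event.wait returns False on timeout and True once stop() sets it
        while not self._stop_event.wait(self._interval):
            pass  # a real watcher would probe health endpoints here

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()

    def is_alive(self):
        return self._thread.is_alive()


def crashing_launch():
    """Hypothetical launcher that simulates a krkn subprocess crash."""
    raise RuntimeError("simulated krkn subprocess crash")


def run_unguarded(watcher, launch):
    """Unguarded pattern: cleanup is skipped whenever launch() raises."""
    watcher.start()
    launch()
    watcher.stop()  # never reached if launch() raised
```

Calling `run_unguarded(HealthCheckWatcher(), crashing_launch)` propagates the `RuntimeError` and leaves the polling thread running, which is the kind of leak this PR describes.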
Root Cause
The watcher lifecycle was not guarded by `try...finally`, so any exception or early exit skips the cleanup path.
This leads to:
Why This Is Critical
`krkn-ai run` in staging and production clusters
This is not theoretical: subprocess crashes and SIGINT/SIGTERM are common in chaos workflows.
How to Reproduce (Realistic)
`krkn-ai run` with a valid scenario.
`health_check_watcher` thread remains alive
Fix Applied
`try...finally` block
`health_check_watcher.stop()` is always called
This preserves existing behavior while making cleanup deterministic.
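As a rough sketch of the guarded pattern, assuming a watcher object with `start()` and `stop()` methods: the `HealthCheckWatcher` class and `run_guarded` helper below are hypothetical names introduced for illustration and do not reproduce the actual diff in this PR.

```python
import threading


class HealthCheckWatcher:
    """Hypothetical stand-in for the real watcher: polls until stopped."""

    def __init__(self, interval=0.05):
        self._interval = interval
        self._stop_event = threading.Event()
        # daemon=True only so this demo cannot block interpreter exit
        self._thread = threading.Thread(target=self._poll, daemon=True)

    def _poll(self):
        # Loop until stop() sets the event
        while not self._stop_event.wait(self._interval):
            pass  # a real watcher would probe health endpoints here

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()

    def is_alive(self):
        return self._thread.is_alive()


def run_guarded(watcher, launch):
    """Start the watcher, run launch(), and always stop the watcher."""
    watcher.start()  # started before the try so stop() only follows a start
    try:
        return launch()
    finally:
        # Runs on normal return, on exceptions, and on KeyboardInterrupt
        watcher.stop()
```

With this shape, an exception raised by `launch()` still propagates to the caller, but only after the watcher thread has been stopped and joined. Note that `finally` covers `KeyboardInterrupt` (Python's default SIGINT behavior), whereas SIGTERM terminates the process by default without running `finally` unless a signal handler is installed.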
Impact After Fix
Files Changed
`krkn_ai/runner/krkn_runner.py`
`tests/test_krkn_runner_leak.py`
`walkthrough.md` (documentation)
Regression Test Added
New test: `tests/test_krkn_runner_leak.py`
The test:
`main`, passes with this fix
This prevents future regressions.
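A regression test along these lines might look like the following sketch. It does not reproduce the contents of `tests/test_krkn_runner_leak.py`; the `run_with_watcher` helper and the test names are hypothetical, and the watcher and launcher are replaced with `unittest.mock` objects so no real subprocess or thread is involved.

```python
import unittest
from unittest import mock


def run_with_watcher(watcher, launch):
    """Hypothetical runner helper using the try...finally cleanup pattern."""
    watcher.start()
    try:
        return launch()
    finally:
        watcher.stop()


class WatcherCleanupTest(unittest.TestCase):
    def test_stop_called_when_launch_raises(self):
        watcher = mock.MagicMock()
        # side_effect makes the mock raise when called, simulating a crash
        launch = mock.Mock(side_effect=RuntimeError("simulated crash"))

        with self.assertRaises(RuntimeError):
            run_with_watcher(watcher, launch)

        watcher.start.assert_called_once_with()
        watcher.stop.assert_called_once_with()

    def test_stop_called_on_success(self):
        watcher = mock.MagicMock()
        launch = mock.Mock(return_value=0)

        self.assertEqual(run_with_watcher(watcher, launch), 0)
        watcher.stop.assert_called_once_with()
```

Removing the `try...finally` from the helper makes the first test fail, because `stop()` is then never called when the launcher raises.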

Severity
High: production impacting
PR Type
Bug fix, Tests
Description
Wrap runner execution in try-finally block to ensure cleanup
Prevent zombie health_check_watcher threads on subprocess failures
Add regression test simulating subprocess crash scenarios
Guarantee thread cleanup on exceptions, crashes, and interrupts
File Walkthrough
krkn_runner.py: Add try-finally for guaranteed watcher cleanup (`krkn_ai/chaos_engines/krkn_runner.py`)
test_krkn_runner_leak.py: Add test for health check watcher cleanup (`tests/test_krkn_runner_leak.py`)