Prevent indefinite hang during chaos execution when Kubernetes API is unreachable #124
WHOIM1205 wants to merge 1 commit into krkn-chaos:main
Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
rh-rahulshetty left a comment
Added a small observation about how we enable this feature. Implementation-wise, it looks great!
    # Run command (show logs when verbose mode is enabled)
    # Timeout = wait_duration + 120 seconds buffer for initialization/teardown
    execution_timeout = self.config.wait_duration + 120
I feel we should disable this by default and let end users configure it for their use case. Krkn scenarios can run for more than 240 seconds, depending on the scenario. If we enforce a limit here without knowing the user's application setup, the cluster details, or the types of scenarios they plan to run, we might end up killing Krkn-AI tests halfway through scenario execution.
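One way to read the suggestion above is an opt-in buffer: no limit unless the user sets one. The sketch below is only an illustration; the helper name `resolve_execution_timeout` and the `buffer_seconds` option are hypothetical and are not part of this PR or of krkn-ai's configuration.

```python
from typing import Optional


def resolve_execution_timeout(
    wait_duration: int,
    buffer_seconds: Optional[int] = None,
) -> Optional[int]:
    """Return the execution timeout in seconds, or None for "no limit".

    Hypothetical helper: `buffer_seconds` stands in for a user-facing
    setting that is unset (None) by default, so the timeout is opt-in.
    """
    if buffer_seconds is None:
        # Default: timeout disabled, scenarios may run as long as they need.
        return None
    if buffer_seconds < 0:
        raise ValueError("buffer_seconds must be non-negative")
    # Opt-in: bound the run to the wait duration plus the chosen buffer.
    return wait_duration + buffer_seconds
```

A `None` result can be handed straight to `subprocess.Popen.wait(timeout=None)`, which waits without a deadline, so the disabled case needs no special handling at the call site.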
User description
Fix: Prevent indefinite hang during chaos execution when Kubernetes API is unreachable
Summary
This PR fixes a critical issue where krkn-ai could hang indefinitely while executing chaos scenarios if the Kubernetes API becomes slow or unreachable.
The root cause was an unbounded subprocess execution in run_shell() that waited forever for command output or process exit. Under real Kubernetes failure conditions, krknctl may never return, causing krkn-ai to block permanently with no recovery.
This change introduces a bounded timeout for chaos execution and ensures failures are handled gracefully.
Problem
run_shell() executed krknctl/podman using subprocess.Popen() with:
- unbounded reads from stdout
- an unbounded process.wait()
If the Kubernetes API is unreachable or the target resources are deleted mid-execution, krknctl can hang indefinitely, causing krkn-ai to stall forever.
This occurs during exactly the kinds of instability chaos testing is designed to introduce.
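The pattern described above can be sketched as follows. This is a reconstruction from the description, not the actual pre-fix code; the function name `run_unbounded` is hypothetical, and it takes an argument list rather than a shell string to keep the example simple.

```python
import subprocess


def run_unbounded(command):
    """Run a command with no deadline, returning (output, exit_code).

    Illustrative only: both the read loop and wait() block for as long
    as the child stays alive, which is the failure mode described above.
    """
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    output = []
    # Blocks until the child closes stdout; a hung child never does.
    for line in process.stdout:
        output.append(line)
    # No timeout argument, so this also waits without a deadline.
    process.wait()
    return "".join(output), process.returncode
```

With a well-behaved command this returns normally. If the child never exits, neither call returns and the caller has no point at which to recover.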
Why This Is Critical
Impact before fix
Affected users
Root Cause
run_shell() assumed subprocesses would always complete in bounded time.
This assumption breaks when:
krknctl --wait-duration does not cover the initial API connection phase, so it does not prevent this hang.
Fix
1. Add timeout support to run_shell()
- New timeout parameter
- Returns -1 on timeout with partial logs preserved
2. Bound chaos execution time
krkn_runner now passes a timeout of wait_duration + 120 seconds to run_shell().
PR Type
Bug fix, Tests
Description
- Add timeout parameter to run_shell() to prevent indefinite hangs
- Implement threading-based timeout with graceful termination and SIGKILL fallback
- Return exit code -1 on timeout for proper error handling
- Pass execution timeout from krkn_runner with 120-second buffer
- Add comprehensive unit tests for timeout and edge cases
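The mechanism listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: the signature, the 5-second grace period, and the use of an argument list instead of a shell string are all assumptions made for the example.

```python
import subprocess
import threading


def run_shell(command, timeout=None, grace_seconds=5):
    """Run a command, returning (combined_output, exit_code).

    Sketch only: on timeout the exit code is -1 and whatever output was
    captured before the deadline is still returned. `grace_seconds` is a
    hypothetical knob for the wait between terminate() and kill().
    """
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    lines = []

    def drain():
        # Read in a background thread so a silent child cannot block the caller.
        for line in process.stdout:
            lines.append(line)

    reader = threading.Thread(target=drain, daemon=True)
    reader.start()

    try:
        exit_code = process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.terminate()  # graceful stop first (SIGTERM on POSIX)
        try:
            process.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            process.kill()  # forced stop (SIGKILL on POSIX)
            process.wait()
        exit_code = -1

    # Give the reader a moment to collect output buffered before exit.
    reader.join(timeout=grace_seconds)
    return "".join(lines), exit_code
```

A caller can treat `-1` like any other non-zero exit code and still inspect the partial output. One caveat of this sketch: if the command is launched through a shell, terminating the shell does not necessarily terminate its children, so a real implementation may need process groups.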
Diagram Walkthrough
File Walkthrough
krkn_runner.py
Add timeout parameter to chaos execution
krkn_ai/chaos_engines/krkn_runner.py
- Compute execution timeout with wait_duration + 120 seconds buffer
- Pass timeout parameter to run_shell() call
__init__.py
Implement timeout mechanism in run_shell function
krkn_ai/utils/__init__.py
- Add timeout parameter to run_shell() function
test_run_shell.py
Add comprehensive unit tests for run_shell timeout
tests/unit/utils/test_run_shell.py
- run_shell() function
- output
- output
- do_not_log parameter