
# Fix Silent Failure in Composite Scenarios by Evaluating All Scenario Exit Codes #121

Open
WHOIM1205 wants to merge 1 commit into krkn-chaos:main from WHOIM1205:fix/composite-scenario-exit-status
WHOIM1205 commented Jan 26, 2026

User description

## Summary

This PR fixes a critical silent failure bug in Krkn where composite scenario runs only evaluated the exit status of the first scenario, ignoring failures in subsequent scenarios.

As a result, chaos runs involving multiple scenarios could be incorrectly marked as successful, even when later scenarios failed due to SLO violations or misconfigurations.

This PR ensures that all scenario exit statuses are evaluated and the worst (non-zero) exit status is returned and logged.


## Problem Description

Krkn supports composite scenarios via `krknctl graph run`, where multiple chaos scenarios are executed sequentially and reported in telemetry as an array of results.

However, the return code extraction logic only inspected the first scenario:

```python
exit_status = scenarios[0].get("exit_status", default_returncode)
```

## Impacted Test Scenarios

The following test cases demonstrate the impact of this fix and prevent regressions in composite scenario handling.

### 1. Composite Scenario With Partial Failure (Primary Case)

**Input Telemetry**
```json
{
  "telemetry": {
    "scenarios": [
      {"name": "pod_scenario", "exit_status": 0},
      {"name": "network_scenario", "exit_status": 2}
    ]
  }
}
```
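
To make the effect on this input concrete, here is a minimal sketch rather than the code in `krkn_runner.py`. The helper names `first_only_status` and `worst_status` are hypothetical, and `default_returncode=1` is an assumed value. The selection rule follows the priority this PR describes: any other non-zero code over 2, and 2 over 0.

```python
import json

# The telemetry sample from this test case.
TELEMETRY = json.loads("""
{
  "telemetry": {
    "scenarios": [
      {"name": "pod_scenario", "exit_status": 0},
      {"name": "network_scenario", "exit_status": 2}
    ]
  }
}
""")


def first_only_status(scenarios, default_returncode):
    # Previous behaviour: only the first scenario is inspected.
    return scenarios[0].get("exit_status", default_returncode)


def worst_status(scenarios, default_returncode):
    # Hypothetical helper: inspect every scenario and keep the worst status.
    worst = 0
    for scenario in scenarios:
        status = scenario.get("exit_status", default_returncode)
        if status == 0:
            continue
        # Any failure replaces success; a non-2 failure replaces an SLO failure (2).
        if worst == 0 or (worst == 2 and status != 2):
            worst = status
    return worst


scenarios = TELEMETRY["telemetry"]["scenarios"]
old_result = first_only_status(scenarios, default_returncode=1)
new_result = worst_status(scenarios, default_returncode=1)
print(old_result, new_result)
```

With first-only extraction the run reports 0 even though `network_scenario` exited with 2; evaluating every scenario surfaces the 2.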



___

### **PR Type**
Bug fix


___

### **Description**
- Fix silent failure in composite scenarios by evaluating all exit codes

- Implement worst-case exit status logic prioritizing misconfiguration errors

- Add detailed logging for failed scenarios in composite runs

- Prevent incorrect success marking when subsequent scenarios fail


___

### Diagram Walkthrough


```mermaid
flowchart LR
  A["Extract scenarios from telemetry"] --> B["Iterate all scenarios"]
  B --> C["Collect exit statuses"]
  C --> D["Determine worst status"]
  D --> E["Log failures if found"]
  E --> F["Return worst exit status"]
```

### File Walkthrough

Relevant files

**Bug fix**: `krkn_ai/chaos_engines/krkn_runner.py` (+29/-5)

Evaluate all scenario exit codes in composite runs

- Changed exit status extraction from first scenario only to evaluating all scenarios
- Implemented worst-case exit status logic with prioritization: misconfiguration errors (`!= 0`, `!= 2`) > SLO failures (`2`) > success (`0`)
- Added collection and logging of failed scenarios with their names and exit codes
- Updated debug logging to reflect worst exit status from all scenarios instead of just first
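
The prioritization above can also be read as a ranking over exit codes. The following is an illustrative sketch only, not the implementation in this PR; `severity` and `worst_exit_status` are hypothetical names, and returning 0 for an empty list is an assumption.

```python
from typing import Iterable


def severity(exit_status: int) -> int:
    """Rank an exit status: success (0) < SLO failure (2) < any other code."""
    if exit_status == 0:
        return 0
    if exit_status == 2:
        return 1
    return 2


def worst_exit_status(statuses: Iterable[int]) -> int:
    # max() returns the first element with the highest key, so the first
    # misconfiguration code seen is the one reported. An empty input is
    # treated as success here (an assumption for this sketch).
    return max(statuses, key=severity, default=0)


examples = {
    "all pass": worst_exit_status([0, 0]),
    "slo only": worst_exit_status([0, 2]),
    "misconfig wins": worst_exit_status([2, 1, 2]),
    "first misconfig kept": worst_exit_status([3, 1]),
    "empty": worst_exit_status([]),
}
print(examples)
```

Expressing the rule as a key function keeps the priority order in one place, at the cost of hiding which scenario produced the code; the PR's loop collects the failing names as well.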

Check all scenario exit statuses instead of only the first one.
For composite scenarios, return the worst exit status to prevent
silent failures when subsequent scenarios fail.

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@qodo-code-review

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance

Log injection risk

Description: Untrusted telemetry fields (`scenario_name`, `exit_status`) are logged in `failed_scenarios` without sanitization. If an attacker can influence the run log or telemetry content, this could enable log forging or injection, for example through scenario names containing newlines or control characters.

krkn_runner.py [545-562]

Referred Code
```python
for scenario in scenarios:
    exit_status = scenario.get("exit_status", 0)
    scenario_name = scenario.get("name", "unknown")

    if exit_status != 0:
        failed_scenarios.append((scenario_name, exit_status))
        # Prioritize misconfiguration errors over SLO failures
        if worst_exit_status == 0:
            worst_exit_status = exit_status
        elif exit_status != 2 and worst_exit_status == 2:
            # Misconfiguration error takes precedence over SLO failure
            worst_exit_status = exit_status

if failed_scenarios:
    logger.warning(
        "Scenario failures detected in composite run: %s",
        failed_scenarios
    )
```
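
One way to address the concern is to escape untrusted values before they reach the logger. The sketch below is an assumption about how that could look, not part of this PR; `sanitize_for_log`, `format_failures`, and the logger name are hypothetical. Note that `%s` applied to a list already renders each element with `repr()`, which escapes newlines in strings, so explicit sanitization mainly protects against a later change to the message format.

```python
import logging

logger = logging.getLogger("composite_run_example")  # hypothetical logger name


def sanitize_for_log(value, max_length: int = 128) -> str:
    """Return a single-line, printable, length-limited form of an untrusted value."""
    text = str(value)
    # Escape every non-printable character (newlines, tabs, control codes)
    # so one value cannot span or forge additional log lines.
    cleaned = "".join(
        ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
        for ch in text
    )
    return cleaned[:max_length]


def format_failures(failed_scenarios) -> str:
    # Build the message from sanitized parts instead of logging raw tuples.
    return ", ".join(
        f"{sanitize_for_log(name)} (exit_status={sanitize_for_log(status)})"
        for name, status in failed_scenarios
    )


message = format_failures([("pod_scenario\nERROR forged entry", 2)])
logger.warning("Scenario failures detected in composite run: %s", message)
```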
Ticket Compliance

🎫 No ticket provided

Codebase Duplication Compliance

Codebase context is not defined

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Wrong default handling: When a scenario is missing `exit_status`, the new logic defaults it to 0 (success) instead of using `default_returncode`, which can incorrectly mark composite runs as successful.

Referred Code
```python
worst_exit_status = 0
failed_scenarios = []

for scenario in scenarios:
    exit_status = scenario.get("exit_status", 0)
    scenario_name = scenario.get("name", "unknown")

    if exit_status != 0:
        failed_scenarios.append((scenario_name, exit_status))
        # Prioritize misconfiguration errors over SLO failures
        if worst_exit_status == 0:
            worst_exit_status = exit_status
        elif exit_status != 2 and worst_exit_status == 2:
            # Misconfiguration error takes precedence over SLO failure
            worst_exit_status = exit_status

if failed_scenarios:
    logger.warning(
        "Scenario failures detected in composite run: %s",
        failed_scenarios
    )
```

... (clipped 5 lines)
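
The difference between the two fallbacks shows up as soon as one scenario lacks the field. This is a small sketch under stated assumptions: the value `default_returncode = 1` and the sample telemetry are made up for illustration.

```python
default_returncode = 1  # assumed caller-supplied fallback, for illustration only

scenarios = [
    {"name": "pod_scenario", "exit_status": 0},
    {"name": "network_scenario"},  # exit_status missing from telemetry
]

# Hardcoded fallback: the missing status is silently read as success.
statuses_hardcoded = [s.get("exit_status", 0) for s in scenarios]
# Caller fallback: the missing status inherits the runner's return code.
statuses_fallback = [s.get("exit_status", default_returncode) for s in scenarios]

run_failed_hardcoded = any(code != 0 for code in statuses_hardcoded)
run_failed_fallback = any(code != 0 for code in statuses_fallback)
print(statuses_hardcoded, statuses_fallback)
```

With the hardcoded 0, the incomplete scenario never enters `failed_scenarios`, so the composite run can still be reported as successful.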

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Failure details logged: The warning log outputs `failed_scenarios` (scenario names and exit codes). Depending on scenario naming conventions, these may unintentionally include sensitive identifiers, so the output should be reviewed and redacted as needed.

Referred Code
```python
if failed_scenarios:
    logger.warning(
        "Scenario failures detected in composite run: %s",
        failed_scenarios
    )
```

Compliance status legend:
- 🟢 Fully Compliant
- 🟡 Partially Compliant
- 🔴 Not Compliant
- ⚪ Requires Further Human Verification
- 🏷️ Compliance label

@WHOIM1205 (Author)

hey @rh-rahulshetty
This fixes a silent failure in composite scenarios where only the first scenario’s exit status was evaluated. The runner now inspects all scenario results and propagates the worst failure, ensuring partial chaos failures are no longer reported as success.

@qodo-code-review

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue
Suggestion: Use default_returncode fallback
Impact: Medium

Replace the hardcoded 0 with the `default_returncode` parameter in `scenario.get("exit_status", 0)` to handle missing exit codes correctly.

krkn_ai/chaos_engines/krkn_runner.py [546]

```diff
-exit_status = scenario.get("exit_status", 0)
+exit_status = scenario.get("exit_status", default_returncode)
```

Suggestion importance[1-10]: 7

Why: This is a valid point; using the passed `default_returncode` instead of a hardcoded 0 makes the function more robust by correctly handling cases where a scenario's exit status is missing.
Category: General
Suggestion: Simplify exit status priority logic
Impact: Low

Refactor the logic for determining the `worst_exit_status` to improve readability by separating the priority checks from the main loop.

krkn_ai/chaos_engines/krkn_runner.py [542-556]

```diff
 worst_exit_status = 0
 failed_scenarios = []
 
+# Priority: misconfiguration > SLO failure (2) > success (0)
+has_misconfig = any(s.get("exit_status", 0) not in [0, 2] for s in scenarios)
+has_slo_failure = any(s.get("exit_status") == 2 for s in scenarios)
+
+if has_misconfig:
+    worst_exit_status = next((s.get("exit_status") for s in scenarios if s.get("exit_status", 0) not in [0, 2]), 0)
+elif has_slo_failure:
+    worst_exit_status = 2
+
 for scenario in scenarios:
     exit_status = scenario.get("exit_status", 0)
-    scenario_name = scenario.get("name", "unknown")
+    if exit_status != 0:
+        scenario_name = scenario.get("name", "unknown")
+        failed_scenarios.append((scenario_name, exit_status))
 
-    if exit_status != 0:
-        failed_scenarios.append((scenario_name, exit_status))
-        # Prioritize misconfiguration errors over SLO failures
-        if worst_exit_status == 0:
-            worst_exit_status = exit_status
-        elif exit_status != 2 and worst_exit_status == 2:
-            # Misconfiguration error takes precedence over SLO failure
-            worst_exit_status = exit_status
-
```

Suggestion importance[1-10]: 2

Why: The suggested code is less efficient because it iterates over the scenarios multiple times, whereas the original code uses a single loop, and it is not clearly more readable.
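
Since this suggestion is judged on style and efficiency, it helps to confirm that both variants select the same status. The following sketch wraps each variant in a hypothetical function (`single_pass`, `multi_pass`) and compares them exhaustively over short scenario lists. It only covers exit codes 0 to 3, lists of up to four scenarios, and scenarios that all carry `exit_status`.

```python
from itertools import product


def single_pass(scenarios):
    # Selection logic from the PR's loop, wrapped in a function for testing.
    worst = 0
    for scenario in scenarios:
        exit_status = scenario.get("exit_status", 0)
        if exit_status != 0:
            if worst == 0:
                worst = exit_status
            elif exit_status != 2 and worst == 2:
                worst = exit_status
    return worst


def multi_pass(scenarios):
    # Selection logic from the suggested refactor, wrapped the same way.
    worst = 0
    has_misconfig = any(s.get("exit_status", 0) not in [0, 2] for s in scenarios)
    has_slo_failure = any(s.get("exit_status") == 2 for s in scenarios)
    if has_misconfig:
        worst = next(
            (s.get("exit_status") for s in scenarios
             if s.get("exit_status", 0) not in [0, 2]),
            0,
        )
    elif has_slo_failure:
        worst = 2
    return worst


# Compare both variants on every combination of codes 0..3, lengths 0..4.
mismatches = []
for length in range(0, 5):
    for combo in product([0, 1, 2, 3], repeat=length):
        scenarios = [{"name": f"s{i}", "exit_status": c} for i, c in enumerate(combo)]
        if single_pass(scenarios) != multi_pass(scenarios):
            mismatches.append(combo)

print(len(mismatches))
```

Within that range the two variants agree, so the choice between them comes down to the readability and iteration-count trade-off noted above.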