
feat: Implement ML-based recommendation system for chaos scenarios#145

Open
DhruvTotala wants to merge 1 commit into krkn-chaos:main from DhruvTotala:Issue(76)

Conversation


@DhruvTotala commented Feb 1, 2026

User description

This pull request adds the first version of a machine learning–based recommendation system to Krkn AI.

The goal is to help users decide which chaos scenario to run by looking at the current state of the cluster. Based on telemetry data such as CPU usage, memory usage, and network behavior, the system recommends the most relevant chaos experiment (for example, whether a CPU hog or memory hog scenario would have more impact).
Since there is no historical labeled data available yet, this PR also includes a synthetic data generator. This allows us to create realistic fake telemetry data and train an initial model so the recommendation system can work from day one.

What’s Included in This PR:

Recommendation Engine

Introduces a new ScenarioRecommender class.
Collects aggregated cluster metrics from Prometheus.
Uses a Random Forest machine learning model to predict which chaos scenario is most suitable for the current system conditions.
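
To make the flow concrete, here is a minimal sketch of how the recommender fits together, reconstructed from the code excerpts quoted in the review comments below; the CPU and memory PromQL strings are illustrative placeholders, and exact signatures in the PR may differ.

# Sketch only: the feature names, process_query(), and the network query come from
# the excerpts below; the cpu/memory PromQL strings are assumed for illustration.
import joblib
import pandas as pd

class ScenarioRecommender:
    def __init__(self, prom_client, model_path=None):
        self.prom_client = prom_client
        self.feature_names = ["cpu_usage", "memory_usage", "network_io"]
        self.model = joblib.load(model_path) if model_path else None

    def collect_telemetry(self) -> pd.DataFrame:
        queries = {
            "cpu_usage": "sum(rate(container_cpu_usage_seconds_total[5m]))",      # assumed query
            "memory_usage": "sum(container_memory_working_set_bytes)",            # assumed query
            "network_io": "sum(rate(container_network_receive_bytes_total[5m]))", # from the diff
        }
        data = {}
        for name, query in queries.items():
            try:
                result = self.prom_client.process_query(query)
                # Instant vector result: result[0]['value'] is [timestamp, "value"]
                data[name] = float(result[0]["value"][1]) if result else 0.0
            except Exception:
                data[name] = 0.0  # silent fallback, flagged in the review below
        return pd.DataFrame([data], columns=self.feature_names)

    def recommend(self, telemetry_df: pd.DataFrame) -> str:
        # Random Forest prediction over a single row of aggregated metrics
        return self.model.predict(telemetry_df[self.feature_names])[0]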

Command-Line Support

Adds a new recommend CLI command.
Users can optionally provide a Prometheus URL.
The trained model is loaded from a file (default: krkn_model.pkl).
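
A rough shape of the command, pieced together from the cmd.py excerpts quoted in the compliance section below; the --prometheus-url flag name and the SystemExit handling are assumptions for this sketch, and the kubeconfig/token options shown in the diff are omitted.

# Sketch only: mirrors the cmd.py excerpts quoted below.
import os
import click
from krkn_ai.recommendation import ScenarioRecommender  # exported by the new package
# create_prometheus_client is the existing Krkn AI utility referenced in the diff
# (its import path is not shown on this page).

@click.command()
@click.option("--model-path", default="krkn_model.pkl", help="Path to the trained model file.")
@click.option("--prometheus-url", default=None, help="Optional Prometheus URL override.")
def recommend(model_path, prometheus_url):
    if not os.path.exists(model_path):
        click.echo(f"Model not found at {model_path}. Train a model first or pass a valid path.")
        raise SystemExit(1)

    if prometheus_url:
        os.environ["PROMETHEUS_URL"] = prometheus_url

    prom_client = create_prometheus_client(None)  # kubeconfig handling omitted in this sketch
    recommender = ScenarioRecommender(prom_client, model_path)
    telemetry_df = recommender.collect_telemetry()
    click.echo(f"Recommended Chaos Scenario: {recommender.recommend(telemetry_df)}")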

Training Script

Adds a utility script to:
Generate synthetic telemetry data.
Train an initial machine learning model using that data.
This helps bootstrap the system until real-world data becomes available.
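
A minimal sketch of what the training script does, using the labeling rules and thresholds quoted in the code-suggestion section below; the random value ranges and RandomForest parameters are assumptions.

# Sketch only: labels and thresholds mirror the rules quoted below;
# value ranges and model parameters are illustrative.
import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def generate_synthetic_data(n_samples=1000):
    rng = np.random.default_rng(42)
    rows, labels = [], []
    for _ in range(n_samples):
        cpu, memory = rng.random(), rng.random()   # utilization fractions in [0, 1]
        network = rng.random() * 1000              # assumed network-rate scale
        if cpu > 0.8 and memory < 0.5:
            label = "cpu-hog"
        elif memory > 0.8 and cpu < 0.5:
            label = "memory-hog"
        elif network > 800:
            label = "network-chaos"
        else:
            label = "pod-delete"
        rows.append({"cpu_usage": cpu, "memory_usage": memory, "network_io": network})
        labels.append(label)
    return pd.DataFrame(rows), labels

if __name__ == "__main__":
    X, y = generate_synthetic_data()
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)
    joblib.dump(model, "krkn_model.pkl")  # default path expected by the recommend command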


PR Type

Enhancement


Description

  • Adds ML-based recommendation system for chaos scenarios

  • Implements ScenarioRecommender class with telemetry collection

  • Includes training script with synthetic data generator

  • Adds CLI command to recommend scenarios based on cluster metrics


Diagram Walkthrough

flowchart LR
  A["Prometheus Metrics"] -->|collect_telemetry| B["ScenarioRecommender"]
  C["Synthetic Data Generator"] -->|train| B
  B -->|load_model| D["Trained ML Model"]
  D -->|recommend| E["Chaos Scenario"]
  F["CLI recommend command"] -->|uses| B

File Walkthrough

Relevant files
Enhancement
cmd.py
Add recommend CLI command with ML integration                       

krkn_ai/cli/cmd.py

  • Adds new recommend CLI command for scenario recommendations
  • Imports ScenarioRecommender and prometheus client utilities
  • Implements telemetry collection and model-based prediction workflow
  • Includes error handling for missing models and Prometheus connection
    issues
+78/-0   
__init__.py
Initialize recommendation module package                                 

krkn_ai/recommendation/__init__.py

  • Creates new recommendation module package
  • Exports ScenarioRecommender class for public API
+3/-0     
recommender.py
Implement core recommendation engine logic                             

krkn_ai/recommendation/recommender.py

  • Implements ScenarioRecommender class with Random Forest model
  • Collects telemetry data (CPU, memory, network) from Prometheus
  • Provides train, recommend, save_model, and load_model methods
  • Handles model persistence using joblib serialization
+89/-0   
train_model.py
Add model training script with synthetic data                       

scripts/train_model.py

  • Creates synthetic telemetry data generator with rule-based labeling
  • Trains Random Forest model on generated data
  • Implements heuristic logic mapping metrics to chaos scenarios
  • Includes test prediction to validate model functionality
+79/-0   
Dependencies
requirements.txt
Add scikit-learn dependency                                                           

requirements.txt

  • Adds scikit-learn dependency for Random Forest classifier
+1/-0     


qodo-code-review bot commented Feb 1, 2026

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Insecure deserialization

Description: Insecure deserialization risk: joblib.load(path) loads a pickle-based model from a
user-controllable --model-path, which can enable arbitrary code execution if an attacker
supplies a malicious .pkl file (e.g., via a swapped file in the working directory or a
downloaded model).
recommender.py [83-88]

Referred Code
def load_model(self, path: str):
    logger.info(f"Loading model from {path}")
    try:
        self.model = joblib.load(path)
    except Exception as e:
        logger.error(f"Failed to load model: {e}")
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢 Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed


🔴 Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Silent metric fallback: Telemetry collection swallows Prometheus query failures and missing data by defaulting
metrics to 0.0, which can lead to incorrect recommendations without a clear failure signal
to callers.

Referred Code
try:
    # We use process_query to get instant vector
    result = self.prom_client.process_query(query)
    if result and len(result) > 0 and 'value' in result[0]:
        # result[0]['value'] is [timestamp, "value"]
        val = float(result[0]['value'][1])
        data[name] = val
    else:
        logger.warning(f"No data found for {name}, defaulting to 0")
        data[name] = 0.0
except Exception as e:
    logger.error(f"Failed to query {name}: {e}")
    data[name] = 0.0


Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
User-facing stack traces: The CLI uses logger.exception(...) for generic exceptions which typically prints stack
traces to the console, potentially exposing internal implementation details to end users.

Referred Code
except Exception as e:
    logger.exception(f"An error occurred: {e}")
    exit(1)


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unsafe model deserialization: The recommender loads a user-supplied model file via joblib.load(path) without integrity
validation or trust boundaries, enabling arbitrary code execution if a malicious pickle is
provided.

Referred Code
def load_model(self, path: str):
    logger.info(f"Loading model from {path}")
    try:
        self.model = joblib.load(path)
    except Exception as e:
        logger.error(f"Failed to load model: {e}")


Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing actor context: The new recommend CLI flow logs actions and outcomes but does not include any user/actor
identifier context, making it difficult to attribute actions if this command is used in
shared or automated environments.

Referred Code
init_logger(None, verbose >= 2)
logger = get_logger(__name__)

if not os.path.exists(model_path):
    logger.error(
        f"Model not found at {model_path}. Please train a model first or specify a valid path."
    )
    exit(1)

# Set env vars for prometheus client creation if provided explicitly
if prometheus_url:
    os.environ["PROMETHEUS_URL"] = prometheus_url
if prometheus_token:
    os.environ["PROMETHEUS_TOKEN"] = prometheus_token

try:
    # Create prometheus client using existing utility
    prom_client = create_prometheus_client(kubeconfig)

    recommender = ScenarioRecommender(prom_client, model_path)



 ... (clipped 17 lines)


Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Telemetry logged verbatim: The CLI logs full collected telemetry data (telemetry_df.to_string(...)) which may leak
sensitive operational cluster metrics into logs depending on deployment log retention and
access controls.

Referred Code
logger.info("Collected Telemetry:\n%s", telemetry_df.to_string(index=False))

# Predict
recommendation = recommender.recommend(telemetry_df)

click.echo(f"\nRecommended Chaos Scenario: {recommendation}")
logger.info(f"Recommendation: {recommendation}")


Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label


qodo-code-review bot commented Feb 1, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
High-level
Replace the ML model with a simple rule-based engine

The suggestion is to replace the machine learning model with a simple rule-based
engine. This is because the model is trained on synthetic data generated from
the same hardcoded rules, making the ML approach unnecessarily complex.

Examples:

scripts/train_model.py [16-53]
def generate_synthetic_data(n_samples=1000):
    """
    Generates synthetic telemetry data with labeled chaos scenarios.
    
    Logic:
    - High CPU, Low Memory -> cpu-hog
    - Low CPU, High Memory -> memory-hog
    - High Network -> network-chaos
    - Balanced/Normal -> random/pod-delete (as a generic fallback)
    """

 ... (clipped 28 lines)
krkn_ai/recommendation/recommender.py [11-89]
class ScenarioRecommender:
    def __init__(self, prom_client: KrknPrometheus, model_path: str = None):
        self.prom_client = prom_client
        self.model_path = model_path
        self.model = None
        self.feature_names = ["cpu_usage", "memory_usage", "network_io"]
        
        if model_path and os.path.exists(model_path):
            self.load_model(model_path)


 ... (clipped 69 lines)

Solution Walkthrough:

Before:

# scripts/train_model.py
def generate_synthetic_data():
    # ...
    if cpu > 0.8 and memory < 0.5:
        label = "cpu-hog"
    elif memory > 0.8 and cpu < 0.5:
        label = "memory-hog"
    # ...
    return data, labels

X, y = generate_synthetic_data()
recommender = ScenarioRecommender(prom_client=None)
recommender.train(X, y, save_path="krkn_model.pkl")

# krkn_ai/recommendation/recommender.py
class ScenarioRecommender:
    def __init__(self, model_path):
        self.model = joblib.load(model_path)
    
    def recommend(self, telemetry_data):
        return self.model.predict(telemetry_data)

After:

# krkn_ai/recommendation/recommender.py
class ScenarioRecommender:
    def recommend(self, telemetry_data: pd.DataFrame) -> str:
        metrics = telemetry_data.iloc[0]
        cpu = metrics["cpu_usage"]
        memory = metrics["memory_usage"]
        network = metrics["network_io"]

        if cpu > 0.8 and memory < 0.5:
            return "cpu-hog"
        elif memory > 0.8 and cpu < 0.5:
            return "memory-hog"
        elif network > 800:
            return "network-chaos"
        else:
            return "pod-delete"
Suggestion importance[1-10]: 9


Why: This is a critical design suggestion that correctly identifies that the ML model is redundant because it only learns to approximate the hardcoded rules used for synthetic data generation, adding unnecessary complexity.

High
Security
Address insecure model deserialization vulnerability

Add a security warning before loading a model with joblib.load, as it can be
insecure and lead to arbitrary code execution if the model file is from an
untrusted source.

krkn_ai/recommendation/recommender.py [83-89]

 def load_model(self, path: str):
+    logger.warning(
+        "Loading a model file with joblib.load is not secure. "
+        "Only load models from a trusted source."
+    )
     logger.info(f"Loading model from {path}")
     try:
         self.model = joblib.load(path)
     except Exception as e:
         logger.error(f"Failed to load model: {e}")
         raise
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a critical security vulnerability (arbitrary code execution) from deserializing a user-provided model file and proposes a reasonable mitigation by adding a warning.

High
Possible issue
Improve network metric query accuracy

Update the Prometheus query for network_io to include both received and
transmitted bytes for a more accurate network traffic measurement.

krkn_ai/recommendation/recommender.py [32-33]

 # Basic network I/O sum across cluster
-"network_io": 'sum(rate(container_network_receive_bytes_total[5m]))'
+"network_io": 'sum(rate(container_network_receive_bytes_total[5m])) + sum(rate(container_network_transmit_bytes_total[5m]))'
Suggestion importance[1-10]: 7


Why: The suggestion correctly points out that the network metric is incomplete, and proposes a change that will improve the accuracy of the collected telemetry data, leading to better recommendations.

Medium
General
Use ctx.exit for errors

Replace exit(1) with ctx.exit(1) to properly terminate the Click command and
allow the framework to handle cleanup.

krkn_ai/cli/cmd.py [247-251]

 if not os.path.exists(model_path):
     logger.error(
         f"Model not found at {model_path}. Please train a model first or specify a valid path."
     )
-    exit(1)
+    ctx.exit(1)
Suggestion importance[1-10]: 6


Why: The suggestion correctly recommends using ctx.exit(1) which is the idiomatic way to exit a Click application, ensuring proper cleanup and integration with the framework.

Low
Remove duplicate log statement

Remove the duplicated log statement in the discover command.

krkn_ai/cli/cmd.py [205-206]

-logger.info("Saved component configuration to %s", output)
 logger.info("Saved component configuration to %s", output)
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies and removes a redundant log message, which cleans up the code and command output.

Low

Signed-off-by: dhruv <dhruvtotla30@gmail.com>


Training the model on randomly generated (rule-based) data feels somewhat meaningless from an ML perspective.
If the training data is synthetic and deterministic, this effectively behaves like a hard-coded rule rather than a model learning from real behavior.
Additionally, the current feature set includes only three telemetry signals, which leaves the model largely blind.

Given the scope of this project, it would be more appropriate to train the recommender on real-time cluster telemetry, ideally in a time-series context from a live or representative environment.

You should also refer to this comment by @rh-rahulshetty in a previously opened PR.
