
feat: Implement ML-based recommendation system for chaos scenarios#145

Open
DhruvTotala wants to merge 1 commit into krkn-chaos:main from DhruvTotala:Issue(76)

Conversation


@DhruvTotala commented Feb 1, 2026

User description

This pull request adds the first version of a machine learning–based recommendation system to Krkn AI.

The goal is to help users decide which chaos scenario to run by looking at the current state of the cluster. Based on telemetry data such as CPU usage, memory usage, and network behavior, the system recommends the most relevant chaos experiment (for example, whether a CPU hog or memory hog scenario would have more impact).
Since there is no historical labeled data available yet, this PR also includes a synthetic data generator. This allows us to create realistic fake telemetry data and train an initial model so the recommendation system can work from day one.

What’s Included in This PR:

Recommendation Engine

Introduces a new ScenarioRecommender class.
Collects aggregated cluster metrics from Prometheus.
Uses a Random Forest machine learning model to predict which chaos scenario is most suitable for the current system conditions.
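
To make the flow concrete, here is a minimal sketch of how the recommender fits together, reconstructed from the code excerpts quoted in the review comments below; the CPU and memory PromQL strings are illustrative placeholders, and exact signatures in the PR may differ.

# Sketch only: the feature names, process_query(), and the network query come from
# the excerpts below; the cpu/memory PromQL strings are assumed for illustration.
import joblib
import pandas as pd

class ScenarioRecommender:
    def __init__(self, prom_client, model_path=None):
        self.prom_client = prom_client
        self.feature_names = ["cpu_usage", "memory_usage", "network_io"]
        self.model = joblib.load(model_path) if model_path else None

    def collect_telemetry(self) -> pd.DataFrame:
        queries = {
            "cpu_usage": "sum(rate(container_cpu_usage_seconds_total[5m]))",      # assumed query
            "memory_usage": "sum(container_memory_working_set_bytes)",            # assumed query
            "network_io": "sum(rate(container_network_receive_bytes_total[5m]))", # from the diff
        }
        data = {}
        for name, query in queries.items():
            try:
                result = self.prom_client.process_query(query)
                # Instant vector result: result[0]['value'] is [timestamp, "value"]
                data[name] = float(result[0]["value"][1]) if result else 0.0
            except Exception:
                data[name] = 0.0  # silent fallback, flagged in the review below
        return pd.DataFrame([data], columns=self.feature_names)

    def recommend(self, telemetry_df: pd.DataFrame) -> str:
        # Random Forest prediction over a single row of aggregated metrics
        return self.model.predict(telemetry_df[self.feature_names])[0]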

Command-Line Support

Adds a new recommend CLI command.
Users can optionally provide a Prometheus URL.
The trained model is loaded from a file (default: krkn_model.pkl).
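
A rough shape of the command, pieced together from the cmd.py excerpts quoted in the compliance section below; the --prometheus-url flag name and the SystemExit handling are assumptions for this sketch, and the kubeconfig/token options shown in the diff are omitted.

# Sketch only: mirrors the cmd.py excerpts quoted below.
import os
import click
from krkn_ai.recommendation import ScenarioRecommender  # exported by the new package
# create_prometheus_client is the existing Krkn AI utility referenced in the diff
# (its import path is not shown on this page).

@click.command()
@click.option("--model-path", default="krkn_model.pkl", help="Path to the trained model file.")
@click.option("--prometheus-url", default=None, help="Optional Prometheus URL override.")
def recommend(model_path, prometheus_url):
    if not os.path.exists(model_path):
        click.echo(f"Model not found at {model_path}. Train a model first or pass a valid path.")
        raise SystemExit(1)

    if prometheus_url:
        os.environ["PROMETHEUS_URL"] = prometheus_url

    prom_client = create_prometheus_client(None)  # kubeconfig handling omitted in this sketch
    recommender = ScenarioRecommender(prom_client, model_path)
    telemetry_df = recommender.collect_telemetry()
    click.echo(f"Recommended Chaos Scenario: {recommender.recommend(telemetry_df)}")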

Training Script

Adds a utility script to:
Generate synthetic telemetry data.
Train an initial machine learning model using that data.
This helps bootstrap the system until real-world data becomes available.
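
A minimal sketch of what the training script does, using the labeling rules and thresholds quoted in the code-suggestion section below; the random value ranges and RandomForest parameters are assumptions.

# Sketch only: labels and thresholds mirror the rules quoted below;
# value ranges and model parameters are illustrative.
import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def generate_synthetic_data(n_samples=1000):
    rng = np.random.default_rng(42)
    rows, labels = [], []
    for _ in range(n_samples):
        cpu, memory = rng.random(), rng.random()   # utilization fractions in [0, 1]
        network = rng.random() * 1000              # assumed network-rate scale
        if cpu > 0.8 and memory < 0.5:
            label = "cpu-hog"
        elif memory > 0.8 and cpu < 0.5:
            label = "memory-hog"
        elif network > 800:
            label = "network-chaos"
        else:
            label = "pod-delete"
        rows.append({"cpu_usage": cpu, "memory_usage": memory, "network_io": network})
        labels.append(label)
    return pd.DataFrame(rows), labels

if __name__ == "__main__":
    X, y = generate_synthetic_data()
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)
    joblib.dump(model, "krkn_model.pkl")  # default path expected by the recommend command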


PR Type

Enhancement


Description

  • Adds ML-based recommendation system for chaos scenarios

  • Implements ScenarioRecommender class with telemetry collection

  • Includes training script with synthetic data generator

  • Adds CLI command to recommend scenarios based on cluster metrics


Diagram Walkthrough

flowchart LR
  A["Prometheus Metrics"] -->|collect_telemetry| B["ScenarioRecommender"]
  C["Synthetic Data Generator"] -->|train| B
  B -->|load_model| D["Trained ML Model"]
  D -->|recommend| E["Chaos Scenario"]
  F["CLI recommend command"] -->|uses| B

File Walkthrough

Relevant files
Enhancement
cmd.py
Add recommend CLI command with ML integration                       

krkn_ai/cli/cmd.py

  • Adds new recommend CLI command for scenario recommendations
  • Imports ScenarioRecommender and prometheus client utilities
  • Implements telemetry collection and model-based prediction workflow
  • Includes error handling for missing models and Prometheus connection
    issues
+78/-0   
__init__.py
Initialize recommendation module package                                 

krkn_ai/recommendation/__init__.py

  • Creates new recommendation module package
  • Exports ScenarioRecommender class for public API
+3/-0     
recommender.py
Implement core recommendation engine logic                             

krkn_ai/recommendation/recommender.py

  • Implements ScenarioRecommender class with Random Forest model
  • Collects telemetry data (CPU, memory, network) from Prometheus
  • Provides train, recommend, save_model, and load_model methods
  • Handles model persistence using joblib serialization
+89/-0   
train_model.py
Add model training script with synthetic data                       

scripts/train_model.py

  • Creates synthetic telemetry data generator with rule-based labeling
  • Trains Random Forest model on generated data
  • Implements heuristic logic mapping metrics to chaos scenarios
  • Includes test prediction to validate model functionality
+79/-0   
Dependencies
requirements.txt
Add scikit-learn dependency                                                           

requirements.txt

  • Adds scikit-learn dependency for Random Forest classifier
+1/-0     


qodo-code-review bot commented Feb 1, 2026

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Insecure deserialization

Description: Insecure deserialization risk: joblib.load(path) loads a pickle-based model from a
user-controllable --model-path, which can enable arbitrary code execution if an attacker
supplies a malicious .pkl file (e.g., via a swapped file in the working directory or a
downloaded model).
recommender.py [83-88]

Referred Code
def load_model(self, path: str):
    logger.info(f"Loading model from {path}")
    try:
        self.model = joblib.load(path)
    except Exception as e:
        logger.error(f"Failed to load model: {e}")
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢 Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed


🔴 Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Silent metric fallback: Telemetry collection swallows Prometheus query failures and missing data by defaulting
metrics to 0.0, which can lead to incorrect recommendations without a clear failure signal
to callers.

Referred Code
try:
    # We use process_query to get instant vector
    result = self.prom_client.process_query(query)
    if result and len(result) > 0 and 'value' in result[0]:
        # result[0]['value'] is [timestamp, "value"]
        val = float(result[0]['value'][1])
        data[name] = val
    else:
        logger.warning(f"No data found for {name}, defaulting to 0")
        data[name] = 0.0
except Exception as e:
    logger.error(f"Failed to query {name}: {e}")
    data[name] = 0.0


Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
User-facing stack traces: The CLI uses logger.exception(...) for generic exceptions which typically prints stack
traces to the console, potentially exposing internal implementation details to end users.

Referred Code
except Exception as e:
    logger.exception(f"An error occurred: {e}")
    exit(1)


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unsafe model deserialization: The recommender loads a user-supplied model file via joblib.load(path) without integrity
validation or trust boundaries, enabling arbitrary code execution if a malicious pickle is
provided.

Referred Code
def load_model(self, path: str):
    logger.info(f"Loading model from {path}")
    try:
        self.model = joblib.load(path)
    except Exception as e:
        logger.error(f"Failed to load model: {e}")


Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
Missing actor context: The new recommend CLI flow logs actions and outcomes but does not include any user/actor
identifier context, making it difficult to attribute actions if this command is used in
shared or automated environments.

Referred Code
init_logger(None, verbose >= 2)
logger = get_logger(__name__)

if not os.path.exists(model_path):
    logger.error(
        f"Model not found at {model_path}. Please train a model first or specify a valid path."
    )
    exit(1)

# Set env vars for prometheus client creation if provided explicitly
if prometheus_url:
    os.environ["PROMETHEUS_URL"] = prometheus_url
if prometheus_token:
    os.environ["PROMETHEUS_TOKEN"] = prometheus_token

try:
    # Create prometheus client using existing utility
    prom_client = create_prometheus_client(kubeconfig)

    recommender = ScenarioRecommender(prom_client, model_path)



 ... (clipped 17 lines)


Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status:
Telemetry logged verbatim: The CLI logs full collected telemetry data (telemetry_df.to_string(...)) which may leak
sensitive operational cluster metrics into logs depending on deployment log retention and
access controls.

Referred Code
logger.info("Collected Telemetry:\n%s", telemetry_df.to_string(index=False))

# Predict
recommendation = recommender.recommend(telemetry_df)

click.echo(f"\nRecommended Chaos Scenario: {recommendation}")
logger.info(f"Recommendation: {recommendation}")


Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label


qodo-code-review bot commented Feb 1, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
High-level
Replace the ML model with a simple rule-based engine

The suggestion is to replace the machine learning model with a simple rule-based
engine. This is because the model is trained on synthetic data generated from
the same hardcoded rules, making the ML approach unnecessarily complex.

Examples:

scripts/train_model.py [16-53]
def generate_synthetic_data(n_samples=1000):
    """
    Generates synthetic telemetry data with labeled chaos scenarios.
    
    Logic:
    - High CPU, Low Memory -> cpu-hog
    - Low CPU, High Memory -> memory-hog
    - High Network -> network-chaos
    - Balanced/Normal -> random/pod-delete (as a generic fallback)
    """

 ... (clipped 28 lines)
krkn_ai/recommendation/recommender.py [11-89]
class ScenarioRecommender:
    def __init__(self, prom_client: KrknPrometheus, model_path: str = None):
        self.prom_client = prom_client
        self.model_path = model_path
        self.model = None
        self.feature_names = ["cpu_usage", "memory_usage", "network_io"]
        
        if model_path and os.path.exists(model_path):
            self.load_model(model_path)


 ... (clipped 69 lines)

Solution Walkthrough:

Before:

# scripts/train_model.py
def generate_synthetic_data():
    # ...
    if cpu > 0.8 and memory < 0.5:
        label = "cpu-hog"
    elif memory > 0.8 and cpu < 0.5:
        label = "memory-hog"
    # ...
    return data, labels

X, y = generate_synthetic_data()
recommender = ScenarioRecommender(prom_client=None)
recommender.train(X, y, save_path="krkn_model.pkl")

# krkn_ai/recommendation/recommender.py
class ScenarioRecommender:
    def __init__(self, model_path):
        self.model = joblib.load(model_path)
    
    def recommend(self, telemetry_data):
        return self.model.predict(telemetry_data)

After:

# krkn_ai/recommendation/recommender.py
class ScenarioRecommender:
    def recommend(self, telemetry_data: pd.DataFrame) -> str:
        metrics = telemetry_data.iloc[0]
        cpu = metrics["cpu_usage"]
        memory = metrics["memory_usage"]
        network = metrics["network_io"]

        if cpu > 0.8 and memory < 0.5:
            return "cpu-hog"
        elif memory > 0.8 and cpu < 0.5:
            return "memory-hog"
        elif network > 800:
            return "network-chaos"
        else:
            return "pod-delete"
Suggestion importance[1-10]: 9


Why: This is a critical design suggestion that correctly identifies that the ML model is redundant because it only learns to approximate the hardcoded rules used for synthetic data generation, adding unnecessary complexity.

High
Security
Address insecure model deserialization vulnerability

Add a security warning before loading a model with joblib.load, as it can be
insecure and lead to arbitrary code execution if the model file is from an
untrusted source.

krkn_ai/recommendation/recommender.py [83-89]

 def load_model(self, path: str):
+    logger.warning(
+        "Loading a model file with joblib.load is not secure. "
+        "Only load models from a trusted source."
+    )
     logger.info(f"Loading model from {path}")
     try:
         self.model = joblib.load(path)
     except Exception as e:
         logger.error(f"Failed to load model: {e}")
         raise
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a critical security vulnerability (arbitrary code execution) from deserializing a user-provided model file and proposes a reasonable mitigation by adding a warning.

High
Possible issue
Improve network metric query accuracy

Update the Prometheus query for network_io to include both received and
transmitted bytes for a more accurate network traffic measurement.

krkn_ai/recommendation/recommender.py [32-33]

 # Basic network I/O sum across cluster
-"network_io": 'sum(rate(container_network_receive_bytes_total[5m]))'
+"network_io": 'sum(rate(container_network_receive_bytes_total[5m])) + sum(rate(container_network_transmit_bytes_total[5m]))'
Suggestion importance[1-10]: 7


Why: The suggestion correctly points out that the network metric is incomplete, and proposes a change that will improve the accuracy of the collected telemetry data, leading to better recommendations.

Medium
General
Use ctx.exit for errors

Replace exit(1) with ctx.exit(1) to properly terminate the Click command and
allow the framework to handle cleanup.

krkn_ai/cli/cmd.py [247-251]

 if not os.path.exists(model_path):
     logger.error(
         f"Model not found at {model_path}. Please train a model first or specify a valid path."
     )
-    exit(1)
+    ctx.exit(1)
Suggestion importance[1-10]: 6


Why: The suggestion correctly recommends using ctx.exit(1) which is the idiomatic way to exit a Click application, ensuring proper cleanup and integration with the framework.

Low
Remove duplicate log statement

Remove the duplicated log statement in the discover command.

krkn_ai/cli/cmd.py [205-206]

-logger.info("Saved component configuration to %s", output)
 logger.info("Saved component configuration to %s", output)
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies and removes a redundant log message, which cleans up the code and command output.

Low

Signed-off-by: dhruv <dhruvtotla30@gmail.com>


Training the model on randomly generated (rule-based) data feels somewhat meaningless from an ML perspective.
If the training data is synthetic and deterministic, this effectively behaves like a hard-coded rule rather than a model learning from real behavior.
Additionally, the current feature set includes only three telemetry signals, which leaves the model largely blind.

Given the scope of this project, it would be more appropriate to train the recommender on real-time cluster telemetry, ideally in a time-series context from a live or representative environment.

You should also refer to this comment by @rh-rahulshetty in a previously opened PR.
