Description
Summary
Krkn-AI currently fails immediately when Prometheus connectivity validation fails, even though Prometheus may be temporarily unavailable during chaos experiments. Adding a retry mechanism with backoff would improve resilience and reduce false-negative failures during runs.
Description
Krkn-AI validates Prometheus connectivity during initialization by executing a test query.
If this validation fails once, the entire run aborts immediately.
In real-world chaos engineering scenarios, Prometheus can experience temporary unavailability due to:
- network disruption
- control-plane pressure
- monitoring pod restarts
- short-lived API latency spikes
A retry mechanism with controlled backoff would allow Krkn-AI to tolerate transient failures while still failing fast in genuine misconfiguration scenarios.
This feature would improve stability and usability, especially during aggressive chaos experiments.
Background and Context
Krkn-AI relies heavily on Prometheus for:
- fitness evaluation
- scenario scoring
- health verification
However, chaos experiments themselves may momentarily impact:
- Prometheus API availability
- authentication token validation
- network routing
Currently, a single failed Prometheus query causes the entire Krkn-AI run to terminate, even when the service becomes available seconds later.
This behavior limits Krkn-AI’s effectiveness in real chaos testing environments.
Current Code Analysis
Current implementation in:
krkn_ai/utils/prometheus.py
client = KrknPrometheus(url, token)
client.process_query("1")
If process_query() raises an exception, Krkn-AI immediately raises PrometheusConnectionError and exits.
There is no retry, delay, or tolerance window for temporary failures.
Issues Identified
- Single transient Prometheus failure aborts entire run
- No retry or backoff mechanism
- Chaos experiments can self-disrupt monitoring temporarily
- No distinction between transient outage vs permanent misconfiguration
- Reduced reliability in real production-like environments
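One way to draw the transient/permanent line is a small predicate over the exception that the test query raised. The sketch below is illustrative only: `HTTPStatusError` and `PERMANENT_STATUSES` are hypothetical names, and how KrknPrometheus actually surfaces HTTP status codes is not stated in this issue. Treating 401 as permanent is also a judgement call, since token validation can itself be briefly disrupted during chaos runs.

```python
# Hypothetical exception type used only for illustration; not part of Krkn-AI.
class HTTPStatusError(Exception):
    def __init__(self, status: int, message: str = ""):
        super().__init__(message or f"HTTP {status}")
        self.status = status


# Assumed set of status codes that a retry is unlikely to fix
# (bad request, bad credentials, wrong URL).
PERMANENT_STATUSES = {400, 401, 403, 404}


def is_transient(exc: BaseException) -> bool:
    """Return True when retrying has a reasonable chance of succeeding."""
    if isinstance(exc, HTTPStatusError):
        if exc.status in PERMANENT_STATUSES:
            return False
        # Rate limiting and server-side errors usually clear on their own.
        return exc.status == 429 or 500 <= exc.status < 600
    # Built-in ConnectionError and TimeoutError cover refused, reset,
    # and timed-out connections.
    return isinstance(exc, (ConnectionError, TimeoutError))
```

Anything the predicate does not recognise is treated as permanent, which keeps the fail-fast behaviour for unknown errors.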
Problem
Krkn-AI cannot tolerate short-lived Prometheus unavailability, even though such failures are expected during chaos experiments.
This leads to:
- false-negative failures
- premature termination of experiments
- reduced confidence in Krkn-AI automation
- poor resilience under real chaos conditions
Proposed Solution
Introduce a retry mechanism when validating Prometheus connectivity:
- Retry failed Prometheus test queries multiple times
- Use exponential backoff between attempts
- Preserve immediate failure for permanent misconfiguration
- Skip validation entirely when MOCK_FITNESS is enabled
This would allow Krkn-AI to:
- recover from transient outages
- continue experiments safely
- remain strict on genuine configuration errors
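A minimal sketch of the retry loop described above, assuming the connectivity check can be passed in as a zero-argument callable. `validate_with_retry`, its parameters, and the default delays are hypothetical; `PrometheusConnectionError` is redefined locally as a stand-in so the example runs on its own.

```python
import time
from typing import Callable


class PrometheusConnectionError(Exception):
    """Local stand-in for the error named in this issue."""


def _default_is_transient(exc: BaseException) -> bool:
    # Assumption: only network-level errors are worth retrying.
    return isinstance(exc, (ConnectionError, TimeoutError))


def validate_with_retry(
    check: Callable[[], object],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    is_transient: Callable[[BaseException], bool] = _default_is_transient,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run check() until it succeeds and return the number of attempts used."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            check()
            return attempt
        except Exception as exc:
            if not is_transient(exc):
                # Likely misconfiguration: fail fast without sleeping.
                raise PrometheusConnectionError(
                    f"Prometheus validation failed (not retried): {exc}"
                ) from exc
            if attempt == max_attempts:
                raise PrometheusConnectionError(
                    f"Prometheus unreachable after {attempt} attempts: {exc}"
                ) from exc
            # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    raise AssertionError("unreachable")
```

Inside `_validate_and_create_client()` this could wrap the existing test query, for example `validate_with_retry(lambda: client.process_query("1"))`. Injecting `sleep` keeps unit tests fast, and the cap on the delay bounds the worst-case startup time.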
Implementation Steps
- Update _validate_and_create_client() to support retries
- Add configurable retry count and delay (optional)
- Implement exponential backoff between attempts
- Preserve existing behavior when MOCK_FITNESS is enabled
- Improve error messaging to include retry attempts
- Add unit tests for retry logic
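The configuration and MOCK_FITNESS steps might look roughly like the sketch below. The variable names `PROMETHEUS_RETRY_ATTEMPTS`, `PROMETHEUS_RETRY_BASE_DELAY`, and `PROMETHEUS_RETRY_MAX_DELAY` are hypothetical, as are `RetryConfig` and `should_validate`. Reading MOCK_FITNESS from an environment-style mapping is also an assumption, since this issue does not say how that flag is exposed.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class RetryConfig:
    max_attempts: int = 4
    base_delay: float = 1.0   # seconds before the second attempt
    max_delay: float = 30.0   # upper bound on any single delay

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "RetryConfig":
        # Hypothetical variable names; unset values fall back to the defaults.
        return cls(
            max_attempts=int(env.get("PROMETHEUS_RETRY_ATTEMPTS", "4")),
            base_delay=float(env.get("PROMETHEUS_RETRY_BASE_DELAY", "1.0")),
            max_delay=float(env.get("PROMETHEUS_RETRY_MAX_DELAY", "30.0")),
        )

    def delays(self) -> List[float]:
        """Delays between attempts: one fewer entry than max_attempts."""
        return [
            min(self.base_delay * 2 ** i, self.max_delay)
            for i in range(self.max_attempts - 1)
        ]


def should_validate(env: Mapping[str, str]) -> bool:
    """Skip Prometheus validation when MOCK_FITNESS is set to a truthy value."""
    return env.get("MOCK_FITNESS", "").strip().lower() not in {"1", "true", "yes"}
```

In real use the mapping would typically be `os.environ`. Exposing `delays()` as a plain list makes the backoff schedule easy to assert on in unit tests and to include in the final error message.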
Acceptance Criteria
No response
Testing Checklist
No response
Benefits
No response
Alternatives Considered
No response
Related Documentation
No response
Additional Context
This feature directly complements Issue #25 by ensuring that dependency services such as Prometheus are handled more safely and robustly during chaos runs.
Retry-based validation aligns Krkn-AI behavior with real-world expectations in production-grade chaos engineering environments.