[Feature]: Add retry mechanism when validating Prometheus connectivity #132

@Ayushdk

Summary

Krkn-AI currently aborts as soon as Prometheus connectivity validation fails once, even though Prometheus may be only temporarily unavailable during chaos experiments. Retrying the validation with backoff would improve resilience and reduce false-negative failures during runs.

Description

Krkn-AI validates Prometheus connectivity during initialization by executing a test query.
If this validation fails once, the entire run aborts immediately.

In real-world chaos engineering scenarios, Prometheus can experience temporary unavailability due to:

  • network disruption
  • control-plane pressure
  • monitoring pod restarts
  • short-lived API latency spikes

A retry mechanism with controlled backoff would allow Krkn-AI to tolerate transient failures while still failing fast in genuine misconfiguration scenarios.

This feature would improve stability and usability, especially during aggressive chaos experiments.

Background and Context

Krkn-AI relies heavily on Prometheus for:

  • fitness evaluation
  • scenario scoring
  • health verification

However, chaos experiments themselves may momentarily impact:

  • Prometheus API availability
  • authentication token validation
  • network routing

Currently, a single failed Prometheus query causes the entire Krkn-AI run to terminate, even when the service becomes available seconds later.

This behavior limits Krkn-AI’s effectiveness in real chaos testing environments.

Current Code Analysis

Current implementation in krkn_ai/utils/prometheus.py:

client = KrknPrometheus(url, token)
client.process_query("1")  # test query used only to validate connectivity

If process_query() raises an exception, Krkn-AI immediately raises PrometheusConnectionError and exits.

There is no retry, delay, or tolerance window for temporary failures.

Issues Identified

  1. Single transient Prometheus failure aborts entire run
  2. No retry or backoff mechanism
  3. Chaos experiments can self-disrupt monitoring temporarily
  4. No distinction between transient outage vs permanent misconfiguration
  5. Reduced reliability in real production-like environments
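One way to make item 4 concrete is a small classifier that decides whether a failure is worth retrying. The sketch below is illustrative only: the function name `is_transient`, the two status-code sets, and the choice of built-in exception types are assumptions, not part of Krkn-AI or krkn-lib. A real implementation would need to map whatever exceptions the Prometheus client actually raises onto these categories.

```python
from typing import Optional

# Assumed split of HTTP status codes; adjust to the deployment.
TRANSIENT_STATUS = frozenset({408, 429, 500, 502, 503, 504})
PERMANENT_STATUS = frozenset({400, 401, 403, 404})


def is_transient(exc: BaseException, status_code: Optional[int] = None) -> bool:
    """Return True if the failure looks temporary and is worth retrying.

    Hypothetical helper: a known status code takes precedence, then the
    exception type is checked. Anything unrecognized is treated as
    permanent so that misconfiguration still fails fast.
    """
    if status_code is not None:
        if status_code in PERMANENT_STATUS:
            return False
        if status_code in TRANSIENT_STATUS:
            return True
    # Built-in network errors (connection reset/refused, timeouts).
    return isinstance(exc, (ConnectionError, TimeoutError))
```

Treating 401 and 403 as permanent is a judgment call. Since chaos experiments may momentarily disrupt token validation, a deployment could reasonably move those codes into the transient set.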

Problem

Krkn-AI cannot tolerate short-lived Prometheus unavailability, even though such failures are expected during chaos experiments.

This leads to:

  • false-negative failures
  • premature termination of experiments
  • reduced confidence in Krkn-AI automation
  • poor resilience under real chaos conditions

Proposed Solution

Introduce a retry mechanism when validating Prometheus connectivity:

  • Retry failed Prometheus test queries multiple times
  • Use exponential backoff between attempts
  • Fail immediately, without retrying, when the error indicates permanent misconfiguration
  • Skip validation entirely when MOCK_FITNESS is enabled

This would allow Krkn-AI to:

  • recover from transient outages
  • continue experiments safely
  • remain strict on genuine configuration errors
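The proposal above could be sketched roughly as follows. This is a minimal illustration, not the Krkn-AI implementation: `validate_with_retry`, its parameters and defaults, and the local `PrometheusConnectionError` stand-in are all assumptions. The connectivity check is passed in as a callable (`probe`) so the sketch does not depend on the real client.

```python
import time
from typing import Callable


class PrometheusConnectionError(Exception):
    """Local stand-in for the Krkn-AI exception named in this issue."""


def validate_with_retry(
    probe: Callable[[], object],
    max_attempts: int = 4,          # assumed default
    base_delay: float = 1.0,        # seconds before the second attempt
    backoff_factor: float = 2.0,    # delay multiplier after each failure
    mock_fitness: bool = False,     # mirrors MOCK_FITNESS
    should_retry: Callable[[Exception], bool] = lambda exc: True,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `probe` until it succeeds and return the attempts used.

    Returns 0 without calling `probe` when `mock_fitness` is true.
    """
    if mock_fitness:
        return 0
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            probe()
            return attempt
        except Exception as exc:
            # Give up on the last attempt, or at once if the error is
            # judged permanent by the caller-supplied predicate.
            if attempt == max_attempts or not should_retry(exc):
                raise PrometheusConnectionError(
                    f"Prometheus validation failed after {attempt} attempt(s): {exc}"
                ) from exc
            sleep(delay)
            delay *= backoff_factor
    raise AssertionError("unreachable: the loop always returns or raises")
```

With the client from the current code, a call might look like `validate_with_retry(lambda: client.process_query("1"))`. Passing `sleep` as a parameter lets tests record the delays instead of actually waiting.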

Implementation Steps

  • Update _validate_and_create_client() to support retries
  • Add configurable retry count and delay (optional)
  • Implement exponential backoff between attempts
  • Preserve existing behavior when MOCK_FITNESS is enabled
  • Improve error messaging to include retry attempts
  • Add unit tests for retry logic
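For the configurable retry count and delay, one possible shape is a small settings object read from environment variables. Everything named here is hypothetical: the `RetryConfig` class, the `PROMETHEUS_RETRY_*` variable names, the defaults, and the `max_delay` cap are assumptions for illustration, not existing Krkn-AI settings.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class RetryConfig:
    """Hypothetical retry settings for Prometheus validation."""

    max_attempts: int = 4
    base_delay: float = 1.0      # seconds
    backoff_factor: float = 2.0
    max_delay: float = 30.0      # cap so the wait never grows unbounded

    def __post_init__(self) -> None:
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "RetryConfig":
        """Build a config from env vars, falling back to the defaults."""
        return cls(
            max_attempts=int(env.get("PROMETHEUS_RETRY_ATTEMPTS", cls.max_attempts)),
            base_delay=float(env.get("PROMETHEUS_RETRY_BASE_DELAY", cls.base_delay)),
            backoff_factor=float(env.get("PROMETHEUS_RETRY_BACKOFF", cls.backoff_factor)),
            max_delay=float(env.get("PROMETHEUS_RETRY_MAX_DELAY", cls.max_delay)),
        )

    def delays(self) -> List[float]:
        """Waits between attempts: one fewer than the attempt count."""
        return [
            min(self.base_delay * self.backoff_factor ** i, self.max_delay)
            for i in range(self.max_attempts - 1)
        ]
```

Exposing the schedule through `delays()` keeps the backoff arithmetic in one place, which also makes it straightforward to assert on in unit tests without sleeping.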

Additional Context

This feature directly complements Issue #25 by ensuring that dependency services such as Prometheus are handled more safely and robustly during chaos runs.

Retry-based validation aligns Krkn-AI behavior with real-world expectations in production-grade chaos engineering environments.

Labels

enhancement (New feature or request)