[Feature]: Add retry mechanism when validating Prometheus connectivity #132

### Summary

Krkn-AI currently aborts as soon as Prometheus connectivity validation fails once, even though Prometheus may be temporarily unavailable during chaos experiments. Adding a retry mechanism with backoff would improve resilience and reduce false-negative failures during runs.

### Description

Krkn-AI validates Prometheus connectivity during initialization by executing a test query.
If this validation fails once, the entire run aborts immediately.

In real-world chaos engineering scenarios, Prometheus can experience temporary unavailability due to:
- network disruption
- control-plane pressure
- monitoring pod restarts
- short-lived API latency spikes

A retry mechanism with controlled backoff would allow Krkn-AI to tolerate transient failures while still failing fast in genuine misconfiguration scenarios.

This feature would improve stability and usability, especially during aggressive chaos experiments.

### Background and Context

Krkn-AI relies heavily on Prometheus for:
- fitness evaluation
- scenario scoring
- health verification

However, chaos experiments themselves may momentarily impact:
- Prometheus API availability
- authentication token validation
- network routing

Currently, a single failed Prometheus query causes the entire Krkn-AI run to terminate, even when the service becomes available seconds later.

This behavior limits Krkn-AI’s effectiveness in real chaos testing environments.

### Current Code Analysis

The current implementation lives in `krkn_ai/utils/prometheus.py`:

`client = KrknPrometheus(url, token)`
`client.process_query("1")`

If `process_query()` raises an exception, Krkn-AI immediately raises `PrometheusConnectionError` and exits.

There is no retry, delay, or tolerance window for temporary failures.

### Issues Identified

1. A single transient Prometheus failure aborts the entire run
2. There is no retry or backoff mechanism
3. Chaos experiments can temporarily disrupt the monitoring stack they depend on
4. There is no distinction between a transient outage and a permanent misconfiguration
5. Reliability is reduced in production-like environments
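
To illustrate item 4, here is a minimal sketch of how a failed check might be classified as transient or permanent. The helper name `is_transient_failure` and both status-code sets are assumptions made for illustration; they are not part of Krkn-AI or of any Prometheus client library. Since chaos runs can briefly disrupt token validation, a project could reasonably decide to treat 401 as retryable instead.

```python
from typing import Optional

# Hypothetical classification tables (assumptions, not Krkn-AI behavior).
TRANSIENT_HTTP_STATUSES = frozenset({429, 502, 503, 504})
PERMANENT_HTTP_STATUSES = frozenset({400, 401, 403, 404})


def is_transient_failure(exc: BaseException, http_status: Optional[int] = None) -> bool:
    """Return True when a failed connectivity check looks worth retrying."""
    # An explicit HTTP status, when available, takes precedence.
    if http_status is not None:
        if http_status in PERMANENT_HTTP_STATUSES:
            return False
        if http_status in TRANSIENT_HTTP_STATUSES:
            return True
    # Otherwise fall back to the exception type: network-level errors and
    # timeouts are treated as transient, everything else as permanent.
    return isinstance(exc, (ConnectionError, TimeoutError))
```

With such a predicate, a wrong URL scheme or a malformed token would still fail on the first attempt, while a refused connection during a pod restart would be retried.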

### Problem

Krkn-AI cannot tolerate short-lived Prometheus unavailability, even though such failures are expected during chaos experiments.

This leads to:
- false-negative failures
- premature termination of experiments
- reduced confidence in Krkn-AI automation
- poor resilience under real chaos conditions

### Proposed Solution

Introduce a retry mechanism when validating Prometheus connectivity:
- Retry failed Prometheus test queries a bounded number of times
- Use exponential backoff between attempts
- Preserve immediate failure for permanent misconfiguration
- Skip validation entirely when `MOCK_FITNESS` is enabled

This would allow Krkn-AI to:
- recover from transient outages
- continue experiments safely
- remain strict on genuine configuration errors
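
The proposal above could look roughly like the following sketch. It is not the existing Krkn-AI code: `validate_with_retry`, its parameters, and the choice of `ConnectionError` and `TimeoutError` as retryable exceptions are all assumptions, and `PrometheusConnectionError` is redefined here only so the example is self-contained.

```python
import time
from typing import Callable, Tuple, Type


class PrometheusConnectionError(Exception):
    """Stand-in for the exception named in this issue."""


def validate_with_retry(
    check: Callable[[], object],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call check() until it succeeds and return the number of attempts used."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            check()
            return attempt
        except retry_on as exc:
            # Transient failure: give up only once the attempt budget is spent.
            if attempt == max_attempts:
                raise PrometheusConnectionError(
                    f"Prometheus validation failed after {attempt} attempts: {exc}"
                ) from exc
            # Exponential backoff: base, 2x base, 4x base, ... capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
        except Exception as exc:
            # Anything else is treated as permanent and fails immediately.
            raise PrometheusConnectionError(
                f"Prometheus validation failed (not retried): {exc}"
            ) from exc
    raise AssertionError("unreachable")
```

Assuming the client from the code analysis above, the test query would be wrapped as `validate_with_retry(lambda: client.process_query("1"))`, and the caller would skip the call entirely when `MOCK_FITNESS` is enabled. Passing `sleep` as a parameter lets unit tests record the delays without actually waiting.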

### Implementation Steps

- Update `_validate_and_create_client()` to support retries
- Add configurable retry count and delay (optional)
- Implement exponential backoff between attempts
- Preserve existing behavior when `MOCK_FITNESS` is enabled
- Improve error messaging to include the number of retry attempts
- Add unit tests for retry logic
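
For the configurable retry count and delay, one option is a small settings object that can be populated from the environment. This is only a sketch: `RetryConfig`, the environment variable names `PROMETHEUS_RETRY_ATTEMPTS`, `PROMETHEUS_RETRY_BASE_DELAY`, and `PROMETHEUS_RETRY_MAX_DELAY`, and the default values are all hypothetical and do not exist in Krkn-AI today.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class RetryConfig:
    """Hypothetical retry settings for Prometheus connectivity validation."""

    max_attempts: int = 5
    base_delay: float = 1.0   # seconds before the second attempt
    max_delay: float = 30.0   # upper bound on any single wait

    def __post_init__(self) -> None:
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        if self.base_delay < 0 or self.max_delay < 0:
            raise ValueError("delays must be non-negative")

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "RetryConfig":
        """Build a config from a mapping such as os.environ (names are assumed)."""
        defaults = cls()
        return cls(
            max_attempts=int(env.get("PROMETHEUS_RETRY_ATTEMPTS", defaults.max_attempts)),
            base_delay=float(env.get("PROMETHEUS_RETRY_BASE_DELAY", defaults.base_delay)),
            max_delay=float(env.get("PROMETHEUS_RETRY_MAX_DELAY", defaults.max_delay)),
        )

    def delays(self) -> List[float]:
        """Waits between attempts: doubling from base_delay, capped at max_delay."""
        return [
            min(self.base_delay * 2 ** i, self.max_delay)
            for i in range(self.max_attempts - 1)
        ]
```

Exposing the schedule through `delays()` keeps the backoff arithmetic in one place, so unit tests can assert on the exact sequence of waits without patching `time.sleep`.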

### Acceptance Criteria

_No response_

### Testing Checklist

_No response_

### Benefits

_No response_

### Alternatives Considered

_No response_

### Related Documentation

_No response_

### Additional Context

This feature directly complements Issue #25 by ensuring that dependency services such as Prometheus are handled more safely and robustly during chaos runs.

Retry-based validation aligns Krkn-AI behavior with real-world expectations in production-grade chaos engineering environments.
