Open
Labels
- enhancement: New feature or request
- priority: critical: P0 - Must fix for production stability
- resilience: Error handling, retries, rate limiting
Description
Summary
All controllers either return errors immediately or use a fixed requeue delay. Transient API failures cause tight retry loops that can overwhelm the UptimeRobot API and generate excessive log noise.
Current Behaviour
- Monitor controller: fixed 2s requeue when contacts not ready
- All controllers: return the error immediately on API failures, leaving retries to controller-runtime's default backoff
- No distinction between transient and permanent errors
- No jitter on requeue intervals
Proposed Changes
- Wrap API errors with retryable/non-retryable classification
- Use `RequeueAfter` with exponential backoff for transient errors (e.g. 5s, 10s, 20s, 40s, cap at 5m)
- Add jitter (±10-20%) to prevent thundering herd across resources
- Return permanent errors without requeue to avoid infinite retry loops
- Account and Contact controllers should requeue on transient failures (currently they don't)
Acceptance Criteria
- Transient API errors (5xx, timeouts, connection errors) trigger backoff requeue
- Permanent errors (400, 401, 403, 404) do not cause retry loops
- Backoff increases exponentially with a configurable cap
- Jitter prevents synchronised retries
- Account and Contact controllers handle transient failures
- Unit tests validate backoff behaviour
Phase
P0 — Resilience & Correctness (Phase 1: Harden)
Projects
Status
In progress