feat(controller): add exponential backoff for transient API errors #49

@joelp172

Description

Summary

All controllers either return errors immediately or use a fixed requeue delay. Transient API failures cause tight retry loops that can overwhelm the UptimeRobot API and generate excessive log noise.

Current Behaviour

  • Monitor controller: fixed 2s requeue when contacts not ready
  • All controllers: immediate error return on API failures (controller-runtime default backoff)
  • No distinction between transient and permanent errors
  • No jitter on requeue intervals

Proposed Changes

  • Wrap API errors with retryable/non-retryable classification
  • Use RequeueAfter with exponential backoff for transient errors (e.g. 5s, 10s, 20s, 40s, cap at 5m)
  • Add jitter (±10-20%) to prevent thundering herd across resources
  • Return permanent errors without requeue to avoid infinite retry loops
  • Account and Contact controllers should requeue on transient failures (currently they don't)

Acceptance Criteria

  • Transient API errors (5xx, timeouts, connection errors) trigger backoff requeue
  • Permanent errors (400, 401, 403, 404) do not cause retry loops
  • Backoff increases exponentially with a configurable cap
  • Jitter prevents synchronised retries
  • Account and Contact controllers handle transient failures
  • Unit tests validate backoff behaviour

Phase

P0 — Resilience & Correctness (Phase 1: Harden)

Metadata

Labels

  • enhancement (New feature or request)
  • priority: critical (P0 - Must fix for production stability)
  • resilience (Error handling, retries, rate limiting)

Projects

Status: In progress
