feat: resilient background job retry with exponential backoff and monitoring #592

Open
promisingcoder wants to merge 4 commits into rohitdash08:main from promisingcoder:bounty/130-resilient-jobs

Summary

Implements resilient background job retry & monitoring for async job execution.

/claim #130

Changes

Retry Mechanism (Exponential Backoff)

  • Configurable via JOB_MAX_RETRIES (default 3) and JOB_RETRY_DELAYS (default "5,15,45", in minutes)
  • Backoff: 5 min → 15 min → 45 min; a reminder is marked permanently failed once the maximum number of retries is exhausted
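
As a rough illustration of the retry schedule described above, here is a minimal sketch of how the two environment variables could be parsed and turned into a next-retry timestamp. The helper names `load_retry_config` and `next_retry_at` are hypothetical, and reusing the last delay when the list is shorter than the retry limit is an assumption, not something this PR states.

```python
import os
from datetime import datetime, timedelta, timezone


def load_retry_config(env=os.environ):
    """Read JOB_MAX_RETRIES and JOB_RETRY_DELAYS, falling back to the documented defaults."""
    max_retries = int(env.get("JOB_MAX_RETRIES", "3"))
    raw_delays = env.get("JOB_RETRY_DELAYS", "5,15,45")
    delays = [int(part) for part in raw_delays.split(",") if part.strip()]
    return max_retries, delays


def next_retry_at(failures, now, max_retries, delays):
    """Return the next retry time after `failures` failed attempts, or None when exhausted.

    Assumption: if fewer delays than retries are configured, the last delay is reused.
    """
    if failures > max_retries:
        return None  # retries exhausted: the caller marks the job permanently failed
    index = min(failures, len(delays)) - 1
    return now + timedelta(minutes=delays[index])


max_retries, delays = load_retry_config({})  # empty mapping, so the defaults apply
now = datetime(2026, 3, 20, 22, 0, tzinfo=timezone.utc)
for failures in range(1, 5):
    print(failures, next_retry_at(failures, now, max_retries, delays))
```

With the defaults, the first three failures schedule retries 5, 15, and 45 minutes out, and the fourth failure yields `None`.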

Job State Tracking

  • Added retry_count, last_error, next_retry_at, failed, retry_status to Reminder model
  • Backward-compatible migration with ADD COLUMN IF NOT EXISTS
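
The idempotent migration could be generated along these lines. This is only a sketch: the table name `reminders` and every column type below are assumptions, since the PR names the columns but not their definitions. `ADD COLUMN IF NOT EXISTS` is PostgreSQL syntax, so re-running the statements on an already-migrated database is a no-op.

```python
# Assumed column definitions; only the column names come from the PR description.
RETRY_COLUMNS = {
    "retry_count": "INTEGER NOT NULL DEFAULT 0",
    "last_error": "TEXT",
    "next_retry_at": "TIMESTAMP",
    "failed": "BOOLEAN NOT NULL DEFAULT FALSE",
    "retry_status": "VARCHAR(20)",
}


def compatibility_statements(table="reminders"):
    """Build one idempotent ALTER TABLE statement per retry column."""
    return [
        f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {name} {ddl}"
        for name, ddl in RETRY_COLUMNS.items()
    ]


for statement in compatibility_statements():
    print(statement)
```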

Pure Dispatch Function

  • dispatch_reminders(candidates, sender_func, now) — zero DB coupling, fully testable
  • Separate run_dispatch_cycle() wrapper for DB operations
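
To show why the pure function is easy to test, here is a simplified sketch of the shape `dispatch_reminders(candidates, sender_func, now)` might take. It is not the code in this PR: the `Reminder` dataclass stands in for the real model, the `sent` field, the status strings, the summary dict, and the convention that `sender_func` returns a truthy value on success are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_RETRIES = 3            # mirrors the JOB_MAX_RETRIES default
RETRY_DELAYS = [5, 15, 45]  # minutes, mirrors the JOB_RETRY_DELAYS default


@dataclass
class Reminder:
    """Stand-in for the ORM model; `sent` is an assumed field."""
    id: int
    sent: bool = False
    retry_count: int = 0
    last_error: Optional[str] = None
    next_retry_at: Optional[datetime] = None
    failed: bool = False
    retry_status: str = "pending"


def dispatch_reminders(candidates, sender_func, now):
    """Try to send each candidate; mutate retry state in memory, never touch the DB."""
    summary = {"sent": 0, "retrying": 0, "failed": 0, "skipped": 0}
    for reminder in candidates:
        not_due = reminder.next_retry_at is not None and reminder.next_retry_at > now
        if reminder.sent or reminder.failed or not_due:
            summary["skipped"] += 1
            continue
        try:
            ok = bool(sender_func(reminder))
            error = None if ok else "sender returned a falsy result"
        except Exception as exc:  # capture the error instead of crashing the cycle
            ok, error = False, str(exc)
        if ok:
            reminder.sent, reminder.retry_status, reminder.next_retry_at = True, "sent", None
            summary["sent"] += 1
            continue
        reminder.retry_count += 1
        reminder.last_error = error
        if reminder.retry_count > MAX_RETRIES:
            reminder.failed, reminder.retry_status, reminder.next_retry_at = True, "failed", None
            summary["failed"] += 1
        else:
            delay = RETRY_DELAYS[min(reminder.retry_count, len(RETRY_DELAYS)) - 1]
            reminder.next_retry_at = now + timedelta(minutes=delay)
            reminder.retry_status = "retrying"
            summary["retrying"] += 1
    return summary
```

Because the function only takes plain objects, a callable, and a timestamp, a test can pass a fake sender and a fixed `now` and assert on the mutated fields; a wrapper such as `run_dispatch_cycle()` is then left with fetching candidates and committing the result.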

Monitoring Endpoints

  • GET /jobs/status — scheduler health (no auth)
  • GET /jobs/reminders/stats — counts by status (JWT)
  • POST /jobs/reminders/run — manual trigger (JWT)
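
For the stats endpoint, the aggregation behind "counts by status" could be as small as the sketch below. The status labels and the `reminder_stats` helper are hypothetical; the PR does not list the values `retry_status` can take or the response shape.

```python
from collections import Counter

# Hypothetical status labels; the PR does not enumerate them.
KNOWN_STATUSES = ("pending", "retrying", "sent", "failed")


def reminder_stats(statuses):
    """Count reminders per status, reporting zero for statuses that do not occur."""
    counts = Counter(statuses)
    stats = {name: counts.get(name, 0) for name in KNOWN_STATUSES}
    stats["total"] = sum(counts.values())
    return stats


print(reminder_stats(["sent", "sent", "retrying", "failed"]))
```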

Tests

  • 21 new tests (43 total), all passing
  • Covers: backoff logic, dispatch success/failure/max-retries, endpoints, auth

Documentation

  • README updated with retry system docs, endpoint reference, env var configuration

Fixes #130

promisingcoder and others added 4 commits March 20, 2026 22:22
… monitoring endpoints

- Add retry state columns to Reminder model: retry_count, last_error,
  next_retry_at, failed, retry_status; include PostgreSQL ALTER TABLE
  compatibility patches in _ensure_schema_compatibility
- Implement dispatch_reminders() in app/services/jobs.py: queries due
  reminders, attempts send_reminder(), and schedules up to 3 retries
  with 5/15/45-minute exponential backoff; permanently marks failed
  after max retries are exhausted
- Wire APScheduler into create_app with a 1-minute interval job;
  suppress scheduler startup when FLASK_ENV=testing or TESTING=True
- Add /jobs blueprint with GET /jobs/status (scheduler state), GET
  /jobs/reminders/stats (JWT, aggregate counts), POST /jobs/reminders/run
  (JWT, on-demand dispatch trigger)
- Add 21 comprehensive tests in test_jobs.py covering success, skip,
  backoff delays, max-retry failure, exception capture, endpoint auth,
  and stats aggregation
- Add _FakeRedis in-memory stub to conftest.py (autouse) so all tests
  run without a live Redis server; fixes pre-existing Redis-related
  test failures across the suite (42 → 43 passing)
- Update README with retry schedule table, column docs, and monitoring
  endpoint reference

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…vars

JOB_MAX_RETRIES (default 3) and JOB_RETRY_DELAYS (default '5,15,45')
can now be set as environment variables instead of hardcoded constants.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- dispatch_reminders(candidates, sender_func, now) is now a pure function
  with no DB calls — only retry/backoff logic
- Added run_dispatch_cycle() wrapper that handles DB fetch + commit
- Updated routes to call run_dispatch_cycle
- Tests refactored: dispatch logic tests use mock objects directly,
  DB filtering tests use run_dispatch_cycle
- All 43 tests pass

Development

Successfully merging this pull request may close these issues.

Resilient background job retry & monitoring
