sparshhbuilds/Alertforge--Event-based-backend-alerting-service.

AlertForge - Event-Driven Alerting Engine

AlertForge is a production-grade, event-driven alerting system demonstrating distributed systems reliability patterns. It features asynchronous ingestion, Redis-based sliding-window rate limiting, SHA-256 fingerprint deduplication, exponential backoff retries with dead-letter queue (DLQ) support, and scheduled AI-driven alert storm summarization.

Built with FastAPI, Celery, Redis, PostgreSQL, React, and Docker.


System Architecture

graph TD
    Client[React Simulator Dashboard] -->|POST /api/v1/alerts| API[FastAPI Ingestion Server]
    API -->|1. Generate Hash & Check| Dedup{Redis Deduplication}
    
    Dedup -->|Active Hash Exists| Suppress[Increment dup_count in PostgreSQL]
    Dedup -->|New Unique Hash| Store[Save Event to PostgreSQL]
    
    Store -->|2. Register Suppression Key| RedisCache[Cache Hash in Redis for 60s]
    Store -->|3. Enqueue Background Job| Queue[Redis Task Broker]
    
    Queue -->|4. Pull Job| Worker[Celery Worker]
    Worker -->|5. Verify Dispatch Rate| RateLimiter{Redis sliding-window rate limiter}
    
    RateLimiter -->|Rate Exceeded| Retry[Trigger Exponential Backoff]
    RateLimiter -->|Under Limit| Dispatch[Send Notifications]
    
    Dispatch -->|Webhook| Slack[Slack Channel]
    Dispatch -->|SMTP| Mailhog[Mailhog Email Catcher]
    
    Retry -->|Max 3 Retries Exceeded| DLQ[Save to Dead-Letter Queue Table]
    
    Scheduler[Celery Beat Scheduler] -->|Every 60s| Summarizer[AI Summarizer Task]
    Summarizer -->|Query Last 5m| DB[(PostgreSQL Database)]
    DB -->|Provide Logs| Summarizer
    Summarizer -->|6. Generate Digest report| Gemini[Google Gemini LLM / Fallback]
    Gemini -->|Save Digest| DB
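Steps 1 and 2 of the diagram (hash, check, register a suppression key) can be sketched roughly as follows. This is a minimal illustration, not the project's code: `fingerprint`, `InMemoryDedupStore`, and the strip-and-lowercase normalization are assumptions, and an in-memory dictionary stands in for Redis so the example runs on its own.

```python
import hashlib
import time
from typing import Dict, Optional


def fingerprint(service: str, event_type: str, severity: str) -> str:
    """Build a SHA-256 fingerprint from normalized alert fields.

    The normalization (strip + lowercase, joined with '|') is an assumption.
    """
    normalized = "|".join(
        part.strip().lower() for part in (service, event_type, severity)
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class InMemoryDedupStore:
    """Hypothetical stand-in for Redis keys that expire after a TTL."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self.ttl_seconds = ttl_seconds
        self._expires_at: Dict[str, float] = {}

    def is_duplicate(self, key: str, now: Optional[float] = None) -> bool:
        """Return True while the key is active; otherwise register it."""
        now = time.monotonic() if now is None else now
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return True  # active hash exists: caller would increment dup_count
        self._expires_at[key] = now + self.ttl_seconds
        return False  # new unique hash: caller would store and enqueue


store = InMemoryDedupStore(ttl_seconds=60)
key = fingerprint("payments", "timeout", "critical")
print(store.is_duplicate(key, now=0.0))   # first sighting
print(store.is_duplicate(key, now=30.0))  # inside the 60-second window
```

Against a real Redis instance, one common way to make the check and the registration a single atomic step is `SET key value NX EX 60`; whether AlertForge does this or uses separate calls is not stated here.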

Reliability Patterns Demonstrated

| Pattern | Technical Implementation |
| --- | --- |
| Asynchronous Ingestion | The ingestion API accepts payloads, writes them to PostgreSQL, and returns HTTP 202 Accepted immediately. Dispatch is offloaded to background Celery workers. |
| Fingerprint Deduplication | Normalizes incoming alerts (service, event type, severity) and generates a SHA-256 fingerprint. Redis keys with a 60-second time-to-live suppress identical alerts and update the duplicate count on the parent record. |
| Exponential Backoff | Notifications that fail because of network timeouts or rate limits are retried at increasing delays (2^n seconds, where n is the retry attempt). |
| Dead Letter Queue (DLQ) | Tasks that exceed the maximum retry threshold (3 retries) are marked as DLQ and written to a dedicated database table together with the raw error log and payload context. |
| Sliding-Window Rate Limiting | Enforces a maximum dispatch rate (e.g., 5 notifications per channel per minute) using Redis Sorted Sets (ZSET) to protect downstream services from spam during outages. |
| AI Storm Summarizer | Celery Beat runs the summarizer task every 60 seconds. If a storm is active (5 or more unique alerts in 5 minutes), the Gemini API groups the logs and compiles a root-cause incident report. |
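To make the sliding-window idea concrete, here is a small sketch of the algorithm. It is not the project's implementation: `SlidingWindowLimiter` is a hypothetical name, and a sorted Python list per channel stands in for the Redis sorted set, with comments noting the Redis command each step would roughly correspond to.

```python
import bisect
from collections import defaultdict
from typing import DefaultDict, List


class SlidingWindowLimiter:
    """In-memory sketch of a per-channel sliding-window rate limiter."""

    def __init__(self, limit: int = 5, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # One ascending list of dispatch timestamps per channel.
        self._timestamps: DefaultDict[str, List[float]] = defaultdict(list)

    def allow(self, channel: str, now: float) -> bool:
        """Return True and record the dispatch if the channel is under its limit."""
        stamps = self._timestamps[channel]
        # Drop entries that have aged out of the window (cf. ZREMRANGEBYSCORE).
        cutoff = now - self.window_seconds
        del stamps[: bisect.bisect_right(stamps, cutoff)]
        # Count what remains in the window (cf. ZCARD).
        if len(stamps) >= self.limit:
            return False
        # Record this dispatch (cf. ZADD with the timestamp as score).
        bisect.insort(stamps, now)
        return True


limiter = SlidingWindowLimiter(limit=5, window_seconds=60)
results = [limiter.allow("slack", float(t)) for t in range(7)]
print(results)  # first five dispatches pass, the next two are blocked
```

Unlike a fixed one-minute bucket, the window moves with each request, so a burst that straddles a minute boundary cannot exceed the limit. In Redis, the three steps would normally need to run atomically (for example in a transaction or Lua script); that detail is omitted from this sketch.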

Getting Started

Prerequisites

  • Docker Desktop installed and running

1. Clone and Configure

git clone https://github.com/YOUR_USERNAME/AlertForge.git
cd AlertForge
cp .env.example .env

Set the following values in your .env file. Both are optional; if either is left empty, the application runs that integration with a mock engine:

SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T.../B.../xxx
GEMINI_API_KEY=your_gemini_api_key
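One way the "optional setting, mock fallback" behavior could be wired up is sketched below. The `Settings` class, `load_settings` function, and the `use_mock_*` properties are hypothetical names for illustration; the actual loader in `config.py` may look different.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; None means 'not configured'."""

    slack_webhook_url: Optional[str]
    gemini_api_key: Optional[str]

    @property
    def use_mock_slack(self) -> bool:
        return self.slack_webhook_url is None

    @property
    def use_mock_summarizer(self) -> bool:
        return self.gemini_api_key is None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read both variables, treating missing or blank values as unset."""

    def clean(name: str) -> Optional[str]:
        value = env.get(name, "").strip()
        return value or None

    return Settings(
        slack_webhook_url=clean("SLACK_WEBHOOK_URL"),
        gemini_api_key=clean("GEMINI_API_KEY"),
    )


settings = load_settings({"SLACK_WEBHOOK_URL": "", "GEMINI_API_KEY": "abc"})
print(settings.use_mock_slack, settings.use_mock_summarizer)
```

Treating a blank value the same as a missing one matters here because copying `.env.example` typically leaves the keys present but empty.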

2. Launch Services

Run the application containers:

docker compose up --build

3. Service Access Endpoints

| Component | URL |
| --- | --- |
| React Dashboard | http://localhost:5173 |
| API Docs (Swagger) | http://localhost:8000/docs |
| Mailhog Web Console | http://localhost:8025 |

Verification and Testing

Through the Dashboard

  1. Deduplication Verification: Click the "Ingest Incident Alert" button multiple times rapidly. A single alert is created in the database, and subsequent duplicate alerts increment the suppression counter.
  2. Retry and DLQ Verification: Toggle the "Fail Slack Webhook" checkbox and send an alert. Watch the worker terminal as it retries with exponentially increasing delays. Once the maximum number of retries is exceeded, the alert's status changes to DLQ.
  3. Rate Limiting Verification: Trigger a simulated storm using the "Simulate Outage Storm" button. Check the worker terminal logs to see the rate limiter block notification delivery after 5 dispatches.
  4. AI Storm Summarization: Generate a storm. Click "Run AI Summarization" to trigger the summarizer task. The compiled report will display under the AI Storm Digests tab.
  5. Email Delivery Verification: Open the Mailhog console at http://localhost:8025 to view captured notification emails.
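When checking step 2, it helps to know roughly which delays to expect in the worker logs. The sketch below simulates the retry schedule without sleeping. It assumes the retry attempt n counts from 1, giving delays of 2, 4, and 8 seconds; if the worker counts from 0, the delays would be 1, 2, and 4 instead. `run_with_retries` and `backoff_delay` are hypothetical helpers, not functions from `tasks.py`.

```python
from typing import Callable, List, Tuple

MAX_RETRIES = 3  # matches the documented retry threshold


def backoff_delay(attempt: int) -> int:
    """Delay in seconds before retry number `attempt` (assumed to start at 1)."""
    return 2 ** attempt


def run_with_retries(
    send: Callable[[], None], max_retries: int = MAX_RETRIES
) -> Tuple[str, List[int]]:
    """Call `send`, retrying on failure; return the outcome and planned delays."""
    delays: List[int] = []
    attempt = 0
    while True:
        try:
            send()
            return "sent", delays
        except Exception:
            attempt += 1
            if attempt > max_retries:
                # Retries exhausted: a real worker would write a DLQ row here.
                return "dlq", delays
            # A real worker would wait this long before the next try.
            delays.append(backoff_delay(attempt))


def failing_webhook() -> None:
    raise TimeoutError("simulated Slack failure")


print(run_with_retries(failing_webhook))
```

With a sender that always fails, the function makes one initial call plus three retries before reporting `dlq`, which mirrors the status change described in step 2.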

Running Automated Integration Tests

Execute the test suite inside the API container:

docker compose exec api pytest -v

Project Structure

AlertForge/
├── docker-compose.yml          # Multi-container service configuration
├── .env.example                # Configuration template
├── README.md                   # Systems documentation
└── backend/
    ├── Dockerfile
    ├── requirements.txt
    └── app/
        ├── main.py             # FastAPI routing and request controllers
        ├── models.py           # SQLAlchemy database schemas (Events, DLQ, Digests)
        ├── schemas.py          # Pydantic validation rules
        ├── config.py           # Settings and configuration loader
        ├── database.py         # SQLAlchemy engine and connection manager
        ├── redis_client.py     # Deduplication and rate limiter logic
        ├── worker.py           # Celery application configuration and schedule
        ├── tasks.py            # Notification worker and task runners
        ├── summarizer.py       # AI Storm Summarizer service
        └── tests/
            └── test_alerts.py  # Automated integration tests

Technologies Used

  • Backend API: Python, FastAPI, SQLAlchemy, Pydantic
  • Background Workers: Celery, Redis (broker and backend)
  • Database: PostgreSQL
  • Cache: Redis sorted sets (rate limiting) and TTL keys (deduplication)
  • AI Engine: Google Gemini 1.5 Flash (with local fallback)
  • Notifications: Slack Webhooks, SMTP (via Mailhog)
  • Frontend: React, Vite, CSS
  • Infrastructure: Docker Compose
