Hackathon Demo: AI-Powered Incident Response System

Overview

This demo showcases a complete incident response workflow using Motia's event-driven architecture with three key steps:

  1. Ingest (API Step) - Receives incident alerts
  2. Analyze (Event Step) - AI-powered analysis and decision making
  3. Remediate (Event Step) - Durable workflow with idempotency

Architecture

HTTP POST → 1-ingest.step.ts → [incident.detected] → 2-analyze.step.ts → [fix.approved] → 3-remediate.step.ts
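
The wiring above can be sketched with a minimal in-memory event bus. This illustrates the topic flow only and is not Motia's API: `EventBus`, `subscribe`, and `emit` are hypothetical names, and the analyze stub always approves a fix.

```typescript
// Minimal in-memory sketch of the topic wiring. Not Motia's API:
// EventBus, subscribe, and emit are hypothetical names for illustration.
type Handler = (data: unknown) => void;

class EventBus {
  private handlers = new Map<string, Handler[]>();

  // Register a handler for a topic.
  subscribe(topic: string, handler: Handler): void {
    const list = this.handlers.get(topic) ?? [];
    list.push(handler);
    this.handlers.set(topic, list);
  }

  // Deliver data to every handler registered for the topic.
  emit(topic: string, data: unknown): void {
    for (const handler of this.handlers.get(topic) ?? []) {
      handler(data);
    }
  }
}

const bus = new EventBus();
const trace: string[] = [];

// 2-analyze: consumes incident.detected, emits fix.approved
bus.subscribe("incident.detected", (data) => {
  trace.push("analyze");
  bus.emit("fix.approved", data);
});

// 3-remediate: consumes fix.approved
bus.subscribe("fix.approved", () => {
  trace.push("remediate");
});

// 1-ingest: the API step emits incident.detected after accepting the POST
trace.push("ingest");
bus.emit("incident.detected", {
  serviceName: "payment-service",
  severity: "critical",
});

console.log(trace.join(" → "));
```

Because the steps only share topic names, each one can be replaced or tested without touching the others.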

Files Created

  • steps/1-ingest.step.ts - API endpoint for incident ingestion
  • steps/2-analyze.step.ts - AI analysis engine
  • steps/3-remediate.step.ts - Durable remediation workflow

How to Run

1. Start the Development Server

npm run dev

This will start:

  • The Motia backend server
  • The Workbench UI (visual workflow designer)

2. Test the Workflow

Send an Incident Alert

curl -X POST http://localhost:3000/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "serviceName": "payment-service",
    "severity": "critical",
    "message": "High memory usage detected - 95% utilization"
  }'

Expected response:

{
  "status": "accepted",
  "incidentId": "incident-1234567890-abc123"
}

3. Watch the Logs

You'll see the workflow progress through the logs:

  1. Ingest Step: Incident received and emitted
  2. Analyze Step (after ~2 seconds): AI analysis complete, fix approved
  3. Remediate Step (after ~10 seconds): Remediation completed

4. Test Durability (The Cool Part!)

To demonstrate durability and idempotency:

  1. Send an incident alert (as above)
  2. Watch the logs - you'll see "⚠️ DURABILITY TEST: Kill the server now to test recovery! ⚠️"
  3. Kill the server (Ctrl+C) during the 10-second wait
  4. Restart the server with npm run dev
  5. The remediation step will automatically resume from where it left off!

After the restart, the step checks its saved state and logs: "Resuming remediation after server restart..."

Key Features Demonstrated

1. Event-Driven Architecture

  • API Step emits incident.detected event
  • Analyze Step subscribes to incident.detected, emits fix.approved
  • Remediate Step subscribes to fix.approved

2. Type Safety

  • All steps use Zod schemas for validation
  • TypeScript types auto-generated in types.d.ts
  • Full type inference across the workflow
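
To show the kind of check the schemas perform, here is a dependency-free stand-in written as a plain TypeScript type guard. It is not the project's Zod schema: the field names come from the curl examples in this README, while `IncidentAlert`, `SEVERITIES`, and `isIncidentAlert` are hypothetical names.

```typescript
// Dependency-free stand-in for a Zod schema. Field names are taken from the
// curl examples; the guard and its names are illustrative assumptions.
const SEVERITIES = ["critical", "warning", "info"] as const;
type Severity = (typeof SEVERITIES)[number];

interface IncidentAlert {
  serviceName: string;
  severity: Severity;
  message: string;
}

// Narrows an unknown request body to IncidentAlert, or returns false.
function isIncidentAlert(value: unknown): value is IncidentAlert {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.serviceName === "string" &&
    typeof v.message === "string" &&
    SEVERITIES.some((s) => s === v.severity)
  );
}

const ok = isIncidentAlert({
  serviceName: "auth-service",
  severity: "critical",
  message: "Service unresponsive",
});

// "fatal" is not one of the three accepted severities, so this is rejected.
const bad = isIncidentAlert({
  serviceName: "auth-service",
  severity: "fatal",
  message: "Service unresponsive",
});

console.log(ok, bad);
```

A schema library gives the same narrowing plus structured error messages, which is why the steps use Zod rather than hand-written guards.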

3. Durability & Idempotency

  • State management tracks remediation progress
  • Work is not lost when the server crashes
  • Steps can resume from checkpoints

4. AI Simulation

  • 2-second delay simulates AI processing
  • Decision logic based on severity:
    • critical → restart pod
    • warning → scale resources
    • info → monitor only
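
The severity mapping above can be sketched as a pure function. The action names (`restart-pod`, `scale-resources`, `monitor-only`) and the function names are hypothetical; the actual step may encode its decision differently.

```typescript
// Sketch of the severity-based decision. Action and function names are
// illustrative assumptions, not identifiers from the demo's source.
type Severity = "critical" | "warning" | "info";
type Action = "restart-pod" | "scale-resources" | "monitor-only";

function decideAction(severity: Severity): Action {
  switch (severity) {
    case "critical":
      return "restart-pod";
    case "warning":
      return "scale-resources";
    case "info":
      return "monitor-only";
  }
}

// Only restart and scale call for a fix; "info" incidents are monitored
// and never reach the remediate step.
function needsRemediation(action: Action): boolean {
  return action !== "monitor-only";
}

console.log(decideAction("warning"), needsRemediation(decideAction("info")));
```

Keeping the decision free of I/O makes it easy to swap the 2-second simulated delay for a real model call later without changing the mapping.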

Testing Different Scenarios

Critical Incident (triggers restart)

curl -X POST http://localhost:3000/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "serviceName": "auth-service",
    "severity": "critical",
    "message": "Service unresponsive"
  }'

Warning (triggers scaling)

curl -X POST http://localhost:3000/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "serviceName": "api-gateway",
    "severity": "warning",
    "message": "High latency detected"
  }'

Info (monitoring only, no remediation)

curl -X POST http://localhost:3000/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "serviceName": "cache-service",
    "severity": "info",
    "message": "Cache hit rate below threshold"
  }'

Viewing in Workbench

  1. Open the Workbench UI (URL shown in terminal after npm run dev)
  2. Navigate to the workflow visualization
  3. See the three steps connected by events
  4. Watch real-time execution as incidents flow through

State Management

The remediation step uses Motia's state management:

  • Group ID: remediation-status
  • Key: fix-{serviceName}
  • Values: rebooting, healthy

Check state in logs or via Workbench state inspector.
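
A rough sketch of how this state can make the step resumable and idempotent follows. `StateStore` is a hypothetical in-memory stand-in for Motia's state API, and `remediate`, `doFix`, and the "already remediated" log line are assumptions; only the group ID, key format, status values, and the resume message come from this README.

```typescript
// Sketch of a resumable remediation step. StateStore is a hypothetical
// in-memory stand-in; a real store must persist across restarts.
type Status = "rebooting" | "healthy";

class StateStore {
  private data = new Map<string, Status>();

  async get(groupId: string, key: string): Promise<Status | undefined> {
    return this.data.get(`${groupId}/${key}`);
  }

  async set(groupId: string, key: string, value: Status): Promise<void> {
    this.data.set(`${groupId}/${key}`, value);
  }
}

const GROUP_ID = "remediation-status";

async function remediate(
  state: StateStore,
  serviceName: string,
  doFix: () => Promise<void>, // the slow work, e.g. the 10-second wait
  log: (msg: string) => void,
): Promise<void> {
  const key = `fix-${serviceName}`;
  const status = await state.get(GROUP_ID, key);

  if (status === "healthy") {
    // Idempotent: a repeated event for a finished fix does no extra work.
    log("Already remediated, skipping");
    return;
  }

  if (status === "rebooting") {
    // A previous run wrote the checkpoint but never finished.
    log("Resuming remediation after server restart...");
  } else {
    // Checkpoint before starting the slow work.
    await state.set(GROUP_ID, key, "rebooting");
  }

  await doFix();
  await state.set(GROUP_ID, key, "healthy");
}
```

If `doFix` is interrupted, the stored value stays at `rebooting`, so the next invocation takes the resume branch instead of starting from scratch.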

Next Steps for Hackathon

Potential enhancements:

  • Add real LLM integration (OpenAI, Anthropic)
  • Connect to actual Kubernetes API
  • Add streaming status updates to frontend
  • Implement rollback logic
  • Add approval workflow before remediation
  • Create dashboard for incident history

Troubleshooting

Types not found?

npm run generate-types

Server won't start?

  • Check if port 3000 is available
  • Ensure Redis is running (if using BullMQ)

Steps not executing?

  • Check logs for errors
  • Verify event topic names match between emits and subscribes
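
A mismatched topic name fails silently, so a small check can help. The sketch below uses a hypothetical, simplified `StepTopics` descriptor; real Motia step configs carry more fields, and the helper is not part of the demo.

```typescript
// Hypothetical, simplified step descriptors for illustration only.
interface StepTopics {
  name: string;
  emits: string[];
  subscribes: string[];
}

// Returns "step: topic" entries for subscriptions that no step emits.
function findUnmatchedSubscriptions(steps: StepTopics[]): string[] {
  const emitted = new Set<string>();
  for (const step of steps) {
    for (const topic of step.emits) emitted.add(topic);
  }

  const unmatched: string[] = [];
  for (const step of steps) {
    for (const topic of step.subscribes) {
      if (!emitted.has(topic)) unmatched.push(`${step.name}: ${topic}`);
    }
  }
  return unmatched;
}

const steps: StepTopics[] = [
  { name: "1-ingest", emits: ["incident.detected"], subscribes: [] },
  { name: "2-analyze", emits: ["fix.approved"], subscribes: ["incident.detected"] },
  // Typo on purpose: "fix.aproved" never matches the emitted "fix.approved".
  { name: "3-remediate", emits: [], subscribes: ["fix.aproved"] },
];

// Reports one entry: "3-remediate: fix.aproved"
console.log(findUnmatchedSubscriptions(steps));
```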