-
Notifications
You must be signed in to change notification settings - Fork 458
Open
Description
Problem
During the November 5th, 2025 outage (12:01PM - 12:06PM IST), all tasks were marked unhealthy because /health endpoint started timing out. The current 5-second timeout is too aggressive and caused a cascade failure that made a 5-minute spike into a complete outage.
Current Configuration
Timeout: 5 seconds
Interval: 15 seconds
Unhealthy threshold: 2 consecutive failures
Healthy threshold: 2 consecutive successes
Issue
When the API experienced a request spike, /health responses exceeded 5 seconds. Health checks failed → tasks marked unhealthy → load balancer removed tasks → remaining tasks overloaded → more failures. Cascade effect.
Proposed Change
Option 1: Increase timeout to [10? 15?] seconds to tolerate brief slowdowns without marking tasks unhealthy.
Option 2: Rethink health checks entirely:
- Use passive health checks based on actual request success rates
- Implement graceful degradation instead of binary healthy/unhealthy
- Add circuit breaker logic to prevent cascade failures
Rationale:
- Better to serve slow requests than mark everything dead
- 5s is too tight for real-world load spikes
- Current approach creates cascading failures instead of preventing them
Questions
- What's acceptable
/healthresponse time under load? - Should we separate health check endpoint from main API?
- Should we adjust failure/success thresholds too?
- Can we use passive health monitoring instead?
Metadata
Metadata
Assignees
Labels
No labels