[SaaS] Increase Health Check Timeout to Prevent Cascade Failures #6251

@gagantrivedi

Problem

During the November 5th, 2025 outage (12:01 PM to 12:06 PM IST), all tasks were marked unhealthy because the /health endpoint started timing out. The current 5-second timeout is too aggressive: it caused a cascade failure that turned a 5-minute spike into a complete outage.

Current Configuration

Timeout: 5 seconds
Interval: 15 seconds
Unhealthy threshold: 2 consecutive failures
Healthy threshold: 2 consecutive successes
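
To make the trade-off concrete, here is a rough back-of-the-envelope sketch of how quickly a task gets ejected under these settings. It assumes probes are sent at a fixed interval and that a failing probe only counts once its timeout expires. Real load balancer timing may differ, and `seconds_until_unhealthy` is a hypothetical helper, not part of our tooling.

```python
def seconds_until_unhealthy(interval_s: float, timeout_s: float, unhealthy_threshold: int) -> float:
    """Approximate time from the first failing probe being sent to the task
    being marked unhealthy.

    Assumption: probes start every `interval_s` seconds, and a slow probe is
    counted as failed only when `timeout_s` has elapsed.
    """
    return (unhealthy_threshold - 1) * interval_s + timeout_s


current = seconds_until_unhealthy(interval_s=15, timeout_s=5, unhealthy_threshold=2)
with_10s = seconds_until_unhealthy(interval_s=15, timeout_s=10, unhealthy_threshold=2)
with_15s = seconds_until_unhealthy(interval_s=15, timeout_s=15, unhealthy_threshold=2)
print(current, with_10s, with_15s)
```

Under these assumptions, a slowdown lasting roughly 20 seconds is enough to eject a task today. Raising the timeout alone adds only a few seconds of tolerance, so the unhealthy threshold probably matters at least as much.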

Issue

When the API experienced a request spike, /health responses exceeded the 5-second timeout. This set off a feedback loop: health checks failed → tasks were marked unhealthy → the load balancer removed those tasks → the remaining tasks were overloaded → more health checks failed.
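
The feedback loop can be illustrated with a deliberately simplified toy model. This is not a reconstruction of the incident: the task count, request rate, and per-task capacity below are made-up numbers, and the model assumes load is spread evenly and that one overloaded task is ejected per health-check round.

```python
def simulate_cascade(tasks: int, total_rps: float, capacity_rps: float) -> list[int]:
    """Return the number of in-service tasks after each health-check round.

    Toy assumptions: traffic is split evenly across in-service tasks, and
    whenever per-task load exceeds capacity, one task fails its health check
    and is removed, shifting its share of traffic onto the rest.
    """
    history = [tasks]
    while tasks > 0 and total_rps / tasks > capacity_rps:
        tasks -= 1
        history.append(tasks)
    return history


# Hypothetical numbers: 4 tasks, each able to serve 100 req/s.
print(simulate_cascade(tasks=4, total_rps=400, capacity_rps=100))  # at capacity, stable
print(simulate_cascade(tasks=4, total_rps=450, capacity_rps=100))  # slightly over, collapses
```

In this model a modest overload is enough to drain the whole fleet, because every ejection increases the load on the tasks that remain.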

Proposed Change

Option 1: Increase timeout to [10? 15?] seconds to tolerate brief slowdowns without marking tasks unhealthy.

Option 2: Rethink health checks entirely:

  • Use passive health checks based on actual request success rates
  • Implement graceful degradation instead of binary healthy/unhealthy
  • Add circuit breaker logic to prevent cascade failures
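
A minimal sketch of what Option 2 could look like, assuming we can observe per-task request outcomes somewhere (for example at the load balancer or in middleware). `PassiveHealth`, `tasks_to_eject`, and the thresholds are hypothetical names and values for illustration only. The key idea is the ejection cap: even if every task looks bad, only a fraction of the fleet is removed, so slow tasks keep serving traffic.

```python
from collections import deque


class PassiveHealth:
    """Tracks recent real-request outcomes for one task (hypothetical helper)."""

    def __init__(self, window: int = 100, min_success_rate: float = 0.5):
        self.outcomes = deque(maxlen=window)  # sliding window of True/False results
        self.min_success_rate = min_success_rate

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def success_rate(self) -> float:
        # With no data yet, treat the task as healthy rather than ejecting it.
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def looks_unhealthy(self) -> bool:
        return self.success_rate() < self.min_success_rate


def tasks_to_eject(health: dict[str, PassiveHealth], max_eject_fraction: float = 0.5) -> list[str]:
    """Pick the worst tasks to eject, never exceeding max_eject_fraction of the fleet."""
    limit = int(len(health) * max_eject_fraction)
    unhealthy = [name for name, h in health.items() if h.looks_unhealthy()]
    unhealthy.sort(key=lambda name: health[name].success_rate())  # worst first
    return unhealthy[:limit]


# Usage: every task is failing, but at most half the fleet is ejected.
fleet = {f"task-{i}": PassiveHealth(window=10) for i in range(4)}
for h in fleet.values():
    for _ in range(10):
        h.record(False)
print(tasks_to_eject(fleet))
```

Whether the cap should be 50% or something else is an open question, and it would sit alongside whatever active health check we keep.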

Rationale:

  • Better to serve slow requests than mark everything dead
  • 5s is too tight for real-world load spikes
  • Current approach creates cascading failures instead of preventing them

Questions

  • What is an acceptable /health response time under load?
  • Should we separate the health check endpoint from the main API?
  • Should we adjust failure/success thresholds too?
  • Can we use passive health monitoring instead?
