
[PERFORMANCE]: Add random jitter to scheduled tasks to prevent thundering herd #1780

@crivetimihai

Summary

Multiple periodic tasks and background services in MCP Gateway use fixed intervals without jitter. In multi-node deployments, this creates synchronized workloads where all instances execute the same operations simultaneously, causing:

  • Database lock contention when multiple nodes write metrics/cleanup simultaneously
  • Network storms when all gateways health-check peers at the same instant
  • Redis contention during leader heartbeats and cache operations
  • Upstream service overload from synchronized keepalive traffic
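The synchronization effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: `peak_concurrency` is a hypothetical helper, every node is assumed to start at the same instant (for example after a rolling restart), and the 10-node, 300s figures are chosen for illustration rather than measured from a deployment.

```python
import random
from collections import Counter


def peak_concurrency(nodes: int, base: float, jitter_fraction: float,
                     ticks: int, seed: int = 0) -> int:
    """Return the largest number of nodes that fire within the same 1-second bucket."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    buckets: Counter = Counter()
    for _ in range(nodes):
        t = 0.0  # assumption: all nodes start at the same instant
        for _ in range(ticks):
            # Fixed interval plus an optional random delay of up to base * jitter_fraction
            t += base + base * jitter_fraction * rng.random()
            buckets[int(t)] += 1
    return max(buckets.values())


fixed_peak = peak_concurrency(nodes=10, base=300.0, jitter_fraction=0.0, ticks=5)
jittered_peak = peak_concurrency(nodes=10, base=300.0, jitter_fraction=0.2, ticks=5)
print(f"peak nodes per second without jitter: {fixed_peak}")
print(f"peak nodes per second with 20% jitter: {jittered_peak}")
```

Without jitter, all 10 simulated nodes land in the same second on every tick. With jitter, each tick is spread over a window of up to 60s, so the per-second peak drops.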

Affected Components

High Priority (Database/Network Contention Risk)

| Component | File | Line(s) | Current Interval | Impact |
|---|---|---|---|---|
| Gateway Health Checks | services/gateway_service.py | 3404, 3415, 3427, 3431, 3446 | health_check_interval (300s) | All gateways ping peers simultaneously |
| Metrics Rollup | services/metrics_rollup_service.py | 206-224 | metrics_rollup_interval_hours (1h) | Multi-node DB writes at hour boundary |
| Metrics Cleanup | services/metrics_cleanup_service.py | 175-189 | metrics_cleanup_interval_hours (24h) | Heavy DB deletes at midnight |
| Redis Leader Heartbeat | services/gateway_service.py | 3344 | redis_leader_heartbeat_interval (5s) | Redis contention in multi-node |
| Metrics Buffer Flush | services/metrics_buffer_service.py | 365-378 | metrics_buffer_flush_interval (60s) | Multiple workers flush simultaneously |
| Federation Discovery | federation/discovery.py | 620, 643 | 60s / 300s | Network traffic spikes |

Medium Priority (Connection Keepalives)

| Component | File | Line(s) | Current Interval | Impact |
|---|---|---|---|---|
| WebSocket Ping | transports/websocket_transport.py | 327 | websocket_ping_interval (30s) | Synchronized ping traffic |
| SSE Keepalive | transports/sse_transport.py | 362 | sse_keepalive_interval (30s) | Bursty keepalive traffic |
| Reverse Proxy Keepalive | reverse_proxy.py | 517 | keepalive_interval (30s) | Synchronized upstream traffic |
| Reverse Proxy Router | routers/reverse_proxy.py | 350 | Hardcoded 30s | Keepalive synchronization |

Lower Priority (Cleanup Tasks)

| Component | File | Line(s) | Current Interval | Impact |
|---|---|---|---|---|
| Session Registry DB Cleanup | cache/session_registry.py | 1193, 1200 | 300s / 600s | Minor DB contention |
| Session Registry Memory Cleanup | cache/session_registry.py | 1225, 1232 | 60s / 300s | Memory pressure |
| Resource Cache Cleanup | cache/resource_cache.py | 233 | 60s | Minor memory contention |
| Elicitation Cleanup | services/elicitation_service.py | 226 | 60s | Memory cleanup |

Retry Logic (Already Has Exponential Backoff, Could Add Jitter)

| Component | File | Line(s) | Current Pattern | Enhancement |
|---|---|---|---|---|
| OAuth Token Retry | services/oauth_manager.py | 281 | 2**attempt exponential | Add jitter: 2**attempt * random(0.5, 1.5) |
| Federation Forward Retry | federation/forward.py | 503 | 1 * (attempt + 1) linear | Add jitter: delay * random(0.8, 1.2) |
| LLM Chat Session Lock | routers/llmchat_router.py | 533, 550 | Fixed LOCK_WAIT (0.2s) | Add small jitter to reduce contention |
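A minimal sketch of the two retry enhancements, reading the `random(low, high)` shorthand above as `random.uniform(low, high)`. The function names and default arguments are hypothetical and are not taken from oauth_manager.py or forward.py.

```python
import random


def exponential_backoff_with_jitter(attempt: int, low: float = 0.5, high: float = 1.5) -> float:
    """Delay of 2**attempt seconds, scaled by a random factor in [low, high]."""
    return (2 ** attempt) * random.uniform(low, high)


def linear_backoff_with_jitter(attempt: int, step: float = 1.0,
                               low: float = 0.8, high: float = 1.2) -> float:
    """Delay of step * (attempt + 1) seconds, scaled by a random factor in [low, high]."""
    return step * (attempt + 1) * random.uniform(low, high)


for attempt in range(3):
    print(attempt,
          round(exponential_backoff_with_jitter(attempt), 2),
          round(linear_backoff_with_jitter(attempt), 2))
```

Multiplicative jitter keeps the spread proportional to the delay, so later retries are desynchronized over a wider window than early ones.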

Code Examples

Current Pattern (No Jitter)

# services/gateway_service.py:3404
await asyncio.sleep(self._health_check_interval)

# services/metrics_rollup_service.py:206
interval_seconds = self.rollup_interval_hours * 3600
await asyncio.wait_for(self._shutdown_event.wait(), timeout=interval_seconds)

Proposed Pattern (With Jitter)

import asyncio
import random

def jittered_interval(base: float, jitter_fraction: float = 0.2) -> float:
    """Add random jitter in [0, base * jitter_fraction) to spread workload across a time window."""
    jitter = base * jitter_fraction * random.random()
    return base + jitter

# Usage (inside an async periodic loop)
await asyncio.sleep(jittered_interval(self._health_check_interval, 0.2))
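The same helper could also be applied to the shutdown-aware pattern used by the metrics rollup loop. The following is a sketch, not the services' actual code: `run_periodically` and its parameters are hypothetical names, and `jittered_interval` is repeated only to keep the example self-contained.

```python
import asyncio
import random
from typing import Awaitable, Callable


def jittered_interval(base: float, jitter_fraction: float = 0.2) -> float:
    """Return base plus a random delay in [0, base * jitter_fraction)."""
    return base + base * jitter_fraction * random.random()


async def run_periodically(task: Callable[[], Awaitable[None]],
                           base_interval: float,
                           shutdown_event: asyncio.Event,
                           jitter_fraction: float = 0.2) -> None:
    """Run task repeatedly with a jittered pause until shutdown_event is set."""
    while not shutdown_event.is_set():
        await task()
        timeout = jittered_interval(base_interval, jitter_fraction)
        try:
            # Wake early if shutdown is requested during the pause
            await asyncio.wait_for(shutdown_event.wait(), timeout=timeout)
        except asyncio.TimeoutError:
            continue  # interval elapsed without shutdown; run the task again
```

Drawing a fresh jitter value on every iteration, rather than once at startup, keeps nodes from drifting back into lockstep over time.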

Existing Jitter Support

The codebase already has jitter support in utils/retry_manager.py for HTTP retries:

# config.py:756-760
retry_max_attempts: int = 3
retry_base_delay: float = 1.0
retry_max_delay: int = 60
retry_jitter_max: float = 0.5  # fraction of base delay

# retry_manager.py:304
delay = base + random.uniform(0, jitter_range)

This pattern should be extended to periodic background tasks.

Proposed Configuration

Add new settings to config.py:

# Jitter settings for background tasks
background_task_jitter_fraction: float = Field(
    default=0.2,
    ge=0.0,
    le=0.5,
    description="Jitter fraction (0.0-0.5) to add to background task intervals"
)

Recommended Jitter Percentages by Task Type

| Task Type | Jitter % | Rationale |
|---|---|---|
| Health checks (300s) | 20-30% | 60-90s spread prevents network storms |
| Metrics rollup (3600s) | 10% | 6 min spread across hour boundary |
| Metrics cleanup (86400s) | 5% | ~72 min spread avoids midnight spike |
| Keepalives (30s) | 20% | 6s spread smooths connection traffic |
| Leader heartbeat (5s) | 20% | 1s spread reduces Redis contention |
| Buffer flush (60s) | 25% | 15s spread prevents write bursts |
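Each spread window above is the base interval multiplied by the jitter percentage. A short sketch of that arithmetic follows; the `RECOMMENDATIONS` list and `spread_seconds` helper are illustrative names, with health checks taken at the 20% end of their range.

```python
# (task type, base interval in seconds, jitter percent)
RECOMMENDATIONS = [
    ("Health checks", 300, 20),
    ("Metrics rollup", 3600, 10),
    ("Metrics cleanup", 86400, 5),
    ("Keepalives", 30, 20),
    ("Leader heartbeat", 5, 20),
    ("Buffer flush", 60, 25),
]


def spread_seconds(base_seconds: int, jitter_percent: int) -> float:
    """Width of the window over which executions are spread."""
    return base_seconds * jitter_percent / 100


for name, base, percent in RECOMMENDATIONS:
    spread = spread_seconds(base, percent)
    print(f"{name}: {base}s base, {percent}% jitter, {spread:g}s spread")
```

Note that a single `background_task_jitter_fraction` setting applies one value to every task; per-task percentages like these would need separate settings or per-call overrides.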

Implementation Approach

  1. Create a utility function in mcpgateway/utils/jitter.py
  2. Add background_task_jitter_fraction to settings
  3. Update high-priority services first (metrics, gateway health checks)
  4. Add jitter to keepalive loops
  5. Update retry logic with jitter

Testing Considerations

  • Unit tests should mock random.random() for deterministic behavior
  • Integration tests may need adjusted timing tolerances
  • Performance tests should verify load distribution
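A possible shape for the deterministic unit tests, using `unittest.mock.patch` to pin `random.random()`. This is a sketch: `jittered_interval` is redefined here for self-containment, and in the real suite it would presumably be imported from the proposed mcpgateway/utils/jitter.py, in which case the patch target may need to match how that module imports `random`.

```python
import random
import unittest
from unittest.mock import patch


def jittered_interval(base: float, jitter_fraction: float = 0.2) -> float:
    """Return base plus a random delay in [0, base * jitter_fraction)."""
    return base + base * jitter_fraction * random.random()


class JitteredIntervalTest(unittest.TestCase):
    def test_zero_draw_returns_base(self):
        with patch("random.random", return_value=0.0):
            self.assertEqual(jittered_interval(300.0, 0.2), 300.0)

    def test_midpoint_draw(self):
        with patch("random.random", return_value=0.5):
            self.assertAlmostEqual(jittered_interval(300.0, 0.2), 330.0)

    def test_unmocked_values_stay_in_range(self):
        for _ in range(1000):
            value = jittered_interval(300.0, 0.2)
            self.assertGreaterEqual(value, 300.0)
            self.assertLess(value, 360.0)

    def test_zero_fraction_disables_jitter(self):
        self.assertEqual(jittered_interval(300.0, 0.0), 300.0)
```

The range test complements the mocked cases by checking the bounds with the real generator.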

    Labels

    MUST (P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe), database, performance (Performance related items)