[PERFORMANCE]: Add random jitter to scheduled tasks to prevent thundering herd

## Summary

Multiple periodic tasks and background services in MCP Gateway use fixed intervals without jitter. In multi-node deployments, this creates synchronized workloads where all instances execute the same operations simultaneously, causing:

- **Database lock contention** when multiple nodes write metrics/cleanup simultaneously
- **Network storms** when all gateways health-check peers at the same instant
- **Redis contention** during leader heartbeats and cache operations
- **Upstream service overload** from synchronized keepalive traffic

## Affected Components

### High Priority (Database/Network Contention Risk)

| Component | File | Line(s) | Current Interval | Impact |
|-----------|------|---------|------------------|--------|
| Gateway Health Checks | `services/gateway_service.py` | 3404, 3415, 3427, 3431, 3446 | `health_check_interval` (300s) | All gateways ping peers simultaneously |
| Metrics Rollup | `services/metrics_rollup_service.py` | 206-224 | `metrics_rollup_interval_hours` (1h) | Multi-node DB writes at hour boundary |
| Metrics Cleanup | `services/metrics_cleanup_service.py` | 175-189 | `metrics_cleanup_interval_hours` (24h) | Heavy DB deletes at midnight |
| Redis Leader Heartbeat | `services/gateway_service.py` | 3344 | `redis_leader_heartbeat_interval` (5s) | Redis contention in multi-node |
| Metrics Buffer Flush | `services/metrics_buffer_service.py` | 365-378 | `metrics_buffer_flush_interval` (60s) | Multiple workers flush simultaneously |
| Federation Discovery | `federation/discovery.py` | 620, 643 | 60s / 300s | Network traffic spikes |

### Medium Priority (Connection Keepalives)

| Component | File | Line(s) | Current Interval | Impact |
|-----------|------|---------|------------------|--------|
| WebSocket Ping | `transports/websocket_transport.py` | 327 | `websocket_ping_interval` (30s) | Synchronized ping traffic |
| SSE Keepalive | `transports/sse_transport.py` | 362 | `sse_keepalive_interval` (30s) | Bursty keepalive traffic |
| Reverse Proxy Keepalive | `reverse_proxy.py` | 517 | `keepalive_interval` (30s) | Synchronized upstream traffic |
| Reverse Proxy Router | `routers/reverse_proxy.py` | 350 | Hardcoded 30s | Keepalive synchronization |

### Lower Priority (Cleanup Tasks)

| Component | File | Line(s) | Current Interval | Impact |
|-----------|------|---------|------------------|--------|
| Session Registry DB Cleanup | `cache/session_registry.py` | 1193, 1200 | 300s / 600s | Minor DB contention |
| Session Registry Memory Cleanup | `cache/session_registry.py` | 1225, 1232 | 60s / 300s | Memory pressure |
| Resource Cache Cleanup | `cache/resource_cache.py` | 233 | 60s | Minor memory contention |
| Elicitation Cleanup | `services/elicitation_service.py` | 226 | 60s | Memory cleanup |

### Retry Logic (Already Has Exponential Backoff, Could Add Jitter)

| Component | File | Line(s) | Current Pattern | Enhancement |
|-----------|------|---------|-----------------|-------------|
| OAuth Token Retry | `services/oauth_manager.py` | 281 | `2**attempt` exponential | Add jitter: `2**attempt * random.uniform(0.5, 1.5)` |
| Federation Forward Retry | `federation/forward.py` | 503 | `1 * (attempt + 1)` linear | Add jitter: `delay * random.uniform(0.8, 1.2)` |
| LLM Chat Session Lock | `routers/llmchat_router.py` | 533, 550 | Fixed `LOCK_WAIT` (0.2s) | Add small jitter to reduce contention |
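
The retry enhancements in the table above can be sketched as two small helpers. This is a minimal illustration, not the code in `oauth_manager.py` or `forward.py`: the function names and the `step` parameter are hypothetical, and only the multiplier ranges (0.5 to 1.5 and 0.8 to 1.2) come from the table.

```python
import random


def exponential_backoff_with_jitter(attempt: int, low: float = 0.5, high: float = 1.5) -> float:
    """Return 2**attempt seconds scaled by a random factor in [low, high]."""
    return (2 ** attempt) * random.uniform(low, high)


def linear_backoff_with_jitter(attempt: int, step: float = 1.0, low: float = 0.8, high: float = 1.2) -> float:
    """Return step * (attempt + 1) seconds scaled by a random factor in [low, high]."""
    delay = step * (attempt + 1)
    return delay * random.uniform(low, high)
```

Because the factor is drawn independently on each node and each attempt, clients that failed at the same moment are unlikely to retry at the same moment.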

## Code Examples

### Current Pattern (No Jitter)
```python
# services/gateway_service.py:3404
await asyncio.sleep(self._health_check_interval)

# services/metrics_rollup_service.py:206
interval_seconds = self.rollup_interval_hours * 3600
await asyncio.wait_for(self._shutdown_event.wait(), timeout=interval_seconds)
```

### Proposed Pattern (With Jitter)
```python
import random


def jittered_interval(base: float, jitter_fraction: float = 0.2) -> float:
    """Return base plus a random delay in [0, base * jitter_fraction)."""
    jitter = base * jitter_fraction * random.random()
    return base + jitter


# Usage (inside the health check loop)
await asyncio.sleep(jittered_interval(self._health_check_interval, 0.2))
```
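
The metrics rollup loop waits on a shutdown event instead of sleeping, so the jittered value would go into the `timeout` argument there. The following is a minimal sketch under that assumption; `wait_or_shutdown` is a hypothetical helper name, and `jittered_interval` is repeated only to keep the example self-contained.

```python
import asyncio
import random


def jittered_interval(base: float, jitter_fraction: float = 0.2) -> float:
    """Return base plus a random delay in [0, base * jitter_fraction)."""
    return base + base * jitter_fraction * random.random()


async def wait_or_shutdown(shutdown_event: asyncio.Event, base: float, jitter_fraction: float = 0.2) -> bool:
    """Wait one jittered interval; return True if shutdown was requested first."""
    timeout = jittered_interval(base, jitter_fraction)
    try:
        await asyncio.wait_for(shutdown_event.wait(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # The interval elapsed without a shutdown request: run the next cycle.
        return False
```

A loop can then be written as `while not await wait_or_shutdown(event, interval_seconds): ...`, which keeps shutdown responsive while still spreading the work across nodes.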

## Existing Jitter Support

The codebase already has jitter support in `utils/retry_manager.py` for HTTP retries:

```python
# config.py:756-760
retry_max_attempts: int = 3
retry_base_delay: float = 1.0
retry_max_delay: int = 60
retry_jitter_max: float = 0.5  # fraction of base delay

# retry_manager.py:304
delay = base + random.uniform(0, jitter_range)
```

This pattern should be extended to periodic background tasks.

## Proposed Configuration

Add new settings to `config.py`:

```python
# Jitter settings for background tasks
background_task_jitter_fraction: float = Field(
    default=0.2,
    ge=0.0,
    le=0.5,
    description="Jitter fraction (0.0-0.5) to add to background task intervals"
)
```

## Recommended Jitter Percentages by Task Type

| Task Type | Jitter % | Rationale |
|-----------|----------|-----------|
| Health checks (300s) | 20-30% | 60-90s spread prevents network storms |
| Metrics rollup (3600s) | 10% | 6 min spread across hour boundary |
| Metrics cleanup (86400s) | 5% | ~72 min spread avoids midnight spike |
| Keepalives (30s) | 20% | 6s spread smooths connection traffic |
| Leader heartbeat (5s) | 20% | 1s spread reduces Redis contention |
| Buffer flush (60s) | 25% | 15s spread prevents write bursts |
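
The spread figures in the table follow from multiplying the base interval by the jitter fraction. The sketch below reproduces that arithmetic; the dictionary name and keys are hypothetical, and the health check entry uses the lower bound (20%) of the recommended range.

```python
# Base interval in seconds and jitter fraction per task type (values from the table above).
RECOMMENDED_JITTER = {
    "health_check": (300, 0.20),
    "metrics_rollup": (3600, 0.10),
    "metrics_cleanup": (86400, 0.05),
    "keepalive": (30, 0.20),
    "leader_heartbeat": (5, 0.20),
    "buffer_flush": (60, 0.25),
}


def max_spread_seconds(base: float, fraction: float) -> float:
    """Largest extra delay that jitter can add to one interval."""
    return base * fraction


for name, (base, fraction) in RECOMMENDED_JITTER.items():
    print(f"{name}: up to {max_spread_seconds(base, fraction):.0f}s of added delay")
```

For example, 5% of 86400 s is 4320 s, which is 72 minutes.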

## Implementation Approach

1. Create a utility function in `mcpgateway/utils/jitter.py`
2. Add `background_task_jitter_fraction` to settings
3. Update high-priority services first (metrics, gateway health checks)
4. Add jitter to keepalive loops
5. Update retry logic with jitter

## Testing Considerations

- Unit tests should mock `random.random()` for deterministic behavior
- Integration tests may need adjusted timing tolerances
- Performance tests should verify load distribution
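
A deterministic unit test along these lines could look like the sketch below. It assumes the helper has the signature shown in the proposed pattern and copies it locally so that the example is self-contained; the test class name is hypothetical.

```python
import random
import unittest
from unittest import mock


def jittered_interval(base: float, jitter_fraction: float = 0.2) -> float:
    """Local copy of the proposed helper, kept here so the test runs standalone."""
    return base + base * jitter_fraction * random.random()


class JitteredIntervalTest(unittest.TestCase):
    def test_zero_draw_returns_base(self):
        # With random.random() pinned to 0.0 no jitter is added.
        with mock.patch("random.random", return_value=0.0):
            self.assertEqual(jittered_interval(300.0, 0.2), 300.0)

    def test_half_draw_adds_half_of_the_window(self):
        # A draw of 0.5 adds half of the 60 s jitter window.
        with mock.patch("random.random", return_value=0.5):
            self.assertAlmostEqual(jittered_interval(300.0, 0.2), 330.0)

    def test_unmocked_result_stays_in_bounds(self):
        for _ in range(1000):
            value = jittered_interval(300.0, 0.2)
            self.assertGreaterEqual(value, 300.0)
            self.assertLess(value, 360.0 + 1e-9)
```

Patching `random.random` works here because the helper looks the function up on the `random` module at call time; a helper that does `from random import random` would need the patch applied to its own module instead.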

## References

- [Thundering Herd Problem](https://en.wikipedia.org/wiki/Thundering_herd_problem)
- [Jitter in Distributed Systems](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/)
