## Summary
Multiple periodic tasks and background services in MCP Gateway use fixed intervals without jitter. In multi-node deployments, this creates synchronized workloads where all instances execute the same operations simultaneously, causing:
- Database lock contention when multiple nodes write metrics/cleanup simultaneously
- Network storms when all gateways health-check peers at the same instant
- Redis contention during leader heartbeats and cache operations
- Upstream service overload from synchronized keepalive traffic
## Affected Components

### High Priority (Database/Network Contention Risk)

| Component | File | Line(s) | Current Interval | Impact |
|---|---|---|---|---|
| Gateway Health Checks | `services/gateway_service.py` | 3404, 3415, 3427, 3431, 3446 | `health_check_interval` (300s) | All gateways ping peers simultaneously |
| Metrics Rollup | `services/metrics_rollup_service.py` | 206-224 | `metrics_rollup_interval_hours` (1h) | Multi-node DB writes at hour boundary |
| Metrics Cleanup | `services/metrics_cleanup_service.py` | 175-189 | `metrics_cleanup_interval_hours` (24h) | Heavy DB deletes at midnight |
| Redis Leader Heartbeat | `services/gateway_service.py` | 3344 | `redis_leader_heartbeat_interval` (5s) | Redis contention in multi-node |
| Metrics Buffer Flush | `services/metrics_buffer_service.py` | 365-378 | `metrics_buffer_flush_interval` (60s) | Multiple workers flush simultaneously |
| Federation Discovery | `federation/discovery.py` | 620, 643 | 60s / 300s | Network traffic spikes |
### Medium Priority (Connection Keepalives)

| Component | File | Line(s) | Current Interval | Impact |
|---|---|---|---|---|
| WebSocket Ping | `transports/websocket_transport.py` | 327 | `websocket_ping_interval` (30s) | Synchronized ping traffic |
| SSE Keepalive | `transports/sse_transport.py` | 362 | `sse_keepalive_interval` (30s) | Bursty keepalive traffic |
| Reverse Proxy Keepalive | `reverse_proxy.py` | 517 | `keepalive_interval` (30s) | Synchronized upstream traffic |
| Reverse Proxy Router | `routers/reverse_proxy.py` | 350 | Hardcoded 30s | Keepalive synchronization |
### Lower Priority (Cleanup Tasks)

| Component | File | Line(s) | Current Interval | Impact |
|---|---|---|---|---|
| Session Registry DB Cleanup | `cache/session_registry.py` | 1193, 1200 | 300s / 600s | Minor DB contention |
| Session Registry Memory Cleanup | `cache/session_registry.py` | 1225, 1232 | 60s / 300s | Memory pressure |
| Resource Cache Cleanup | `cache/resource_cache.py` | 233 | 60s | Minor memory contention |
| Elicitation Cleanup | `services/elicitation_service.py` | 226 | 60s | Memory cleanup |
### Retry Logic (Already Has Exponential Backoff, Could Add Jitter)

| Component | File | Line(s) | Current Pattern | Enhancement |
|---|---|---|---|---|
| OAuth Token Retry | `services/oauth_manager.py` | 281 | `2**attempt` exponential | Add jitter: `2**attempt * random(0.5, 1.5)` |
| Federation Forward Retry | `federation/forward.py` | 503 | `1 * (attempt + 1)` linear | Add jitter: `delay * random(0.8, 1.2)` |
| LLM Chat Session Lock | `routers/llmchat_router.py` | 533, 550 | Fixed `LOCK_WAIT` (0.2s) | Add small jitter to reduce contention |
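The enhancement column for the OAuth retry can be sketched as a small helper. `backoff_with_jitter` is a hypothetical name (not from the codebase); the 0.5-1.5 multiplier matches the proposed `2**attempt * random(0.5, 1.5)` form, with the existing `retry_max_delay` cap applied before jitter:

```python
import random


def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with multiplicative jitter.

    Computes base * 2**attempt, caps it, then scales by a random
    factor in [0.5, 1.5) so concurrent retries spread out in time.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(0.5, 1.5)
```

Multiplicative jitter keeps the exponential shape (later attempts still wait longer on average) while guaranteeing two nodes almost never retry at the same instant.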
## Code Examples

### Current Pattern (No Jitter)

```python
# services/gateway_service.py:3404
await asyncio.sleep(self._health_check_interval)

# services/metrics_rollup_service.py:206
interval_seconds = self.rollup_interval_hours * 3600
await asyncio.wait_for(self._shutdown_event.wait(), timeout=interval_seconds)
```
### Proposed Pattern (With Jitter)

```python
import random


def jittered_interval(base: float, jitter_fraction: float = 0.2) -> float:
    """Add random jitter to spread workload across a time window."""
    jitter = base * jitter_fraction * random.random()
    return base + jitter


# Usage
await asyncio.sleep(jittered_interval(self._health_check_interval, 0.2))
```
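Services that wait on a shutdown event rather than sleeping (as in the metrics rollup loop) can feed the jittered value into the `wait_for` timeout instead. A self-contained sketch; `rollup_loop` is a hypothetical stand-in for the real service loop, and the helper is repeated so the snippet runs on its own:

```python
import asyncio
import random


def jittered_interval(base: float, jitter_fraction: float = 0.2) -> float:
    """Add random jitter to spread workload across a time window."""
    return base + base * jitter_fraction * random.random()


async def rollup_loop(shutdown_event: asyncio.Event, interval_seconds: float) -> None:
    """Hypothetical periodic loop: wakes early on shutdown, jitters each cycle."""
    while not shutdown_event.is_set():
        try:
            await asyncio.wait_for(
                shutdown_event.wait(),
                timeout=jittered_interval(interval_seconds, 0.1),
            )
        except asyncio.TimeoutError:
            pass  # Timeout means the interval elapsed: run the rollup here.
```

Because the jitter is recomputed on every iteration, nodes that happen to align in one cycle drift apart again in the next, rather than staying synchronized.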
### Existing Jitter Support

The codebase already has jitter support in `utils/retry_manager.py` for HTTP retries:

```python
# config.py:756-760
retry_max_attempts: int = 3
retry_base_delay: float = 1.0
retry_max_delay: int = 60
retry_jitter_max: float = 0.5  # fraction of base delay

# retry_manager.py:304
delay = base + random.uniform(0, jitter_range)
```

This pattern should be extended to periodic background tasks.
## Proposed Configuration

Add new settings to `config.py`:

```python
# Jitter settings for background tasks
background_task_jitter_fraction: float = Field(
    default=0.2,
    ge=0.0,
    le=0.5,
    description="Jitter fraction (0.0-0.5) to add to background task intervals",
)
```
### Recommended Jitter Percentages by Task Type

| Task Type | Jitter % | Rationale |
|---|---|---|
| Health checks (300s) | 20-30% | 60-90s spread prevents network storms |
| Metrics rollup (3600s) | 10% | 6 min spread across hour boundary |
| Metrics cleanup (86400s) | 5% | ~1 hour spread avoids midnight spike |
| Keepalives (30s) | 20% | 6s spread smooths connection traffic |
| Leader heartbeat (5s) | 20% | 1s spread reduces Redis contention |
| Buffer flush (60s) | 25% | 15s spread prevents write bursts |
## Implementation Approach

1. Create a utility function in `mcpgateway/utils/jitter.py`
2. Add `background_task_jitter_fraction` to settings
3. Update high-priority services first (metrics, gateway health checks)
4. Add jitter to keepalive loops
5. Update retry logic with jitter
## Testing Considerations

- Unit tests should mock `random.random()` for deterministic behavior
- Integration tests may need adjusted timing tolerances
- Performance tests should verify load distribution
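The first point can be sketched as pytest-style tests, patching `random.random` for a deterministic result. The helper is inlined here so the snippet stands alone; in practice it would be imported from the proposed `mcpgateway/utils/jitter.py`:

```python
import random
from unittest import mock


def jittered_interval(base: float, jitter_fraction: float = 0.2) -> float:
    """Inlined copy of the proposed helper, for a self-contained test."""
    return base + base * jitter_fraction * random.random()


def test_jitter_is_deterministic_when_random_is_patched() -> None:
    # Pin random.random() to 0.5: interval is exactly base * (1 + fraction * 0.5).
    with mock.patch("random.random", return_value=0.5):
        assert jittered_interval(300.0, 0.2) == 330.0


def test_jitter_stays_within_bounds() -> None:
    # Unpatched: the result must always land in [base, base * (1 + fraction)].
    for _ in range(1000):
        value = jittered_interval(60.0, 0.25)
        assert 60.0 <= value <= 75.0
```

Patching at the `random` module level keeps the helper's production code path unchanged while making every assertion exact.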