
[BUG]: DB_POOL_SIZE and DB_MAX_OVERFLOW are per-worker, not per-gateway — actual connections = workers × (pool_size + max_overflow) #4645

@madhav165

Description


Summary

Setting DB_POOL_SIZE=15 and DB_MAX_OVERFLOW=30 results in the gateway establishing ~200 database connections instead of the expected ~45. This is because each gunicorn worker gets its own independent connection pool, and additionally the observability pattern opens many simultaneous sessions per request.

Root Cause

Two compounding factors make the actual connection count unintuitive:

1. Per-worker pool multiplication (architecture)

With gunicorn multi-worker deployment:

  • Each forked worker gets its own independent copy of the SQLAlchemy QueuePool (via OS copy-on-write semantics after fork)
  • Pool size limits are enforced per-worker, not per-gateway
  • run-gunicorn.sh auto-detects workers: min(2×CPU+1, 16)

Actual max connections = GUNICORN_WORKERS × (DB_POOL_SIZE + DB_MAX_OVERFLOW)

| Workers | DB_POOL_SIZE | DB_MAX_OVERFLOW | Expected (intuitive) | Actual Max |
|---------|--------------|-----------------|----------------------|------------|
| 4       | 15           | 30              | 45                   | 180        |
| 5       | 15           | 30              | 45                   | 225        |
| 16      | 15           | 30              | 45                   | 720        |
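The formula is easy to sanity-check in a few lines. A minimal sketch — the `min(2×CPU+1, 16)` default mirrors run-gunicorn.sh's auto-detection, but the function name is illustrative:

```python
import os

def max_db_connections(pool_size, max_overflow, workers=None):
    """Worst-case connections a multi-worker gateway can open to PostgreSQL."""
    if workers is None:
        # Mirrors run-gunicorn.sh's auto-detection: min(2*CPU + 1, 16)
        workers = min(2 * (os.cpu_count() or 1) + 1, 16)
    return workers * (pool_size + max_overflow)

print(max_db_connections(15, 30, workers=4))   # 180
print(max_db_connections(15, 30, workers=16))  # 720
```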

2. Observability session proliferation (amplifier)

When OBSERVABILITY_ENABLED=true, each traced request creates 4-6 independent sessions via _get_or_create_observability_session() in observability_service.py:

  • start_trace() → 1 session
  • start_span() → 1 session
  • end_span() → 1 session
  • end_trace() → 1 session
  • Optional: add_event(), record_metric() → 1-2 more

Additionally, SQL instrumentation (instrumentation/sqlalchemy.py) opens 3 sessions per SQL query (start_span + end_span + duration update). A single request running 10 queries → 30 additional sessions from instrumentation alone.

Under even modest concurrency (3-5 concurrent requests per worker), a single worker can saturate all 15 pooled connections and exhaust all 30 overflow connections.
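A back-of-envelope estimate makes the amplification concrete. This sketch simply encodes the per-call counts listed above; the function name is hypothetical:

```python
def sessions_per_request(sql_queries, optional_events=0):
    """Rough count of observability sessions one traced request opens."""
    # start_trace, start_span, end_span, end_trace (+ add_event/record_metric)
    lifecycle = 4 + optional_events
    # start_span + end_span + duration update per instrumented SQL query
    instrumentation = 3 * sql_queries
    return lifecycle + instrumentation

# A single request running 10 queries opens 34 sessions -- more than an
# entire worker's pool_size=15 before overflow even kicks in.
print(sessions_per_request(sql_queries=10))  # 34
```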

3. Missing engine.dispose() in post_fork (minor)

gunicorn.config.py:post_fork() resets Redis but does NOT call engine.dispose(). When preload_app=True (Linux default), the parent process creates pool connections that become invalid in children. SQLAlchemy handles this gracefully (invalidates stale connections by PID check), but calling dispose() would be cleaner and follow SQLAlchemy's official multi-process guidance.
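The missing hook is essentially a one-liner. A minimal sketch, assuming the module-level engine is importable from mcpgateway/db.py as listed in the affected files; the surrounding hook body in gunicorn.config.py will differ:

```python
# gunicorn.config.py (sketch) -- discard the pool inherited across fork.

def post_fork(server, worker):
    from mcpgateway.db import engine  # module-level engine created before fork

    # Drop connections the worker inherited from the parent process.
    # close=False leaves the parent's sockets untouched; each worker then
    # lazily opens its own connections. This is SQLAlchemy's documented
    # pattern for using one Engine across os.fork / gunicorn preload_app.
    engine.dispose(close=False)
```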

Reproduction

# Set small pool size
export DB_POOL_SIZE=15
export DB_MAX_OVERFLOW=30
export OBSERVABILITY_ENABLED=true

# Start with multiple workers
make serve  # auto-detects workers

# Check PostgreSQL connection count
psql -c "SELECT count(*) FROM pg_stat_activity WHERE application_name LIKE '%mcpgateway%';"
# Shows ~150-200+ instead of expected 45

Proposed Fixes

  1. Add startup log warning showing the actual max connection formula:

    WARNING: Total max DB connections = workers(N) × (pool_size + max_overflow) = N × 45 = M
    Ensure PostgreSQL max_connections >= M
    
  2. Reduce observability session proliferation: Batch start_trace/start_span/end_span/end_trace into fewer sessions (reuse a single session per observability lifecycle)

  3. Fix SQL instrumentation: _write_span_to_db() opens 3 sessions — consolidate into 1

  4. Add engine.dispose() in post_fork for clean pool separation per worker (SQLAlchemy best practice)

  5. Consider a dedicated smaller pool for observability writes or rate-limit observability session creation when pool is near capacity
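Proposed fix 1 is cheap to implement. A minimal sketch of the startup warning, assuming a logger is already configured; `warn_total_pool_capacity` is a hypothetical name:

```python
import logging

logger = logging.getLogger("mcpgateway")

def warn_total_pool_capacity(workers, pool_size, max_overflow):
    """Log the real worst-case DB connection count at startup (proposed fix 1)."""
    total = workers * (pool_size + max_overflow)
    logger.warning(
        "Total max DB connections = workers(%d) x (pool_size %d + max_overflow %d) = %d; "
        "ensure PostgreSQL max_connections >= %d",
        workers, pool_size, max_overflow, total, total,
    )
    return total

warn_total_pool_capacity(workers=4, pool_size=15, max_overflow=30)  # logs 180
```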

Affected Files

  • mcpgateway/config.py:1920-1921 — default pool size settings
  • mcpgateway/db.py:226 — module-level engine creation
  • mcpgateway/services/observability_service.py:192-218 — independent session creation
  • mcpgateway/instrumentation/sqlalchemy.py:43-88 — 3 sessions per SQL query span
  • gunicorn.config.py:118-128 — post_fork missing engine.dispose()
  • run-gunicorn.sh:235-248 — worker count auto-detection

Environment

  • SQLAlchemy >= 2.0.49
  • Gunicorn with UvicornWorker
  • PostgreSQL backend with QueuePool (non-PgBouncer)


Labels: bug, performance, triage
