[BUG]: Observability session proliferation still creates 4-6 sessions per traced request #5072

@bogdanmariusc10

🐞 Bug Summary

PR #4696 partially addressed the database connection pool multiplication issue (#4645) by optimizing SQL instrumentation from 3 sessions to 1 per query. However, the main observability lifecycle still creates 4-6 independent sessions per traced request via _get_or_create_observability_session(), which can saturate connection pools under modest concurrency.

Impact: A traced request with 10 SQL queries now uses 14-16 sessions (down from 34-36), a reduction of roughly 56-59%. However, the observability lifecycle itself (start_trace, end_trace, start_span, end_span) still accounts for 4-6 of those sessions per request and was not optimized in PR #4696.
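The arithmetic behind these figures can be checked with a short sketch. It assumes, as the summary states, 3 sessions per SQL query before PR #4696 and 1 after, plus 4-6 lifecycle sessions; the function names are illustrative only and do not exist in the codebase.

```python
def sessions_per_request(sql_queries: int, sessions_per_query: int, lifecycle_sessions: int) -> int:
    """Total sessions = SQL instrumentation sessions + observability lifecycle sessions."""
    return sql_queries * sessions_per_query + lifecycle_sessions


def reduction_pct(before: int, after: int) -> float:
    """Percentage drop from `before` to `after`, rounded to one decimal place."""
    return round(100 * (before - after) / before, 1)


QUERIES = 10
for lifecycle in (4, 6):
    before = sessions_per_request(QUERIES, 3, lifecycle)  # before PR #4696
    after = sessions_per_request(QUERIES, 1, lifecycle)   # after PR #4696
    print(f"lifecycle={lifecycle}: {before} -> {after} sessions "
          f"({reduction_pct(before, after)}% fewer)")
# lifecycle=4: 34 -> 14 sessions (58.8% fewer)
# lifecycle=6: 36 -> 16 sessions (55.6% fewer)
```

With the lifecycle reduced to a single session, the same request would need 10 × 1 + 1 = 11 sessions under these assumptions.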


🧩 Affected Component

  • mcpgateway - API
  • mcpgateway - UI (admin panel)
  • mcpgateway.wrapper - stdio wrapper
  • Federation or Transports
  • CLI, Makefiles, or shell scripts
  • Container setup (Docker/Podman/Compose)
  • Other (explain below)

🔁 Steps to Reproduce

  1. Enable observability:
       export OBSERVABILITY_ENABLED=true
       export DB_POOL_SIZE=15
       export DB_MAX_OVERFLOW=30
  2. Start gateway with multiple workers:
       make serve  # auto-detects workers
  3. Send traced requests with moderate concurrency (3-5 concurrent requests per worker)
  4. Monitor database connections:
       psql -c "SELECT count(*) FROM pg_stat_activity WHERE application_name LIKE '%mcpgateway%';"
  5. Observe that the connection count approaches pool limits even under modest load, due to 4-6 sessions per traced request

🤔 Expected Behavior

Each traced request should reuse a single database session for its entire observability lifecycle (start_trace → start_span → end_span → end_trace), rather than creating 4-6 independent sessions.

Target: Reduce from 4-6 sessions per traced request to 1 session per traced request.
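A minimal sketch of the intended behavior, under stated assumptions: `FakeSession`, `TracingSketch`, and `observability_session()` are hypothetical stand-ins written for this illustration, not the real service or its session factory. Only the `obs_db` parameter name comes from this issue. The sketch counts how many sessions one lifecycle opens with and without reuse.

```python
from contextlib import contextmanager
from typing import Iterator, List, Optional


class FakeSession:
    """Stand-in for a database session; counts how many were opened."""

    opened = 0

    def __init__(self) -> None:
        FakeSession.opened += 1
        self.writes: List[str] = []
        self.closed = False

    def add(self, record: str) -> None:
        self.writes.append(record)

    def close(self) -> None:
        self.closed = True


class TracingSketch:
    """Hypothetical, simplified model of the four lifecycle methods."""

    def _write(self, record: str, obs_db: Optional[FakeSession]) -> None:
        if obs_db is not None:
            obs_db.add(record)  # reuse the caller's session
            return
        session = FakeSession()  # fallback: one independent session per call
        try:
            session.add(record)
        finally:
            session.close()

    def start_trace(self, obs_db: Optional[FakeSession] = None) -> None:
        self._write("start_trace", obs_db)

    def start_span(self, obs_db: Optional[FakeSession] = None) -> None:
        self._write("start_span", obs_db)

    def end_span(self, obs_db: Optional[FakeSession] = None) -> None:
        self._write("end_span", obs_db)

    def end_trace(self, obs_db: Optional[FakeSession] = None) -> None:
        self._write("end_trace", obs_db)


@contextmanager
def observability_session() -> Iterator[FakeSession]:
    """Open one session for the whole traced request and close it at the end."""
    session = FakeSession()
    try:
        yield session
    finally:
        session.close()


def count_sessions(reuse: bool) -> int:
    """Run one full lifecycle and return how many sessions it opened."""
    FakeSession.opened = 0
    svc = TracingSketch()
    if reuse:
        with observability_session() as db:
            for step in (svc.start_trace, svc.start_span, svc.end_span, svc.end_trace):
                step(obs_db=db)
    else:
        for step in (svc.start_trace, svc.start_span, svc.end_span, svc.end_trace):
            step()
    return FakeSession.opened


print(count_sessions(reuse=False))  # 4
print(count_sessions(reuse=True))   # 1
```

Keeping `obs_db` optional lets existing callers keep working while the request-level code path opts in to the shared session.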


📓 Logs / Error Output

Under load, workers may log pool saturation warnings:

WARNING: QueuePool limit of size 15 overflow 30 reached, connection timed out

Current session creation pattern in mcpgateway/services/observability_service.py:

  • start_trace() → creates 1 independent session
  • start_span() → creates 1 independent session
  • end_span() → creates 1 independent session
  • end_trace() → creates 1 independent session
  • Optional: add_event(), record_metric() → 1-2 more sessions

🧠 Environment Info

Key | Value
Version or commit | v1.0.2
Runtime | Python 3.11+, Gunicorn
Platform / OS | macOS
Container | none

🧩 Additional Context

Related Issues:

Root Cause:
The observability service uses _get_or_create_observability_session() which creates independent sessions for each operation. While PR #4696 added session reuse infrastructure via the obs_db parameter, the main observability lifecycle methods (start_trace, end_trace, start_span, end_span) were not updated to use this pattern.

Proposed Solutions:

  1. Implement session reuse for observability lifecycle (High Priority):

    • Modify start_trace(), end_trace(), start_span(), end_span() to accept and reuse an obs_db session parameter
    • Create a single session at trace start and pass it through the entire lifecycle
    • Target: 4-6 sessions → 1 session per traced request
  2. Add batching for observability operations (High Priority):

    • Batch multiple observability writes into a single transaction
    • Reduce commit overhead and session churn
  3. Consider dedicated observability pool (Nice to Have):

  4. Add performance validation (High Priority):

    • Create load test verifying actual session count under concurrency
    • Measure pool saturation before/after optimization
    • Validate session reduction claims
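The batching idea in solution 2 could look roughly like the sketch below. It uses the standard-library `sqlite3` module purely so the example is self-contained; the `ObservabilityBatch` class and the `obs_events` table are hypothetical and are not part of mcpgateway. The point is only that buffered records are written with one `executemany` call inside one transaction, rather than one session and commit per operation.

```python
import sqlite3
from typing import List, Tuple


class ObservabilityBatch:
    """Buffers observability records and writes them in a single transaction."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._pending: List[Tuple[str, str]] = []

    def record(self, kind: str, name: str) -> None:
        """Queue a record in memory; nothing touches the database yet."""
        self._pending.append((kind, name))

    def flush(self) -> int:
        """Write all queued records with one commit; return how many were written."""
        if not self._pending:
            return 0
        # `with conn:` commits on success and rolls back if an exception is raised.
        with self._conn:
            self._conn.executemany(
                "INSERT INTO obs_events (kind, name) VALUES (?, ?)",
                self._pending,
            )
        written = len(self._pending)
        self._pending.clear()
        return written


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs_events (kind TEXT, name TEXT)")  # hypothetical table

batch = ObservabilityBatch(conn)
batch.record("start_trace", "GET /tools")
batch.record("start_span", "db.query")
batch.record("end_span", "db.query")
batch.record("end_trace", "GET /tools")

written = batch.flush()
row_count = conn.execute("SELECT count(*) FROM obs_events").fetchone()[0]
print(written, row_count)  # 4 4
```

One trade-off to weigh: records buffered in memory are lost if the worker dies before `flush()`, so the flush point (for example, at trace end) matters.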

Affected Files:

  • mcpgateway/services/observability_service.py:192-218 - Independent session creation
  • mcpgateway/instrumentation/sqlalchemy.py - Already optimized in PR #4696, "fix(db): resolve database connection pool multiplication" (reference implementation)
  • tests/unit/mcpgateway/services/test_observability_service.py - Needs new tests for session reuse

Labels: api, bug, ica, observability, performance
