Skip to content

fix(observability): reduce session proliferation from 4-6 to 1 per request#5073

Open
bogdanmariusc10 wants to merge 2 commits into
mainfrom
5072-bug-observability-session-proliferation-still-creates-4-6-sessions-per-traced-request
Open

fix(observability): reduce session proliferation from 4-6 to 1 per request#5073
bogdanmariusc10 wants to merge 2 commits into
mainfrom
5072-bug-observability-session-proliferation-still-creates-4-6-sessions-per-traced-request

Conversation

@bogdanmariusc10
Copy link
Copy Markdown
Collaborator

@bogdanmariusc10 bogdanmariusc10 commented Jun 5, 2026

🔗 Related Issue

Closes #5072


📝 Summary

This PR fixes the observability session proliferation bug where each traced request was creating 4-6 independent database sessions, causing connection pool saturation under moderate load.

Problem: After PR #4696 optimized SQL instrumentation, the main observability lifecycle (start_trace → start_span → end_span → end_trace) still created 4-6 sessions per request by calling _get_or_create_observability_session() independently for each operation.

Solution: Implemented session reuse pattern where ObservabilityMiddleware creates a single database session at trace start and passes it through the entire lifecycle. All operations use commit=False for batching, with a final atomic commit=True in end_trace().

Impact:

  • Before: 4-6 database sessions per traced request
  • After: 1 database session per traced request
  • Reduction: 75-83% fewer sessions
  • Result: Prevents connection pool saturation, improves transaction consistency through atomic batching

🏷️ Type of Change

  • Bug fix
  • Feature / Enhancement
  • Documentation
  • Refactor
  • Chore (deps, CI, tooling)
  • Other (describe below)

🧪 Verification

Check Command Status
Lint suite make lint ✅ Pass
Unit tests make test ✅ 102 observability tests pass
Coverage ≥ 80% make coverage ✅ Covered

Test Results:

# Session reuse tests
uv run pytest tests/unit/mcpgateway/services/test_observability_session_reuse.py -v
# Result: 8 passed

# All observability tests
uv run pytest tests/unit/mcpgateway/services/test_observability_service.py tests/unit/mcpgateway/middleware/test_observability_middleware.py -v
# Result: 102 passed

✅ Checklist

  • Code formatted (make black isort pre-commit)
  • Tests added/updated for changes
  • Documentation updated (if applicable)
  • No secrets or credentials committed

📓 Notes

Changes Made

1. Service Layer (mcpgateway/services/observability_service.py):

2. Middleware Layer (mcpgateway/middleware/observability_middleware.py):

  • Rewrote dispatch() method (lines 107-309)
  • Creates single SessionLocal() session at line 161
  • Passes session through entire lifecycle:
    • start_trace(commit=False, obs_db=obs_db) - lines 164-175
    • start_span(commit=False, obs_db=obs_db) - lines 191-196
    • end_span(commit=False, obs_db=obs_db) - lines 227-234
    • end_trace(commit=True, obs_db=obs_db) - lines 240-249 (atomic commit)
  • Proper error handling with session cleanup in finally block (lines 304-309)

3. Test Coverage

Session Reuse Tests (tests/unit/mcpgateway/services/test_observability_session_reuse.py):

  • test_start_trace_with_session_reuse - Verifies start_trace accepts obs_db parameter and doesn't create new session
  • test_start_trace_without_session_creates_own - Verifies backward compatibility when obs_db=None
  • test_end_trace_with_session_reuse - Verifies end_trace accepts obs_db parameter and doesn't create new session
  • test_end_trace_without_session_creates_own - Verifies backward compatibility when obs_db=None
  • test_start_span_with_session_reuse - Verifies start_span session reuse (from PR fix(db): resolve database connection pool multiplication #4696)
  • test_end_span_with_session_reuse - Verifies end_span session reuse (from PR fix(db): resolve database connection pool multiplication #4696)
  • test_full_trace_lifecycle_with_single_session - Critical test: Verifies complete lifecycle (start_trace → start_span → end_span → end_trace) uses only 1 session with single atomic commit
  • test_add_event_with_session_reuse - Verifies event handling with session reuse

Middleware Tests (tests/unit/mcpgateway/middleware/test_observability_middleware.py):

  • test_dispatch_disabled - Verifies middleware skips when disabled
  • test_dispatch_health_check_skipped - Verifies health check paths are skipped
  • test_dispatch_trace_setup_success - Verifies successful trace lifecycle
  • test_dispatch_trace_setup_failure - Verifies graceful handling of trace setup failures
  • test_dispatch_exception_during_request - Verifies trace completion even when request fails
  • test_dispatch_trace_setup_cleanup_close_failure_logs_debug - Verifies debug logging for cleanup failures
  • test_dispatch_end_span_failure_logs_warning - Verifies warning logging for end_span failures
  • test_dispatch_session_close_failure_logs_debug - Verifies debug logging for session close failures in finally block
  • test_dispatch_end_trace_failure_logs_warning - Verifies warning logging for end_trace failures
  • test_dispatch_trace_setup_failure_with_session_close_failure - New: Verifies graceful handling when both trace setup AND session close fail (covers lines 213-214)
  • test_dispatch_with_user_context - Verifies user context extraction
  • test_dispatch_with_traceparent_header - Verifies W3C trace context propagation
  • test_dispatch_response_headers - Verifies trace ID in response headers
  • test_dispatch_status_code_handling - Verifies status code tracking

Test Results:

# Session reuse tests
uv run pytest tests/unit/mcpgateway/services/test_observability_session_reuse.py -v
# Result: 8 passed

# Middleware tests
uv run pytest tests/unit/mcpgateway/middleware/test_observability_middleware.py -v
# Result: 14 passed (including new test)

# All observability tests
uv run pytest tests/unit/mcpgateway/services/test_observability_service.py tests/unit/mcpgateway/middleware/test_observability_middleware.py -v
# Result: 102 passed

Coverage: The test suite now provides comprehensive coverage including:

  • Session reuse for all lifecycle methods
  • Backward compatibility when obs_db=None
  • Full trace lifecycle with single session
  • Atomic commit behavior (commit=False for batching, commit=True for final)
  • Session ownership (caller-owned sessions not closed by service)
  • Event handling with session reuse
  • Error handling edge cases (trace setup failure, session close failure, both failing simultaneously)
  • W3C trace context propagation
  • User context extraction
  • Response header injection

Backward Compatibility

The fix maintains full backward compatibility:

  • When obs_db=None (default), methods create their own independent sessions (original behavior)
  • When obs_db is provided, methods reuse the provided session (new optimized behavior)
  • Existing code continues to work without modifications

Design Decision: Atomic Batching

All observability operations within a trace lifecycle are batched into a single transaction with a final atomic commit. This provides:

Connection Pool Sizing Guidance

With this fix, the observability lifecycle uses 1 session per traced request (down from 4-6). Default pool configuration (DB_POOL_SIZE=200, DB_MAX_OVERFLOW=10) now supports ~200 concurrent traced requests instead of ~35-50, a 4-5x improvement in capacity.

…quest

Resolves #5072

- Modified start_trace() and end_trace() to accept obs_db and commit parameters
- Updated ObservabilityMiddleware to create single session for entire trace lifecycle
- All operations (start_trace, start_span, end_span, end_trace) now reuse same session
- Final end_trace() performs atomic commit of all batched operations
- Added 8 comprehensive unit tests for session reuse behavior
- Maintains backward compatibility when obs_db parameter not provided

Impact:
- Reduces database sessions from 4-6 to 1 per traced request (75-83% reduction)
- Prevents connection pool saturation under moderate concurrency
- Atomic batching improves transaction consistency

Test coverage:
- 8 new session reuse tests in test_observability_session_reuse.py
- All 102 observability tests pass
- Validates full lifecycle, backward compatibility, and error handling

Signed-off-by: Bogdan-Marius-Catanus <bogdan-marius.catanus@ibm.com>
@bogdanmariusc10 bogdanmariusc10 added the bug Something isn't working label Jun 5, 2026
@bogdanmariusc10 bogdanmariusc10 added the performance Performance related items label Jun 5, 2026
@bogdanmariusc10 bogdanmariusc10 added observability Observability, logging, monitoring ica ICA related issues api REST API Related item labels Jun 5, 2026
Signed-off-by: Bogdan-Marius-Catanus <bogdan-marius.catanus@ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api REST API Related item bug Something isn't working ica ICA related issues observability Observability, logging, monitoring performance Performance related items

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: Observability session proliferation still creates 4-6 sessions per traced request

1 participant