Skip to content

[BUG][OBSERVABILITY]: ObservabilityService _safe_commit bypasses main transaction control, causing orphaned records on late failures #3883

@MohanLaksh

Description

@MohanLaksh

🔗 Related Issues

Follow-up to #3813 (transaction control fix)
Related to #3731 (transaction management violation)


📝 Summary

Problem: ObservabilityService._safe_commit() eagerly commits observability data (traces, spans, events) independently of the main request transaction managed by get_db(). This creates orphaned observability records when errors occur after observability commits but before the main transaction completes.

Impact: Data consistency issue in observability records, particularly for streaming endpoints where late failures can occur after observability data has been committed.


🐞 Bug Details

Current Behavior

In mcpgateway/services/observability_service.py, the _safe_commit() method is called by:

  • start_trace() - line 276
  • end_trace() - line 324
  • start_span() - line 409 (when commit=True)
  • end_span() - line 452 (when commit=True)
  • add_event() - line 636
  • record_token_usage() - line 711
  • record_metric() - line 1048
  • delete_old_traces() - line 1434

Each call to _safe_commit() commits the session immediately, independent of the main request transaction.

The Problem

Late-Failure Streaming Edge Case:

1. Request arrives → observability_middleware starts trace
2. _safe_commit() commits trace to DB ✅
3. Handler begins processing (e.g., streaming response)
4. Error occurs AFTER observability commit
5. get_db() rolls back main transaction ❌
6. Observability records remain in DB → orphaned data

Example Scenario:

# Middleware
start_trace(db, trace_id, ...)  
_safe_commit(db, "start_trace")  # ← Commits observability data

# Later in streaming handler
async def stream_response():
    yield chunk1  # Success
    yield chunk2  # Success  
    # Error occurs here
    raise ValueError("Stream failed")  # Main transaction rolls back

# Result: Trace shows "success" but request actually failed

Root Cause

Observability commits are eager and independent of main transaction control. The PR #3813 note acknowledged this:

"The observability service methods call _safe_commit() which commits the session eagerly. This is pre-existing behavior (untouched by this PR) and does not undermine the fix — for the error case, get_db() rolls back before _safe_commit runs on end_span/end_trace. The late-failure streaming edge case is pre-existing and out of scope for this focused bug fix."


🎯 Expected Behavior

Observability data should be committed atomically with the main request transaction to ensure consistency.


💡 Potential Solutions

Option 1: Remove Eager Commits (Align with Transaction Boundary)

Remove _safe_commit() calls from observability methods and let get_db() handle all commits:

# In observability_service.py
def start_trace(...):
    trace = ObservabilityTrace(...)
    db.add(trace)
    # NO COMMIT - let get_db() handle it

def end_trace(...):
    trace.end_time = utc_now()
    # NO COMMIT - let get_db() handle it

Pros:

Cons:

  • Observability data lost if main transaction rolls back
  • Could lose visibility into partial failures

Option 2: Separate Observability Session (Best-Effort Pattern)

Use a separate database session for observability to maintain independence:

def _get_observability_session() -> Session:
    """Create independent session for observability (best-effort)."""
    return SessionLocal()

def start_trace(...):
    with _get_observability_session() as obs_db:
        trace = ObservabilityTrace(...)
        obs_db.add(trace)
        obs_db.commit()  # Independent commit

Pros:

Cons:

  • More complex
  • Requires careful session management

Option 3: Deferred Observability Writes (Queue-Based)

Queue observability writes and commit them after main transaction:

def start_trace(...):
    trace = ObservabilityTrace(...)
    request_state.observability_queue.append(trace)
    # Commit in middleware AFTER get_db() commits

# In middleware after get_db() commits
for item in request_state.observability_queue:
    obs_db.add(item)
obs_db.commit()

Pros:

  • Ensures main transaction completes first
  • Preserves observability visibility

Cons:

  • Most complex solution
  • Requires queue infrastructure

🔍 Suggested Approach

Recommendation: Start with Option 1 (remove eager commits) as it's the simplest and most consistent with the transaction control fix from #3813.

If we determine that preserving observability visibility for partial failures is critical, evaluate Option 2 (separate session) as a follow-up, particularly in the context of #1952 (4-database architecture).


📋 Acceptance Criteria

  • _safe_commit() calls removed from observability methods OR separate session pattern implemented
  • Observability data commits aligned with main transaction boundary
  • No orphaned observability records on request failures
  • Tests added for streaming endpoint failure scenarios
  • Documentation updated to explain observability transaction behavior

🏷️ Labels

  • bug - Data consistency issue
  • observability - Affects observability system
  • database - Transaction management
  • triage - Needs priority/approach decision

📌 Notes


🧪 Test Plan

# Test case: Late failure after observability commit
async def test_late_failure_no_orphaned_observability(client, db):
    """Verify observability data is not orphaned on late failures."""
    
    # Create streaming endpoint that fails mid-stream
    @app.get("/test-stream")
    async def stream_endpoint():
        yield "chunk1"
        yield "chunk2"
        raise ValueError("Late failure")
    
    # Make request
    response = await client.get("/test-stream", stream=True)
    
    # Verify: No orphaned traces in database
    traces = db.query(ObservabilityTrace).all()
    assert len(traces) == 0 or all(t.status == "error" for t in traces)

Metadata

Metadata

Assignees

Labels

SHOULDP2: Important but not vital; high-value items that are not crucial for the immediate releasebugSomething isn't workingdatabaseobservabilityObservability, logging, monitoring

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions