🔗 Related Issues
Follow-up to #3813 (transaction control fix)
Related to #3731 (transaction management violation)
📝 Summary
Problem: ObservabilityService._safe_commit() eagerly commits observability data (traces, spans, events) independently of the main request transaction managed by get_db(). This creates orphaned observability records when errors occur after observability commits but before the main transaction completes.
Impact: Data consistency issue in observability records, particularly for streaming endpoints where late failures can occur after observability data has been committed.
🐞 Bug Details
Current Behavior
In mcpgateway/services/observability_service.py, the _safe_commit() method is called by:
start_trace() - line 276
end_trace() - line 324
start_span() - line 409 (when commit=True)
end_span() - line 452 (when commit=True)
add_event() - line 636
record_token_usage() - line 711
record_metric() - line 1048
delete_old_traces() - line 1434
Each call to _safe_commit() commits the session immediately, independent of the main request transaction.
The Problem
Late-Failure Streaming Edge Case:
1. Request arrives → observability_middleware starts trace
2. _safe_commit() commits trace to DB ✅
3. Handler begins processing (e.g., streaming response)
4. Error occurs AFTER observability commit
5. get_db() rolls back main transaction ❌
6. Observability records remain in DB → orphaned data
Example Scenario:
# Middleware
start_trace(db, trace_id, ...)
_safe_commit(db, "start_trace") # ← Commits observability data
# Later in streaming handler
async def stream_response():
yield chunk1 # Success
yield chunk2 # Success
# Error occurs here
raise ValueError("Stream failed") # Main transaction rolls back
# Result: Trace shows "success" but request actually failed
Root Cause
Observability commits are eager and independent of main transaction control. The PR #3813 note acknowledged this:
"The observability service methods call _safe_commit() which commits the session eagerly. This is pre-existing behavior (untouched by this PR) and does not undermine the fix — for the error case, get_db() rolls back before _safe_commit runs on end_span/end_trace. The late-failure streaming edge case is pre-existing and out of scope for this focused bug fix."
🎯 Expected Behavior
Observability data should be committed atomically with the main request transaction to ensure consistency.
💡 Potential Solutions
Option 1: Remove Eager Commits (Align with Transaction Boundary)
Remove _safe_commit() calls from observability methods and let get_db() handle all commits:
# In observability_service.py
def start_trace(...):
trace = ObservabilityTrace(...)
db.add(trace)
# NO COMMIT - let get_db() handle it
def end_trace(...):
trace.end_time = utc_now()
# NO COMMIT - let get_db() handle it
Pros:
Cons:
- Observability data lost if main transaction rolls back
- Could lose visibility into partial failures
Option 2: Separate Observability Session (Best-Effort Pattern)
Use a separate database session for observability to maintain independence:
def _get_observability_session() -> Session:
"""Create independent session for observability (best-effort)."""
return SessionLocal()
def start_trace(...):
with _get_observability_session() as obs_db:
trace = ObservabilityTrace(...)
obs_db.add(trace)
obs_db.commit() # Independent commit
Pros:
Cons:
- More complex
- Requires careful session management
Option 3: Deferred Observability Writes (Queue-Based)
Queue observability writes and commit them after main transaction:
def start_trace(...):
trace = ObservabilityTrace(...)
request_state.observability_queue.append(trace)
# Commit in middleware AFTER get_db() commits
# In middleware after get_db() commits
for item in request_state.observability_queue:
obs_db.add(item)
obs_db.commit()
Pros:
- Ensures main transaction completes first
- Preserves observability visibility
Cons:
- Most complex solution
- Requires queue infrastructure
🔍 Suggested Approach
Recommendation: Start with Option 1 (remove eager commits) as it's the simplest and most consistent with the transaction control fix from #3813.
If we determine that preserving observability visibility for partial failures is critical, evaluate Option 2 (separate session) as a follow-up, particularly in the context of #1952 (4-database architecture).
📋 Acceptance Criteria
🏷️ Labels
bug - Data consistency issue
observability - Affects observability system
database - Transaction management
triage - Needs priority/approach decision
📌 Notes
🧪 Test Plan
# Test case: Late failure after observability commit
async def test_late_failure_no_orphaned_observability(client, db):
"""Verify observability data is not orphaned on late failures."""
# Create streaming endpoint that fails mid-stream
@app.get("/test-stream")
async def stream_endpoint():
yield "chunk1"
yield "chunk2"
raise ValueError("Late failure")
# Make request
response = await client.get("/test-stream", stream=True)
# Verify: No orphaned traces in database
traces = db.query(ObservabilityTrace).all()
assert len(traces) == 0 or all(t.status == "error" for t in traces)
🔗 Related Issues
Follow-up to #3813 (transaction control fix)
Related to #3731 (transaction management violation)
📝 Summary
Problem:
ObservabilityService._safe_commit()eagerly commits observability data (traces, spans, events) independently of the main request transaction managed byget_db(). This creates orphaned observability records when errors occur after observability commits but before the main transaction completes.Impact: Data consistency issue in observability records, particularly for streaming endpoints where late failures can occur after observability data has been committed.
🐞 Bug Details
Current Behavior
In
mcpgateway/services/observability_service.py, the_safe_commit()method is called by:start_trace()- line 276end_trace()- line 324start_span()- line 409 (whencommit=True)end_span()- line 452 (whencommit=True)add_event()- line 636record_token_usage()- line 711record_metric()- line 1048delete_old_traces()- line 1434Each call to
_safe_commit()commits the session immediately, independent of the main request transaction.The Problem
Late-Failure Streaming Edge Case:
Example Scenario:
Root Cause
Observability commits are eager and independent of main transaction control. The PR #3813 note acknowledged this:
🎯 Expected Behavior
Observability data should be committed atomically with the main request transaction to ensure consistency.
💡 Potential Solutions
Option 1: Remove Eager Commits (Align with Transaction Boundary)
Remove
_safe_commit()calls from observability methods and letget_db()handle all commits:Pros:
Cons:
Option 2: Separate Observability Session (Best-Effort Pattern)
Use a separate database session for observability to maintain independence:
Pros:
Cons:
Option 3: Deferred Observability Writes (Queue-Based)
Queue observability writes and commit them after main transaction:
Pros:
Cons:
🔍 Suggested Approach
Recommendation: Start with Option 1 (remove eager commits) as it's the simplest and most consistent with the transaction control fix from #3813.
If we determine that preserving observability visibility for partial failures is critical, evaluate Option 2 (separate session) as a follow-up, particularly in the context of #1952 (4-database architecture).
📋 Acceptance Criteria
_safe_commit()calls removed from observability methods OR separate session pattern implemented🏷️ Labels
bug- Data consistency issueobservability- Affects observability systemdatabase- Transaction managementtriage- Needs priority/approach decision📌 Notes
🧪 Test Plan