[BUG][OBSERVABILITY]: ObservabilityService _safe_commit bypasses main transaction control, causing orphaned records on late failures

## 🔗 Related Issues

Follow-up to #3813 (transaction control fix)
Related to #3731 (transaction management violation)

---

## 📝 Summary

**Problem**: `ObservabilityService._safe_commit()` eagerly commits observability data (traces, spans, events) independently of the main request transaction managed by `get_db()`. This creates orphaned observability records when errors occur after observability commits but before the main transaction completes.

**Impact**: Data consistency issue in observability records, particularly for streaming endpoints where late failures can occur after observability data has been committed.

---

## 🐞 Bug Details

### Current Behavior

In `mcpgateway/services/observability_service.py`, the `_safe_commit()` method is called by:
- `start_trace()` - line 276
- `end_trace()` - line 324  
- `start_span()` - line 409 (when `commit=True`)
- `end_span()` - line 452 (when `commit=True`)
- `add_event()` - line 636
- `record_token_usage()` - line 711
- `record_metric()` - line 1048
- `delete_old_traces()` - line 1434

Each call to `_safe_commit()` commits the session **immediately**, independent of the main request transaction.

### The Problem

**Late-Failure Streaming Edge Case**:

```
1. Request arrives → observability_middleware starts trace
2. _safe_commit() commits trace to DB ✅
3. Handler begins processing (e.g., streaming response)
4. Error occurs AFTER observability commit
5. get_db() rolls back main transaction ❌
6. Observability records remain in DB → orphaned data
```

**Example Scenario**:
```python
# Middleware
start_trace(db, trace_id, ...)  
_safe_commit(db, "start_trace")  # ← Commits observability data

# Later in streaming handler
async def stream_response():
    yield chunk1  # Success
    yield chunk2  # Success  
    # Error occurs here
    raise ValueError("Stream failed")  # Main transaction rolls back

# Result: Trace shows "success" but request actually failed
```

### Root Cause

Observability commits are eager and independent of main transaction control. The PR #3813 note acknowledged this:

> "The observability service methods call `_safe_commit()` which commits the session eagerly. This is pre-existing behavior (untouched by this PR) and does not undermine the fix — for the error case, `get_db()` rolls back before `_safe_commit` runs on `end_span`/`end_trace`. The late-failure streaming edge case is pre-existing and out of scope for this focused bug fix."

---

## 🎯 Expected Behavior

Observability data should be committed **atomically** with the main request transaction to ensure consistency.

---

## 💡 Potential Solutions

### Option 1: Remove Eager Commits (Align with Transaction Boundary)

Remove `_safe_commit()` calls from observability methods and let `get_db()` handle all commits:

```python
# In observability_service.py
def start_trace(...):
    trace = ObservabilityTrace(...)
    db.add(trace)
    # NO COMMIT - let get_db() handle it

def end_trace(...):
    trace.end_time = utc_now()
    # NO COMMIT - let get_db() handle it
```

**Pros**:
- Preserves data consistency
- Aligns with transaction control model from #3813
- Simple, clean solution

**Cons**:
- Observability data lost if main transaction rolls back
- Could lose visibility into partial failures

### Option 2: Separate Observability Session (Best-Effort Pattern)

Use a separate database session for observability to maintain independence:

```python
def _get_observability_session() -> Session:
    """Create independent session for observability (best-effort)."""
    return SessionLocal()

def start_trace(...):
    with _get_observability_session() as obs_db:
        trace = ObservabilityTrace(...)
        obs_db.add(trace)
        obs_db.commit()  # Independent commit
```

**Pros**:
- Maintains observability visibility even when main request fails
- Best-effort pattern (observability shouldn't block requests)
- Aligns with #1952 (4-database architecture for observability separation)

**Cons**:
- More complex
- Requires careful session management

### Option 3: Deferred Observability Writes (Queue-Based)

Queue observability writes and commit them after main transaction:

```python
def start_trace(...):
    trace = ObservabilityTrace(...)
    request_state.observability_queue.append(trace)
    # Commit in middleware AFTER get_db() commits

# In middleware after get_db() commits
for item in request_state.observability_queue:
    obs_db.add(item)
obs_db.commit()
```

**Pros**:
- Ensures main transaction completes first
- Preserves observability visibility

**Cons**:
- Most complex solution
- Requires queue infrastructure

---

## 🔍 Suggested Approach

**Recommendation**: Start with **Option 1** (remove eager commits) as it's the simplest and most consistent with the transaction control fix from #3813.

If we determine that preserving observability visibility for partial failures is critical, evaluate **Option 2** (separate session) as a follow-up, particularly in the context of #1952 (4-database architecture).

---

## 📋 Acceptance Criteria

- [ ] `_safe_commit()` calls removed from observability methods OR separate session pattern implemented
- [ ] Observability data commits aligned with main transaction boundary
- [ ] No orphaned observability records on request failures
- [ ] Tests added for streaming endpoint failure scenarios
- [ ] Documentation updated to explain observability transaction behavior

---

## 🏷️ Labels

- `bug` - Data consistency issue
- `observability` - Affects observability system
- `database` - Transaction management
- `triage` - Needs priority/approach decision

---

## 📌 Notes

- **Pre-existing issue**: Not introduced by #3813, but identified during that PR's review
- **Acknowledged in PR #3813**: Explicitly noted as out of scope for that focused bug fix
- **Affects streaming endpoints most**: SSE, WebSocket, and long-running HTTP requests where late failures can occur
- **Related to #1952**: 4-database architecture could provide natural separation for observability data

---

## 🧪 Test Plan

```python
# Test case: Late failure after observability commit
async def test_late_failure_no_orphaned_observability(client, db):
    """Verify observability data is not orphaned on late failures."""
    
    # Create streaming endpoint that fails mid-stream
    @app.get("/test-stream")
    async def stream_endpoint():
        yield "chunk1"
        yield "chunk2"
        raise ValueError("Late failure")
    
    # Make request
    response = await client.get("/test-stream", stream=True)
    
    # Verify: No orphaned traces in database
    traces = db.query(ObservabilityTrace).all()
    assert len(traces) == 0 or all(t.status == "error" for t in traces)
```

---

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG][OBSERVABILITY]: ObservabilityService _safe_commit bypasses main transaction control, causing orphaned records on late failures #3883

🔗 Related Issues

📝 Summary

🐞 Bug Details

Current Behavior

The Problem

Root Cause

🎯 Expected Behavior

💡 Potential Solutions

Option 1: Remove Eager Commits (Align with Transaction Boundary)

Option 2: Separate Observability Session (Best-Effort Pattern)

Option 3: Deferred Observability Writes (Queue-Based)

🔍 Suggested Approach

📋 Acceptance Criteria

🏷️ Labels

📌 Notes

🧪 Test Plan

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[BUG][OBSERVABILITY]: ObservabilityService _safe_commit bypasses main transaction control, causing orphaned records on late failures #3883

Description

🔗 Related Issues

📝 Summary

🐞 Bug Details

Current Behavior

The Problem

Root Cause

🎯 Expected Behavior

💡 Potential Solutions

Option 1: Remove Eager Commits (Align with Transaction Boundary)

Option 2: Separate Observability Session (Best-Effort Pattern)

Option 3: Deferred Observability Writes (Queue-Based)

🔍 Suggested Approach

📋 Acceptance Criteria

🏷️ Labels

📌 Notes

🧪 Test Plan

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions