[BUG][PERFORMANCE]: Idle transaction timeout under high load (4000+ users)

## Summary

During load testing with 4000 concurrent users (~1,800+ RPS), **182 instances** of `psycopg.errors.ProtocolViolation: idle transaction timeout` were observed. All timeouts occurred on a single gateway replica (`gateway-1`), while `gateway-2` and `gateway-3` had zero timeouts.

## Error Details

```
psycopg.errors.ProtocolViolation: idle transaction timeout
sqlalchemy.exc.OperationalError: (psycopg.errors.ProtocolViolation) idle transaction timeout
```

## Root Cause Analysis

The idle transaction timeout is triggered by PgBouncer's `IDLE_TRANSACTION_TIMEOUT=300s` setting. This happens when:

1. A FastAPI route handler acquires a database session via `Depends(get_db)`
2. The route makes a long-running async operation (e.g., MCP tool call taking 2-3+ seconds)
3. Multiple such requests queue up, and some transactions sit idle beyond the 300s limit
4. PgBouncer terminates the stale connection to protect the pool

### Why only gateway-1?

Load balancer distribution may have sent more slow RPC requests to gateway-1 during the test window, causing connection accumulation.

## Affected Endpoints

Based on load test error correlation:
- `/rpc fast-time-*` (MCP tool calls)
- `/rpc fast-test-*` (MCP tool calls)
- `/rpc resources/list`, `/rpc tools/list`
- `/protocol/ping`

## Impact

- **Error Rate**: 0.034% of total requests (182 out of 537,040)
- **User Impact**: Low - requests fail with 500 error but system recovers
- **Data Impact**: None - transactions are rolled back cleanly

## Potential Solutions

### Option 1: Reduce database session scope (Recommended)

For endpoints that perform long-running MCP operations:
1. Query database first, close session
2. Perform MCP operation
3. Open new session for any writes

Example pattern:
```python
@router.post("/rpc")
async def execute_rpc(request: RPCRequest):
    # Phase 1: Read from DB (short-lived session)
    with get_session() as db:
        tool = db.query(Tool).filter(Tool.id == request.tool_id).first()
        tool_data = tool.to_dict()  # Detach from session
    
    # Phase 2: Long-running MCP call (no DB session held)
    result = await mcp_pool.call_tool(tool_data, request.params)
    
    # Phase 3: Write metrics (new short-lived session)
    with get_session() as db:
        db.add(ToolMetric(tool_id=tool_data["id"], ...))
        db.commit()
    
    return result
```

### Option 2: Increase idle transaction timeout

```ini
# pgbouncer.ini
idle_transaction_timeout = 600  # Up from 300
```

Risk: May allow connection pool exhaustion under extreme load.

### Option 3: Add async-safe session handling

Ensure all `Depends(get_db)` usages in async routes release sessions before any `await` that could take > 1 second.

## Environment

- **Load**: 4000 concurrent users, 32 locust workers
- **Gateway**: 3 replicas × 24 workers each
- **Database**: PostgreSQL via PgBouncer (214 idle connections post-test)
- **Configuration**:
  - `MCP_SESSION_POOL_MAX_PER_KEY=200`
  - `IDLE_TRANSACTION_TIMEOUT=300` (PgBouncer)
  - `PGBOUNCER_POOL_SIZE=100`

## Acceptance Criteria

- [ ] Identify specific route handlers holding sessions during MCP calls
- [ ] Refactor to minimize session scope
- [ ] Re-run load test with 4000 users and verify < 10 idle transaction timeouts

## Labels

- bug
- performance
- database

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG][PERFORMANCE]: Idle transaction timeout under high load (4000+ users) #2550

Summary

Error Details

Root Cause Analysis

Why only gateway-1?

Affected Endpoints

Impact

Potential Solutions

Option 1: Reduce database session scope (Recommended)

Option 2: Increase idle transaction timeout

Option 3: Add async-safe session handling

Environment

Acceptance Criteria

Labels

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[BUG][PERFORMANCE]: Idle transaction timeout under high load (4000+ users) #2550

Description

Summary

Error Details

Root Cause Analysis

Why only gateway-1?

Affected Endpoints

Impact

Potential Solutions

Option 1: Reduce database session scope (Recommended)

Option 2: Increase idle transaction timeout

Option 3: Add async-safe session handling

Environment

Acceptance Criteria

Labels

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions