Skip to content

[BUG][PERFORMANCE]: Idle transaction timeout under high load (4000+ users) #2550

@crivetimihai

Description

@crivetimihai

Summary

During load testing with 4000 concurrent users (~1,800+ RPS), 182 instances of psycopg.errors.ProtocolViolation: idle transaction timeout were observed. All timeouts occurred on a single gateway replica (gateway-1), while gateway-2 and gateway-3 had zero timeouts.

Error Details

psycopg.errors.ProtocolViolation: idle transaction timeout
sqlalchemy.exc.OperationalError: (psycopg.errors.ProtocolViolation) idle transaction timeout

Root Cause Analysis

The idle transaction timeout is triggered by PgBouncer's IDLE_TRANSACTION_TIMEOUT=300s setting. This happens when:

  1. A FastAPI route handler acquires a database session via Depends(get_db)
  2. The route makes a long-running async operation (e.g., MCP tool call taking 2-3+ seconds)
  3. Multiple such requests queue up, and some transactions sit idle beyond the 300s limit
  4. PgBouncer terminates the stale connection to protect the pool

Why only gateway-1?

Load balancer distribution may have sent more slow RPC requests to gateway-1 during the test window, causing connection accumulation.

Affected Endpoints

Based on load test error correlation:

  • /rpc fast-time-* (MCP tool calls)
  • /rpc fast-test-* (MCP tool calls)
  • /rpc resources/list, /rpc tools/list
  • /protocol/ping

Impact

  • Error Rate: 0.034% of total requests (182 out of 537,040)
  • User Impact: Low - requests fail with 500 error but system recovers
  • Data Impact: None - transactions are rolled back cleanly

Potential Solutions

Option 1: Reduce database session scope (Recommended)

For endpoints that perform long-running MCP operations:

  1. Query database first, close session
  2. Perform MCP operation
  3. Open new session for any writes

Example pattern:

@router.post("/rpc")
async def execute_rpc(request: RPCRequest):
    # Phase 1: Read from DB (short-lived session)
    with get_session() as db:
        tool = db.query(Tool).filter(Tool.id == request.tool_id).first()
        tool_data = tool.to_dict()  # Detach from session
    
    # Phase 2: Long-running MCP call (no DB session held)
    result = await mcp_pool.call_tool(tool_data, request.params)
    
    # Phase 3: Write metrics (new short-lived session)
    with get_session() as db:
        db.add(ToolMetric(tool_id=tool_data["id"], ...))
        db.commit()
    
    return result

Option 2: Increase idle transaction timeout

# pgbouncer.ini
idle_transaction_timeout = 600  # Up from 300

Risk: May allow connection pool exhaustion under extreme load.

Option 3: Add async-safe session handling

Ensure all Depends(get_db) usages in async routes release sessions before any await that could take > 1 second.

Environment

  • Load: 4000 concurrent users, 32 locust workers
  • Gateway: 3 replicas × 24 workers each
  • Database: PostgreSQL via PgBouncer (214 idle connections post-test)
  • Configuration:
    • MCP_SESSION_POOL_MAX_PER_KEY=200
    • IDLE_TRANSACTION_TIMEOUT=300 (PgBouncer)
    • PGBOUNCER_POOL_SIZE=100

Acceptance Criteria

  • Identify specific route handlers holding sessions during MCP calls
  • Refactor to minimize session scope
  • Re-run load test with 4000 users and verify < 10 idle transaction timeouts

Labels

  • bug
  • performance
  • database

Metadata

Metadata

Assignees

Labels

SHOULDP2: Important but not vital; high-value items that are not crucial for the immediate releasebugSomething isn't workingdatabaseperformancePerformance related itemspythonPython / backend development (FastAPI)

Type

No fields configured for Task.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions