Summary
During load testing with 4000 concurrent users (~1,800+ RPS), 182 instances of psycopg.errors.ProtocolViolation: idle transaction timeout were observed. All timeouts occurred on a single gateway replica (gateway-1), while gateway-2 and gateway-3 had zero timeouts.
Error Details
psycopg.errors.ProtocolViolation: idle transaction timeout
sqlalchemy.exc.OperationalError: (psycopg.errors.ProtocolViolation) idle transaction timeout
Root Cause Analysis
The idle transaction timeout is triggered by PgBouncer's IDLE_TRANSACTION_TIMEOUT=300s setting. This happens when:
- A FastAPI route handler acquires a database session via
Depends(get_db)
- The route makes a long-running async operation (e.g., MCP tool call taking 2-3+ seconds)
- Multiple such requests queue up, and some transactions sit idle beyond the 300s limit
- PgBouncer terminates the stale connection to protect the pool
Why only gateway-1?
Load balancer distribution may have sent more slow RPC requests to gateway-1 during the test window, causing connection accumulation.
Affected Endpoints
Based on load test error correlation:
/rpc fast-time-* (MCP tool calls)
/rpc fast-test-* (MCP tool calls)
/rpc resources/list, /rpc tools/list
/protocol/ping
Impact
- Error Rate: 0.034% of total requests (182 out of 537,040)
- User Impact: Low - requests fail with 500 error but system recovers
- Data Impact: None - transactions are rolled back cleanly
Potential Solutions
Option 1: Reduce database session scope (Recommended)
For endpoints that perform long-running MCP operations:
- Query database first, close session
- Perform MCP operation
- Open new session for any writes
Example pattern:
@router.post("/rpc")
async def execute_rpc(request: RPCRequest):
# Phase 1: Read from DB (short-lived session)
with get_session() as db:
tool = db.query(Tool).filter(Tool.id == request.tool_id).first()
tool_data = tool.to_dict() # Detach from session
# Phase 2: Long-running MCP call (no DB session held)
result = await mcp_pool.call_tool(tool_data, request.params)
# Phase 3: Write metrics (new short-lived session)
with get_session() as db:
db.add(ToolMetric(tool_id=tool_data["id"], ...))
db.commit()
return result
Option 2: Increase idle transaction timeout
# pgbouncer.ini
idle_transaction_timeout = 600 # Up from 300
Risk: May allow connection pool exhaustion under extreme load.
Option 3: Add async-safe session handling
Ensure all Depends(get_db) usages in async routes release sessions before any await that could take > 1 second.
Environment
- Load: 4000 concurrent users, 32 locust workers
- Gateway: 3 replicas × 24 workers each
- Database: PostgreSQL via PgBouncer (214 idle connections post-test)
- Configuration:
MCP_SESSION_POOL_MAX_PER_KEY=200
IDLE_TRANSACTION_TIMEOUT=300 (PgBouncer)
PGBOUNCER_POOL_SIZE=100
Acceptance Criteria
Labels
Summary
During load testing with 4000 concurrent users (~1,800+ RPS), 182 instances of
psycopg.errors.ProtocolViolation: idle transaction timeoutwere observed. All timeouts occurred on a single gateway replica (gateway-1), whilegateway-2andgateway-3had zero timeouts.Error Details
Root Cause Analysis
The idle transaction timeout is triggered by PgBouncer's
IDLE_TRANSACTION_TIMEOUT=300ssetting. This happens when:Depends(get_db)Why only gateway-1?
Load balancer distribution may have sent more slow RPC requests to gateway-1 during the test window, causing connection accumulation.
Affected Endpoints
Based on load test error correlation:
/rpc fast-time-*(MCP tool calls)/rpc fast-test-*(MCP tool calls)/rpc resources/list,/rpc tools/list/protocol/pingImpact
Potential Solutions
Option 1: Reduce database session scope (Recommended)
For endpoints that perform long-running MCP operations:
Example pattern:
Option 2: Increase idle transaction timeout
Risk: May allow connection pool exhaustion under extreme load.
Option 3: Add async-safe session handling
Ensure all
Depends(get_db)usages in async routes release sessions before anyawaitthat could take > 1 second.Environment
MCP_SESSION_POOL_MAX_PER_KEY=200IDLE_TRANSACTION_TIMEOUT=300(PgBouncer)PGBOUNCER_POOL_SIZE=100Acceptance Criteria
Labels