The MCP session pool (mcpgateway/services/mcp_session_pool.py) causes ~20-45% of tool calls proxied through /servers/{id}/mcp to fail with ToolInvocationError("Tool invocation failed: "). The failure is consistently reproduced across all backend servers (Fast Test, Fast Time), all transports (StreamableHTTP, SSE), and even sequential single-user calls. Disabling the pool (MCP_SESSION_POOL_ENABLED=false) eliminates the failures entirely. No pool configuration parameter reduces the failure rate.
The root cause is an architectural mismatch: _create_session manually calls transport_ctx.__aenter__() and session.__aenter__(), which attaches anyio cancel scopes to the HTTP request handler task. When a child task in the transport's internal TaskGroup fails, it cancels the host task, killing in-progress call_tool() operations.
Relation to #3520: This issue provides a deeper root cause analysis. #3520 identified broken session recycling as the symptom; this issue identifies the cancel scope leak as the underlying cause. PR #3605 partially addresses #3520 but does not fix the cancel scope issue.
# Restart gateway with pool disabled
docker compose stop gateway
MCP_SESSION_POOL_ENABLED=false docker compose up -d gateway
# Wait for healthy, then repeat the same 30 calls — all succeed
Expected Behavior
All 30 sequential tool calls should succeed (as they do with pool disabled or when calling the backend server directly).
SDK isolation tests confirm the MCP SDK handles session reuse correctly when used directly. The bug is in the gateway's pool.
Configuration Sweep (No Config Helps)
Every pool parameter was tested. None significantly reduce failures:
| Config | Result (30 calls) |
|---|---|
| Baseline (defaults) | 26/30 (4 fail) |
| HEALTH_CHECK_INTERVAL=0 | 24/30 (6 fail) |
| TTL=1s | 25/30 (5 fail) |
| MAX_PER_KEY=1 | 25/30 (5 fail) |
| EXPLICIT_HEALTH_RPC=true | 27/30 (3 fail) |
| HEALTH_METHODS=[list_tools] | 27/30 (3 fail) |
| INTERVAL=0 + METHODS=[list_tools] | 27/30 (3 fail) |
| INTERVAL=0 + METHODS=[ping] | 26/30 (4 fail) |
| TTL=0 (never reuse) | 23/30 (7 fail — worse) |
| POOL_ENABLED=false | 30/30 (0 fail) |
TTL=0 makes things worse because forcing fresh session creation on every call increases the window for cancel scope conflicts.
Root Cause Analysis
The cancel scope leak
_create_session (mcp_session_pool.py:1137, 1151) manually enters transport and session contexts:
# Line 1137 — enters transport TaskGroup cancel scope on the request handler task
read_stream, write_stream, _ = await transport_ctx.__aenter__()

# Line 1151 — enters session TaskGroup cancel scope on the request handler task
await session.__aenter__()
This attaches anyio cancel scopes to the HTTP request handler task.
How it kills tool calls

post_writer (SDK streamable_http.py) spawns handle_request_async tasks with no try/except:

async def handle_request_async():
    await self._handle_post_request(ctx)  # NO error handling

if isinstance(message.root, JSONRPCRequest):
    tg.start_soon(handle_request_async)  # Spawned in transport TaskGroup
If _handle_post_request raises (HTTP error, connection error), the exception propagates to the TaskGroup, which cancels the transport scope — and with it, the host task (the request handler):
The host task (running call_tool()) receives CancelledError. asyncio.wait_for does NOT convert it to TimeoutError (its own timeout didn't fire). The error surfaces as ClosedResourceError or CancelledError.
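The wait_for behavior can be checked in isolation. This sketch (illustrative, not the gateway's code) uses task.cancel() to stand in for a collapsing cancel scope:

```python
import asyncio

# asyncio.wait_for only raises TimeoutError for its *own* timer; an
# external cancellation of the waiting task surfaces as CancelledError.
async def demo():
    async def slow_call():
        await asyncio.sleep(10)  # stands in for call_tool()

    task = asyncio.create_task(asyncio.wait_for(slow_call(), timeout=5))
    await asyncio.sleep(0)  # let wait_for start waiting
    task.cancel()           # external cancel, not wait_for's timeout
    try:
        await task
    except asyncio.TimeoutError:
        return "TimeoutError"
    except asyncio.CancelledError:
        return "CancelledError"

print(asyncio.run(demo()))
```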
Why the non-pool path survives
Without pooling, async with context managers properly unwind cancel scopes via __aexit__.
In the pool path, scopes A and B are ENTERED but __aexit__ is deferred. The cancel scope hierarchy leaks onto the host task, allowing child task failures to cancel the request handler directly.
Cancel Scope Cleanup Noise

With pool enabled, the gateway logs Stateless session crashed and RuntimeError: cancel scope on failing requests. With pool disabled, these entries did not appear in testing; the noise is coupled to pooled-session failures.
Related upstream anyio issues (agronholm/anyio):

- #415 — asyncio.wait_for incompatible with anyio cancel scopes (OPEN; directly relevant)
- #787 — Child tasks don't cancel group scope correctly on asyncio (OPEN; timing issues)
Recommended Fix Options
Dedicated background task per pooled session (most likely architectural fix; not yet validated): Run transport/session lifecycle in a dedicated asyncio.create_task() per pooled session, keeping the request handler task outside the transport's cancel scope hierarchy.
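A minimal sketch of that pattern, assuming a hypothetical PooledSessionHost wrapper (none of these names exist in the gateway): the lifecycle coroutine owns both __aenter__ and __aexit__, and request handlers only borrow the session object.

```python
import asyncio

# Hypothetical sketch of the proposed fix: all context-manager lifecycle
# runs inside one dedicated task, so the transport's cancel scopes attach
# to that task instead of the request handler.
class PooledSessionHost:
    def __init__(self, open_session):
        self._open_session = open_session  # async-context-manager factory
        self._ready = asyncio.Event()
        self._closing = asyncio.Event()
        self.session = None
        self._task = None

    async def start(self):
        self._task = asyncio.create_task(self._run())
        await self._ready.wait()

    async def _run(self):
        # __aenter__ and __aexit__ both execute here, on the dedicated task.
        async with self._open_session() as session:
            self.session = session
            self._ready.set()
            await self._closing.wait()  # hold scopes open until shutdown

    async def close(self):
        self._closing.set()
        await self._task
```

Error handling, health checks, and propagation of _run crashes back to start() are omitted; a real implementation would need them.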
Affected Component

mcpgateway - API

Steps to Reproduce

Minimal curl reproduction (no load test needed); with the pool disabled, the same calls succeed 100% of the time.
Logs / Error Output

Client-visible error:

{"jsonrpc":"2.0","id":3,"result":{"content":[{"type":"text","text":"Tool invocation failed: "}],"isError":true}}

Gateway stack trace:
Comprehensive Test Matrix

- Pool ENABLED (30 sequential calls each)
- Pool DISABLED: ALL servers, ALL tools → 30/30 (100%)
- Direct to backend (bypass gateway): 20/20 (100%)
Relationship to #3520 and PR #3605

PR #3605 adds transport-aware is_closed detection. The discard=True logic in the session() context manager is already on main (lines 1894-1900).
Related Upstream Issues
MCP Python SDK (modelcontextprotocol/python-sdk):

- BaseSession.__aenter__() creates TaskGroup cancel scopes bound to the current task
- call_tool() hangs forever after connection loss

anyio (agronholm/anyio): #415 and #787, listed above.
Further fix options, in addition to the dedicated background task above:

- Merge PR #3605 ("fix(session-pool): prevent broken session recycling in MCPSessionPool") (partial fix): transport-aware is_closed catches sessions broken between calls.
- Replace asyncio.wait_for with anyio.fail_after (complementary): prevents cancel scope corruption on timeout paths. Secondary; not sufficient alone.
- Disable pool (proven workaround): MCP_SESSION_POOL_ENABLED=false. 100% success. Performance cost: an extra initialize round-trip per call.

Workaround

Set MCP_SESSION_POOL_ENABLED=false in .env or docker-compose.yml.

Environment Info
Additional Context

- The /rpc endpoint shares the same backend pool path (tool_service.invoke_tool) and may be exposed, but the same failure rate was not reliably reproduced on /rpc in repeated runs.
- The RCA document is at todo/rca-echo-failure.md in the repository.