Skip to content

feat(mcp): fix CPU leak with mcp close#1456

Open
qbc2016 wants to merge 2 commits intoagentscope-ai:mainfrom
qbc2016:bc/mcp
Open

feat(mcp): fix CPU leak with mcp close#1456
qbc2016 wants to merge 2 commits intoagentscope-ai:mainfrom
qbc2016:bc/mcp

Conversation

@qbc2016
Copy link
Copy Markdown
Member

@qbc2016 qbc2016 commented Apr 9, 2026

AgentScope Version

1.0.19dev

Description

This PR fixes a critical CPU leak issue in MCP (Model Context Protocol) client management that occurs during workspace reload operations in uvicorn/FastAPI environments.

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has been formatted with pre-commit run --all-files command
  • All tests are passing
  • Docstrings are in Google style
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the MCP stateful client lifecycle management to avoid cross-task context-manager cleanup issues (notably anyio CancelScope “exit in a different task”) that can surface during uvicorn/FastAPI startup/shutdown and reload flows.

Changes:

  • Introduces a dedicated background lifecycle task that owns AsyncExitStack enter/exit for the MCP client session.
  • Reworks connect()/close() to coordinate lifecycle via asyncio events and await full cleanup through the background task.
  • Adds explanatory documentation in the class docstring about cross-task lifecycle handling.

Comment on lines +119 to +137
async def connect(self) -> None:
"""Connect to MCP server.

Spawns a background task that owns the full context-manager
lifecycle so that ``close()`` can be called from any task.
"""
if self.is_connected:
raise RuntimeError(
"The MCP server is already connected. Call close() "
"before connecting again.",
)

self._stop_event = asyncio.Event()
self._ready_event = asyncio.Event()
self._init_error = None

self._lifecycle_task = asyncio.create_task(
self._lifecycle_worker(),
)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change introduces new cross-task lifecycle semantics (background task ownership + event signaling) intended to prevent the anyio CancelScope “exit in a different task” issue. Existing MCP tests cover basic connect()/close() but don’t cover calling close() from a different task or overlapping connect()/close(). Please add unit tests to lock in the new behavior and prevent regressions (e.g., connect() then asyncio.create_task(client.close()) from another task, ensuring no hang and is_connected transitions correctly).

Copilot uses AI. Check for mistakes.
Comment on lines +125 to +137
if self.is_connected:
raise RuntimeError(
"The MCP server is already connected. Call close() "
"before connecting again.",
)

self._stop_event = asyncio.Event()
self._ready_event = asyncio.Event()
self._init_error = None

self._lifecycle_task = asyncio.create_task(
self._lifecycle_worker(),
)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connect() only checks is_connected, so two concurrent connect() calls can both proceed before _lifecycle_worker flips is_connected=True. The second call overwrites _stop_event/_ready_event/_lifecycle_task, leaving the first lifecycle task orphaned and potentially leaking resources. Consider guarding with _lifecycle_task is not None and not _lifecycle_task.done() (or an asyncio.Lock) and raising a clear error (or awaiting the in-flight connect).

Copilot uses AI. Check for mistakes.
await self._lifecycle_task
self._lifecycle_task = None
raise self._init_error

Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s a race where close() can be called before _ready_event is set: the worker will set _ready_event, immediately observe _stop_event already set, then exit and set is_connected=False. In that case connect() returns successfully even though the client is already closed. After waiting on _ready_event, consider validating that is_connected is still True (and/or _stop_event is not set / _lifecycle_task still running) and raise an error if shutdown happened during connect.

Suggested change
if (
not self.is_connected
or self._stop_event.is_set()
or self._lifecycle_task is None
or self._lifecycle_task.done()
):
if self._lifecycle_task is not None:
await self._lifecycle_task
self._lifecycle_task = None
raise RuntimeError(
"The MCP server was closed during connect().",
)

Copilot uses AI. Check for mistakes.
Comment on lines +89 to +94
# Block until close() signals. The wait may also be
# interrupted by CancelledError if the session's internal
# anyio cancel scope is torn down; treat that as a stop.
try:
await self._stop_event.wait()
except (asyncio.CancelledError, Exception):
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_lifecycle_worker() swallows all exceptions from await self._stop_event.wait() via except (asyncio.CancelledError, Exception): pass. This can hide real programming errors (e.g., _stop_event being None) and make unexpected shutdowns silent. It’s safer to only treat cancellation as a stop signal and let unexpected exceptions propagate/log explicitly.

Suggested change
# Block until close() signals. The wait may also be
# interrupted by CancelledError if the session's internal
# anyio cancel scope is torn down; treat that as a stop.
try:
await self._stop_event.wait()
except (asyncio.CancelledError, Exception):
# Block until close() signals. The wait may also be
# interrupted by CancelledError if the session's internal
# anyio cancel scope is torn down; treat only cancellation
# as a stop signal and let unexpected errors propagate.
try:
await self._stop_event.wait()
except asyncio.CancelledError:
# Cancellation during shutdown is expected here.

Copilot uses AI. Check for mistakes.
Comment on lines 181 to 185
self._lifecycle_task = None
except Exception as e:
if not ignore_errors:
raise e
logger.warning("Error during MCP client cleanup: %s", e)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

close() doesn’t reset _lifecycle_task (and related lifecycle fields) if await self._lifecycle_task raises. That can leave the instance stuck in a state where future close() calls keep re-awaiting a failed task. Consider moving _lifecycle_task = None (and clearing _stop_event/_ready_event) into a finally block so state is consistent even on errors/cancellation.

Suggested change
self._lifecycle_task = None
except Exception as e:
if not ignore_errors:
raise e
logger.warning("Error during MCP client cleanup: %s", e)
except Exception as e:
if not ignore_errors:
raise e
logger.warning("Error during MCP client cleanup: %s", e)
finally:
self._lifecycle_task = None
self._stop_event = None
self._ready_event = None

Copilot uses AI. Check for mistakes.
Comment on lines 66 to 174
@@ -63,36 +83,106 @@ async def connect(self) -> None:
await self.session.initialize()

self.is_connected = True
self._ready_event.set()
logger.info("MCP client connected.")
except Exception:
await self.stack.aclose()
self.stack = None

# Block until close() signals. The wait may also be
# interrupted by CancelledError if the session's internal
# anyio cancel scope is torn down; treat that as a stop.
try:
await self._stop_event.wait()
except (asyncio.CancelledError, Exception):
pass

except Exception as e:
self._init_error = e
self._ready_event.set()
finally:
self.session = None
self.is_connected = False
self._cached_tools = None
if self.stack:
try:
await self.stack.aclose()
except Exception as e:
logger.warning(
"Error during MCP client cleanup: %s",
e,
)
finally:
self.stack = None

# ------------------------------------------------------------------
# Public API
# ------------------------------------------------------------------

async def connect(self) -> None:
"""Connect to MCP server.

Spawns a background task that owns the full context-manager
lifecycle so that ``close()`` can be called from any task.
"""
if self.is_connected:
raise RuntimeError(
"The MCP server is already connected. Call close() "
"before connecting again.",
)

self._stop_event = asyncio.Event()
self._ready_event = asyncio.Event()
self._init_error = None

self._lifecycle_task = asyncio.create_task(
self._lifecycle_worker(),
)

try:
await self._ready_event.wait()
except BaseException:
# If connect() is cancelled externally (e.g. asyncio.wait_for
# timeout), ensure the lifecycle worker is stopped. We must
# cancel the task (not just set _stop_event) because the worker
# may still be blocked inside enter_async_context().
self._lifecycle_task.cancel()
try:
await self._lifecycle_task
except (asyncio.CancelledError, Exception):
pass
self._lifecycle_task = None
raise

if self._init_error is not None:
await self._lifecycle_task
self._lifecycle_task = None
raise self._init_error

async def close(self, ignore_errors: bool = True) -> None:
"""Clean up the MCP client resources. You must call this method when
your application is done.

Signals the background lifecycle task to exit and waits for full
cleanup.

Args:
ignore_errors (`bool`):
Whether to ignore errors during cleanup. Defaults to `True`.
"""
if not self.is_connected:
if not self.is_connected and self._lifecycle_task is None:
raise RuntimeError(
"The MCP server is not connected. Call connect() before "
"closing.",
)
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New/updated docstrings (_lifecycle_worker(), connect(), close()) don’t follow the repository’s docstring template (description + explicit Args/Returns sections with typed backticks). Please align these docstrings with the project’s standard so generated docs stay consistent.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants