
Concurrent server-side sampling requests are serialized end-to-end #4006

@demoray

Description


Enhancement

When a FastMCP server tool issues multiple ctx.sample(...) calls concurrently (e.g. via asyncio.gather), the client only ever processes one at a time: the underlying MCP BaseSession._receive_loop awaits each incoming request handler inline before reading the next message off the stream, so sampling responses come back strictly serially even though the server-side awaits are concurrent.
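To make the serialization concrete, here is a simplified sketch of the two receive-loop strategies. The handler and message names are illustrative stand-ins, not the actual `BaseSession` internals:

```python
import anyio

# Simplified sketch of the two receive-loop strategies. The handler and
# message names here are illustrative stand-ins, not the actual
# mcp.shared.session.BaseSession internals.

async def handle_request(msg: int) -> None:
    # Stand-in for a sampling request handler with real wall-clock latency.
    await anyio.sleep(0.5)
    print(f"handled request {msg}")

async def serial_receive_loop(messages: list[int]) -> None:
    # Current behaviour: each handler is awaited inline, so the next message
    # is not read until the previous response has been produced.
    for msg in messages:
        await handle_request(msg)

async def concurrent_receive_loop(messages: list[int]) -> None:
    # Desired behaviour: handlers run in a task group, so the loop keeps
    # reading messages while earlier requests are still in flight.
    async with anyio.create_task_group() as tg:
        for msg in messages:
            tg.start_soon(handle_request, msg)

anyio.run(serial_receive_loop, [1, 2, 3, 4, 5])      # ~2.5s total
anyio.run(concurrent_receive_loop, [1, 2, 3, 4, 5])  # ~0.5s total
```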

Concrete use case: I have a tool that fans out dozens of independent ctx.sample calls per invocation against an LLM deployment with provisioned capacity sitting idle. The work is embarrassingly parallel and I have token budget to burn, but wall-clock time scales linearly with the fan-out because every sample blocks the next one. A minimal reproducer (5 concurrent ctx.sample calls with a 0.5s server-side sleep in the sampling handler) shows peak in-flight = 1 and elapsed ≈ N × per-call latency instead of ≈ per-call latency.
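A reproducer along those lines looks roughly like this. It is a sketch that assumes the in-memory fastmcp `Client` and its `sampling_handler` keyword; the exact handler signature may differ across fastmcp versions:

```python
import asyncio
import time

from fastmcp import FastMCP, Client, Context

# Sketch of the reproducer described above. Assumes the in-memory fastmcp
# Client and its sampling_handler keyword; exact handler signatures may
# differ across fastmcp versions.

mcp = FastMCP("sampling-fanout-repro")
N = 5

in_flight = 0
peak_in_flight = 0

@mcp.tool()
async def fan_out(ctx: Context) -> str:
    start = time.monotonic()
    # Fan out N independent sampling requests concurrently.
    await asyncio.gather(*(ctx.sample(f"prompt {i}") for i in range(N)))
    return f"elapsed={time.monotonic() - start:.2f}s"

async def sampling_handler(messages, params, context) -> str:
    # Stand-in for a real LLM call: ~0.5s of latency per request, while
    # tracking how many requests are in flight at once.
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    try:
        await asyncio.sleep(0.5)
        return "ok"
    finally:
        in_flight -= 1

async def main() -> None:
    async with Client(mcp, sampling_handler=sampling_handler) as client:
        result = await client.call_tool("fan_out", {})
        print(result)
        print("peak_in_flight =", peak_in_flight)
        # Observed today: peak_in_flight == 1, elapsed ≈ N * 0.5s.
        # Expected with concurrent handling: peak ≈ N, elapsed ≈ 0.5s.

asyncio.run(main())
```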

I'd like FastMCP to support concurrent handling of sampling requests (and ideally other server-to-client requests) so that asyncio.gather over ctx.sample actually fans out to the LLM, with whatever bounding/back-pressure knob the maintainers think is appropriate.
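For instance, a bounded task-group dispatch could look roughly like the sketch below. This is purely illustrative; `max_concurrent_requests` is a hypothetical knob, not an existing FastMCP or MCP SDK setting:

```python
import anyio

# Purely illustrative sketch of bounded concurrent dispatch; the
# max_concurrent_requests knob and function names are hypothetical,
# not part of the existing FastMCP or MCP SDK API.

async def handle(msg: int) -> None:
    await anyio.sleep(0.5)  # stand-in for running one sampling handler
    print(f"finished request {msg}")

async def bounded_receive_loop(messages: list[int], max_concurrent_requests: int = 8) -> None:
    limiter = anyio.CapacityLimiter(max_concurrent_requests)

    async def run_one(msg: int) -> None:
        async with limiter:  # back-pressure: at most N handlers in flight
            await handle(msg)

    async with anyio.create_task_group() as tg:
        for msg in messages:
            tg.start_soon(run_one, msg)

anyio.run(bounded_receive_loop, list(range(20)))
```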

Labels

client: Related to the FastMCP client SDK or client-side functionality.
enhancement: Improvement to existing functionality. For issues and smaller PR improvements.
