Skip to content

fix: prevent MCP response loss from stdout suppression race condition#43

Closed
anchapin wants to merge 1 commit intoNatLabRockies:developfrom
anchapin:fix/stdout-suppression-race
Closed

fix: prevent MCP response loss from stdout suppression race condition#43
anchapin wants to merge 1 commit intoNatLabRockies:developfrom
anchapin:fix/stdout-suppression-race

Conversation

@anchapin
Copy link
Copy Markdown
Collaborator

@anchapin anchapin commented Apr 9, 2026

Fixes #42

Problem

All MCP tools returned error -32001: Request timed out when any slow tool (create_new_building, create_baseline_osm, HVAC creation) ran concurrently with any other tool call — including trivial ones like get_server_status.

Root Cause

The FastMCP middleware held os.dup2() on process-global fd 1 (stdout) redirected to stderr for the entire duration of each tool call. FastMCP dispatches sync tools via anyio.to_thread.run_sync — multiple tools can run simultaneously. While Thread A held fd 1 → stderr, the asyncio stdout_writer task wrote Thread B's JSON-RPC response to fd 1 (stderr) → client received nothing → timeout.

Fix

  • Remove global middleware from server.py
  • Add threading.RLock to suppress_openstudio_warnings() so concurrent worker threads cannot interleave their fd 1 redirects
  • RLock (not Lock) allows same-thread nested re-entry (e.g. create_baseline_osm → set_constructions → add_hvac all on one thread)
  • Delete dead create_suppression_middleware() function
  • Move flush() inside try/finally to prevent fd leak on flush error
  • Restore targeted suppression to all 15+ SDK callsites that emit SWIG memory warnings: BCLMeasure(), Model(), DesignDay(), model.save(), VersionTranslator.loadModel(), and full HVAC creation blocks across 10 files
  • Add is_initialized() guard in baseline_model.set_constructions()

Test Results

  • Unit tests: 165 passed, 0 failures (pytest -m "not integration")
  • CI Shard 5 (integration): 89 passed in 5:48 (Docker)
    • includes new test_concurrent_tools.py: fires create_baseline_osm + get_server_status concurrently, asserts both return valid responses

Known Limitation

HVAC creation blocks (~2–10s) are the widest suppression windows. Only concurrent multi-client stdio usage could theoretically lose a response during HVAC tool calls. Sequential clients (Claude Desktop, Cursor) are unaffected.

Files Changed (13)

Area Change
server.py Remove middleware
stdout_suppression.py RLock, delete dead code, flush fix
model_management/operations.py 3 callsites
model_management/baseline_model.py set_constructions + add_hvac
hvac_systems/operations.py 6 HVAC creation functions
weather/operations.py DesignDay() constructor
common_measures/operations.py BCLMeasure() constructor
comstock/operations.py BCLMeasure() + Model()
geometry/operations.py model.save()
measures/operations.py BCLMeasure() + model.save()
measure_authoring/operations.py 4 callsites
tests/test_concurrent_tools.py New regression test
.github/workflows/ci.yml Add test to shard 5

Root cause: the FastMCP middleware held os.dup2() on process-global fd 1
(stdout) redirected to stderr for the entire duration of each tool call.
FastMCP dispatches sync tools via anyio.to_thread.run_sync, so concurrent
tool calls could run simultaneously. While Thread A held fd 1 → stderr,
the asyncio stdout_writer task (event loop thread) wrote Thread B's JSON-
RPC response to fd 1 — which was stderr — and the MCP client received
nothing, causing MCP error -32001 timeouts on all tools.

Fix:
- Remove global FastMCP middleware from server.py
- Add threading.RLock to suppress_openstudio_warnings() so concurrent
  worker threads cannot interleave their fd 1 redirects
- Use RLock (not Lock) to allow safe same-thread nested re-entry
  (e.g. create_baseline_osm → set_constructions → add_hvac)
- Delete dead create_suppression_middleware() function
- Move flush() inside try/finally to prevent fd leak on flush error
- Restore targeted suppression to all SDK callsites that emit SWIG
  memory warnings: BCLMeasure(), Model(), DesignDay(), model.save(),
  VersionTranslator.loadModel(), and full HVAC creation blocks across
  hvac_systems, weather, common_measures, comstock, geometry, measures,
  measure_authoring, model_management, and baseline_model
- Add is_initialized() guard in baseline_model.set_constructions()
- Add regression test (test_concurrent_tools.py, CI shard 5) that
  fires create_baseline_osm + get_server_status concurrently and
  asserts both return valid responses

Known limitation: HVAC creation blocks (~2-10s) are the widest
suppression windows. Sequential MCP clients (Claude Desktop, Cursor)
are unaffected; only concurrent multi-client stdio usage could
theoretically lose a response during HVAC tool calls.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@brianlball
Copy link
Copy Markdown
Collaborator

brianlball commented Apr 10, 2026

good catch! see #46, 2395d95

@brianlball brianlball closed this Apr 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MCP error -32001: Request timed out on all tools — stdout suppression race condition

2 participants