Stdio MCP server gives up after sustained Docker daemon disruption #5171

@gkatz2

Bug description

RunWorkload's retry loop (in pkg/workloads/manager.go) can exhaust its
10-attempt retry budget over the course of 10–25 minutes when the Docker
daemon is repeatedly unresponsive, and permanently stop trying to keep a
stdio workload alive, even though every individual restart cycle ToolHive
attempted actually succeeded in starting the container. The user-facing
symptom: a previously healthy MCP server goes offline and stays offline
until it is manually restarted.

The retry counter does not reset between successful runs. Every cycle in
which the container starts cleanly, runs for tens of seconds, and is then
killed by the next disruption counts toward the budget exactly like a
genuine failure to start. After ~10 such cycles, the manager logs "failed
to restart after max attempts, giving up" and leaves the workload in the
error state.

I have observed this in my environment with OrbStack on macOS, where the
Docker socket periodically becomes unresponsive for several minutes at a
time. The container runtime emits errors like
Error response from daemon: handle request: read response: unexpected EOF
and
read unix ->/.../orbstack/run/docker.sock: use of closed network connection
across the affected window. Each of those connection drops kills the stdio
attach session, the container's PID 1 exits on stdin EOF, and ToolHive's
manager records a retry. After 10 such cycles, the manager gives up.
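
To make that chain concrete: a stdio-transport server's main loop blocks reading
stdin, so a severed attach connection surfaces as EOF and the process exits. The
snippet below is a generic illustration in Go (not ToolHive code and not any
particular MCP server), showing why the container stops as soon as the attach
stream drops.

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    // A stdio-transport server spends its life reading requests from stdin.
    scanner := bufio.NewScanner(os.Stdin)
    for scanner.Scan() {
        // Handle one request per line (protocol details elided in this sketch).
        fmt.Fprintln(os.Stdout, "received:", scanner.Text())
    }
    // When the attach session dies, stdin reaches EOF, the loop ends, and
    // PID 1 exits, so the container stops and the manager records a retry.
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, "stdin error:", err)
        os.Exit(1)
    }
}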

The bug is in the retry loop's accounting, not in any environment-specific
code path — the same loop runs for every workload regardless of runtime, so
any container runtime that exhibits sustained connection instability would
trigger the same outcome.

Steps to reproduce

This requires sustained disruption — a single container kill is not enough,
because the retry budget is sized to absorb that. The bug needs ~10 cycles
where the container starts successfully, runs briefly, and is killed before
the disturbance ends.

# Start a stdio MCP server
thv run --transport stdio --name test-recovery mcp/time:latest

# Wait for it to be running
until docker ps --filter name=test-recovery --format '{{.Names}}' | grep -q test-recovery; do
  sleep 1
done

# Simulate a sustained Docker daemon disruption: kill the container shortly
# after every restart. After ~10 cycles spanning roughly 8 minutes, the
# manager exhausts its retry budget and gives up.
while true; do
  while ! docker ps --filter name=test-recovery --format '{{.Names}}' 2>/dev/null | grep -q test-recovery; do
    sleep 1
  done
  sleep 5
  docker kill test-recovery 2>/dev/null
done

# After ~8 minutes, in another terminal:
grep -E "running as detached|workload exited unexpectedly|failed to restart" \
  ~/Library/Application\ Support/toolhive/logs/test-recovery.log | tail -25

The log will show the retry counter climbing from attempt 1/10 through
attempt 10/10, followed by "failed to restart after max attempts, giving up".

Expected behavior

A series of successful but short-lived restarts caused by an external
disturbance should not be treated the same as a tight crash loop in which
the workload never starts at all. Once the disturbance ends, the manager
should recover the workload.
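
One way to draw that distinction, sketched below with hypothetical names
(runOnce, healthyUptime, and errRestartNeeded standing in for
ErrContainerExitedRestartNeeded; the real RunWorkload in
pkg/workloads/manager.go is structured differently): reset the attempt counter
whenever the previous run stayed up past a healthy-uptime threshold, so only
consecutive fast failures consume the budget.

package workloads

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// errRestartNeeded stands in for ToolHive's ErrContainerExitedRestartNeeded.
var errRestartNeeded = errors.New("container exited, restart needed")

const (
    maxRetries    = 10
    healthyUptime = 2 * time.Minute // assumed threshold, not taken from the source
)

// runWithRetries sketches the proposed accounting. runOnce stands in for
// starting the container and blocking until it exits.
func runWithRetries(ctx context.Context, runOnce func(context.Context) error) error {
    attempts := 0
    for {
        started := time.Now()
        err := runOnce(ctx)
        if err == nil {
            return nil // clean shutdown requested
        }
        if !errors.Is(err, errRestartNeeded) {
            return err // unrecoverable error: do not retry
        }
        // The key change: an exit that comes after a healthy run refunds the
        // budget, so disruption-induced exits spread over many minutes no
        // longer look like a tight crash loop.
        if time.Since(started) >= healthyUptime {
            attempts = 0
        }
        attempts++
        if attempts > maxRetries {
            return fmt.Errorf("failed to restart after %d attempts, giving up", maxRetries)
        }
        // Simplified backoff; the real loop would keep its existing delay schedule.
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(time.Duration(attempts) * 5 * time.Second):
        }
    }
}

A sliding-window budget (e.g. at most 10 restarts within any 15-minute window)
would achieve the same effect without a hard uptime threshold; either variant
lets the manager survive a sustained disruption and recover once it ends.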

Actual behavior

After 10 attempts (each of which started the container successfully, only to
be interrupted by the next disruption), the manager logs "failed to restart
after max attempts, giving up", sets the workload status to error, and the
workload stays offline until it is manually restarted via thv restart or
equivalent.

Environment (if relevant)

  • OS/version: macOS Darwin 25.3.0; container runtime: OrbStack 2.1.0 / 2.1.1
  • ToolHive version: observed across multiple versions including v0.14.1 and v0.26.1.

Additional context

Production log excerpt from one occurrence (slack server, 2026-03-30,
disruption lasting ~24 minutes):

23:32:18 INFO  running as detached process pid=23599
23:32:48 WARN  workload exited unexpectedly, restarting attempt=2 maxRetries=10
23:32:59 INFO  running as detached process pid=23599
... 7 more cycles, each running for 10–30s before next disruption ...
23:55:00 INFO  running as detached process pid=23599
23:55:15 WARN  workload exited unexpectedly, restarting attempt=9 maxRetries=10
23:55:16 INFO  restart attempt attempt=10 delay=60s
23:56:17 INFO  running as detached process pid=23599
23:56:27 WARN  workload exited unexpectedly, restarting attempt=10 maxRetries=10
23:56:28 ERROR failed to restart after max attempts, giving up

Each "running as detached process" line indicates a successful start. The
container ran for tens of seconds before being killed by the next
disruption. The retry counter incremented across these successful starts
and was exhausted at attempt 10.

I have observed seven such failures in two months of ToolHive logs, across
five stdio servers (atlassian, gitlab, google-workspace, pagerduty, slack).
Each ended with "failed to restart after max attempts, giving up".

Root cause

In RunWorkload's retry loop (pkg/workloads/manager.go), the attempt
counter is incremented on every ErrContainerExitedRestartNeeded regardless
of how long the workload ran successfully before failing. There is no signal
distinguishing a failure inside a tight crash loop from a failure that came
after the workload had been running healthily.
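
In outline, the accounting described above corresponds to the pattern below
(a paraphrase using the same hypothetical names as the sketch under Expected
behavior, not the literal manager.go code):

attempts := 0
for {
    err := runOnce(ctx) // blocks until the container exits
    if err == nil {
        return nil
    }
    if !errors.Is(err, errRestartNeeded) {
        return err
    }
    attempts++ // incremented on every exit, however long the workload ran first
    if attempts > maxRetries {
        return fmt.Errorf("failed to restart after max attempts, giving up")
    }
    // back off, then loop around and restart the container
}

Because attempts only ever grows, the loop cannot tell a 25-minute external
disruption apart from a container that crashes immediately on every start.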

Metadata

Labels

bug (Something isn't working), cli (Changes that impact CLI functionality), go (Pull requests that update go code)
