Stdio MCP server gives up after sustained Docker daemon disruption #5171

@gkatz2

Bug description

RunWorkload's retry loop (in pkg/workloads/manager.go) can exhaust its
10-attempt retry budget over the course of 10–25 minutes when the Docker
daemon is repeatedly unresponsive, and permanently stop trying to keep a
stdio workload alive, even though every individual restart cycle ToolHive
attempted actually succeeded in starting the container. The user-facing
symptom: a previously healthy MCP server goes offline and stays offline
until it is manually restarted.

The retry counter does not reset between successful runs. Every cycle in
which the container starts cleanly, runs for tens of seconds, and is then
killed by the next disruption counts toward the budget exactly like a
genuine failure to start. After ~10 such cycles, the manager logs "failed
to restart after max attempts, giving up" and leaves the workload in the
error state.

I have observed this in my environment with OrbStack on macOS, where the
Docker socket periodically becomes unresponsive for several minutes at a
time. The container runtime emits errors like
Error response from daemon: handle request: read response: unexpected EOF
and
read unix ->/.../orbstack/run/docker.sock: use of closed network connection
across the affected window. Each of those connection drops kills the stdio
attach session, the container's PID 1 exits on stdin EOF, and ToolHive's
manager records a retry. After 10 such cycles, the manager gives up.
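
To make that chain concrete: a stdio-transport server's main loop blocks reading
stdin, so a severed attach connection surfaces as EOF and the process exits. The
snippet below is a generic illustration in Go (not ToolHive code and not any
particular MCP server), showing why the container stops as soon as the attach
stream drops.

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    // A stdio-transport server spends its life reading requests from stdin.
    scanner := bufio.NewScanner(os.Stdin)
    for scanner.Scan() {
        // Handle one request per line (protocol details elided in this sketch).
        fmt.Fprintln(os.Stdout, "received:", scanner.Text())
    }
    // When the attach session dies, stdin reaches EOF, the loop ends, and
    // PID 1 exits, so the container stops and the manager records a retry.
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, "stdin error:", err)
        os.Exit(1)
    }
}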

The bug is in the retry loop's accounting, not in any environment-specific
code path — the same loop runs for every workload regardless of runtime, so
any container runtime that exhibits sustained connection instability would
trigger the same outcome.

Steps to reproduce

This requires sustained disruption — a single container kill is not enough,
because the retry budget is sized to absorb that. The bug needs ~10 cycles
where the container starts successfully, runs briefly, and is killed before
the disturbance ends.

# Start a stdio MCP server
thv run --transport stdio --name test-recovery mcp/time:latest

# Wait for it to be running
until docker ps --filter name=test-recovery --format '{{.Names}}' | grep -q test-recovery; do
  sleep 1
done

# Simulate a sustained Docker daemon disruption: kill the container shortly
# after every restart. After ~10 cycles spanning roughly 8 minutes, the
# manager exhausts its retry budget and gives up.
while true; do
  while ! docker ps --filter name=test-recovery --format '{{.Names}}' 2>/dev/null | grep -q test-recovery; do
    sleep 1
  done
  sleep 5
  docker kill test-recovery 2>/dev/null
done

# After ~8 minutes, in another terminal:
grep -E "running as detached|workload exited unexpectedly|failed to restart" \
  ~/Library/Application\ Support/toolhive/logs/test-recovery.log | tail -25

The log will show the retry counter climbing from attempt 1/10 through
attempt 10/10, followed by "failed to restart after max attempts, giving up".

Expected behavior

A series of successful but short-lived restarts caused by an external
disturbance should not be treated the same as a tight crash loop in which
the workload never starts at all. Once the disturbance ends, the manager
should recover the workload.
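
One way to draw that distinction, sketched below with hypothetical names
(runOnce, healthyUptime, and errRestartNeeded standing in for
ErrContainerExitedRestartNeeded; the real RunWorkload in
pkg/workloads/manager.go is structured differently): reset the attempt counter
whenever the previous run stayed up past a healthy-uptime threshold, so only
consecutive fast failures consume the budget.

package workloads

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// errRestartNeeded stands in for ToolHive's ErrContainerExitedRestartNeeded.
var errRestartNeeded = errors.New("container exited, restart needed")

const (
    maxRetries    = 10
    healthyUptime = 2 * time.Minute // assumed threshold, not taken from the source
)

// runWithRetries sketches the proposed accounting. runOnce stands in for
// starting the container and blocking until it exits.
func runWithRetries(ctx context.Context, runOnce func(context.Context) error) error {
    attempts := 0
    for {
        started := time.Now()
        err := runOnce(ctx)
        if err == nil {
            return nil // clean shutdown requested
        }
        if !errors.Is(err, errRestartNeeded) {
            return err // unrecoverable error: do not retry
        }
        // The key change: an exit that comes after a healthy run refunds the
        // budget, so disruption-induced exits spread over many minutes no
        // longer look like a tight crash loop.
        if time.Since(started) >= healthyUptime {
            attempts = 0
        }
        attempts++
        if attempts > maxRetries {
            return fmt.Errorf("failed to restart after %d attempts, giving up", maxRetries)
        }
        // Simplified backoff; the real loop would keep its existing delay schedule.
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(time.Duration(attempts) * 5 * time.Second):
        }
    }
}

A sliding-window budget (e.g. at most 10 restarts within any 15-minute window)
would achieve the same effect without a hard uptime threshold; either variant
lets the manager survive a sustained disruption and recover once it ends.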

Actual behavior

After 10 attempts (each of which started the container successfully, only to
be interrupted by the next disruption), the manager logs "failed to restart
after max attempts, giving up", sets the workload status to error, and the
workload stays offline until it is manually restarted via thv restart or
equivalent.

Environment (if relevant)

  • OS/version: macOS Darwin 25.3.0; container runtime: OrbStack 2.1.0 / 2.1.1
  • ToolHive version: observed across multiple versions including v0.14.1 and v0.26.1.

Additional context

Production log excerpt from one occurrence (slack server, 2026-03-30,
disruption lasting ~24 minutes):

23:32:18 INFO  running as detached process pid=23599
23:32:48 WARN  workload exited unexpectedly, restarting attempt=2 maxRetries=10
23:32:59 INFO  running as detached process pid=23599
... 7 more cycles, each running for 10–30s before next disruption ...
23:55:00 INFO  running as detached process pid=23599
23:55:15 WARN  workload exited unexpectedly, restarting attempt=9 maxRetries=10
23:55:16 INFO  restart attempt attempt=10 delay=60s
23:56:17 INFO  running as detached process pid=23599
23:56:27 WARN  workload exited unexpectedly, restarting attempt=10 maxRetries=10
23:56:28 ERROR failed to restart after max attempts, giving up

Each "running as detached process" line indicates a successful start. The
container ran for tens of seconds before being killed by the next
disruption. The retry counter incremented across these successful starts
and was exhausted at attempt 10.

I have observed seven such failures in two months of ToolHive logs, across
five stdio servers (atlassian, gitlab, google-workspace, pagerduty, slack).
Each ended with "failed to restart after max attempts, giving up".

Root cause

In RunWorkload's retry loop (pkg/workloads/manager.go), the attempt
counter is incremented on every ErrContainerExitedRestartNeeded regardless
of how long the workload ran successfully before failing. There is no signal
distinguishing a failure inside a tight crash loop from a failure that came
after the workload had been running healthily.
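
In outline, the accounting described above corresponds to the pattern below
(a paraphrase using the same hypothetical names as the sketch under Expected
behavior, not the literal manager.go code):

attempts := 0
for {
    err := runOnce(ctx) // blocks until the container exits
    if err == nil {
        return nil
    }
    if !errors.Is(err, errRestartNeeded) {
        return err
    }
    attempts++ // incremented on every exit, however long the workload ran first
    if attempts > maxRetries {
        return fmt.Errorf("failed to restart after max attempts, giving up")
    }
    // back off, then loop around and restart the container
}

Because attempts only ever grows, the loop cannot tell a 25-minute external
disruption apart from a container that crashes immediately on every start.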

Metadata

Labels

bug (Something isn't working), cli (Changes that impact CLI functionality), go (Pull requests that update go code)
