Bug Report: _compose_restart_llama_server ignores subprocess return codes — silent model switch failures on native Linux/macOS
Severity: Medium
Category: Error Handling
Platform: Linux (systemd agent), macOS (launchd agent)
Confidence: Confirmed
Description
_compose_restart_llama_server() in dream-host-agent.py calls subprocess.run for docker compose stop, docker compose up, docker restart, and docker start without checking any return codes. If any Docker operation fails (e.g., permission issue, missing compose file, daemon error), the function returns None silently. The caller _do_model_activate then proceeds to the health-check loop, which times out after 5 minutes before rolling back — a very slow silent failure path.
Affected File(s)
dream-server/bin/dream-host-agent.py (L1444–1478, _compose_restart_llama_server)
Root Cause
All subprocess.run calls in the function use capture_output=True but do not inspect .returncode or raise on non-zero:
# dream-host-agent.py L1467-1471 (non-AMD path, compose available):
subprocess.run(["docker", "compose"] + compose_flags + ["stop", "llama-server"],
cwd=str(INSTALL_DIR), capture_output=True, timeout=120)
subprocess.run(["docker", "compose"] + compose_flags + ["up", "-d", "llama-server"],
cwd=str(INSTALL_DIR), capture_output=True, timeout=300)
# No returncode check — function returns None regardless
If compose stop fails, compose up still runs. If compose up fails, the model activation flow proceeds to a 5-minute health check loop before declaring failure and rolling back.
Platform Analysis
- macOS: Affected — this is the native-host path (
not _in_container). macOS Docker Desktop edge cases (daemon restart, resource pressure) can cause compose failures.
- Linux: Affected — primary path for systemd agent installs. Docker daemon errors (disk full, cgroup limits) would hit this path.
- Windows/WSL2: Not affected — Windows uses the
_in_container path (_recreate_llama_server), not this function.
Reproduction
- Run host agent natively on Linux (systemd) with a valid install.
- Temporarily make
docker compose up fail (e.g., set INSTALL_DIR to a bad path or break .compose-flags).
- Activate a model via
POST /v1/model/activate.
- Expected: immediate error response.
- Actual: agent waits full 5-minute health check loop, then rolls back with "Health check failed" — the actual Docker error is never surfaced in the response.
Impact
On native Linux/macOS installs, a Docker-layer failure during model switch produces a 5-minute hang instead of an immediate error. The original model may or may not be running after rollback. Operators see "health check failed" with no indication of the real cause (Docker error output is silently discarded).
Suggested Approach
Check .returncode after each subprocess.run call and return an error (or raise) immediately if non-zero. The stderr from the failed command should be included in the error detail passed back to _do_model_activate so the operator can diagnose the Docker-layer failure without reading logs.
Filed by automated Python auditor after full-sweep review of Python changes merged 2026-04-06 → 2026-04-11 on upstream/main @ c0600ca.
Bug Report: _compose_restart_llama_server ignores subprocess return codes — silent model switch failures on native Linux/macOS
Severity: Medium
Category: Error Handling
Platform: Linux (systemd agent), macOS (launchd agent)
Confidence: Confirmed
Description
_compose_restart_llama_server()indream-host-agent.pycallssubprocess.runfordocker compose stop,docker compose up,docker restart, anddocker startwithout checking any return codes. If any Docker operation fails (e.g., permission issue, missing compose file, daemon error), the function returnsNonesilently. The caller_do_model_activatethen proceeds to the health-check loop, which times out after 5 minutes before rolling back — a very slow silent failure path.Affected File(s)
dream-server/bin/dream-host-agent.py(L1444–1478,_compose_restart_llama_server)Root Cause
All
subprocess.runcalls in the function usecapture_output=Truebut do not inspect.returncodeor raise on non-zero:If
compose stopfails,compose upstill runs. Ifcompose upfails, the model activation flow proceeds to a 5-minute health check loop before declaring failure and rolling back.Platform Analysis
not _in_container). macOS Docker Desktop edge cases (daemon restart, resource pressure) can cause compose failures._in_containerpath (_recreate_llama_server), not this function.Reproduction
docker compose upfail (e.g., setINSTALL_DIRto a bad path or break.compose-flags).POST /v1/model/activate.Impact
On native Linux/macOS installs, a Docker-layer failure during model switch produces a 5-minute hang instead of an immediate error. The original model may or may not be running after rollback. Operators see "health check failed" with no indication of the real cause (Docker error output is silently discarded).
Suggested Approach
Check
.returncodeafter eachsubprocess.runcall and return an error (or raise) immediately if non-zero. The stderr from the failed command should be included in the error detail passed back to_do_model_activateso the operator can diagnose the Docker-layer failure without reading logs.Filed by automated Python auditor after full-sweep review of Python changes merged 2026-04-06 → 2026-04-11 on upstream/main @ c0600ca.