fix(host-agent): surface docker failures in _compose_restart_llama_server#906
Conversation
…rver _compose_restart_llama_server called subprocess.run four+ times for docker commands (compose stop/up, docker restart/stop/start) without inspecting returncode. Docker-layer failures (permission denied, missing compose file, daemon errors) were silently swallowed: _do_model_activate proceeded into the 5-minute health-check polling loop and only reported a generic "Health check failed — rolled back" with no indication of the real cause. Route all docker calls through a nested _run helper that captures stderr, checks returncode, and raises RuntimeError with the failing command + stderr tail on non-zero. The caller at _do_model_activate already wraps the path in `except Exception` and will now surface the docker error immediately. Native-host path only — Windows/WSL2 uses _recreate_llama_server which has its own returncode handling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
693d580 to
d695fe5
Compare
Lightheartdevs
left a comment
There was a problem hiding this comment.
Aligns with CLAUDE.md's Let-It-Crash rule. Previously _compose_restart_llama_server used subprocess.run(...) with capture_output=True but discarded the result — a docker-layer failure (daemon down, image missing, compose flags stale) would silently succeed, and the caller only found out via the health-check timeout 30+ seconds later.
The _run() helper:
- Checks
returncode != 0and raisesRuntimeErrorwith the first 300 chars of stderr. - Truncates stderr to a sane size so log lines stay readable.
- Preserves
timeoutper subprocess call.
Downstream _do_model_activate now surfaces the error immediately instead of waiting for health checks to time out. Good UX improvement.
Minor: the ' '.join(argv[:3]) in the error message only shows the first three args (docker compose -f), which is fine for identification. Could include the operation (restart/up/stop) for clearer messages, but not blocking.
Ship.
What
_compose_restart_llama_servernow checks return codes on everysubprocess.runfor docker commands and raisesRuntimeErrorwith the captured stderr on non-zero exit.Why
All six docker calls in the function (
compose stop,compose up,compose restart,docker restart,docker stop,docker start) previously ran without inspecting returncode. If any failed — daemon hiccup, permission error, missing compose file, disk full — the function returned silently._do_model_activatethen proceeded into a 5-minute health-check polling loop and eventually reported the generic "Health check failed — rolled back" with no indication of the real cause. Operators saw a slow failure with no actionable error.How
Introduce a nested
_run(argv, timeout)helper inside the function that callssubprocess.runwithtext=Trueand raisesRuntimeError(f"{cmd} failed (exit N): {stderr[:300]}")on non-zero. Route all six docker calls through it. Update the docstring to document the raising contract.The caller
_do_model_activatealready wraps its body inexcept Exception as excand returns a 500 with the error message, so the real docker failure now surfaces immediately instead of the 5-minute health-check stall.Testing
Platform Impact
Review notes