Skip to content

fix(host-agent): surface docker failures in _compose_restart_llama_server#906

Merged
Lightheartdevs merged 1 commit intoLight-Heart-Labs:mainfrom
yasinBursali:fix/compose-restart-returncode-checks
Apr 18, 2026
Merged

fix(host-agent): surface docker failures in _compose_restart_llama_server#906
Lightheartdevs merged 1 commit intoLight-Heart-Labs:mainfrom
yasinBursali:fix/compose-restart-returncode-checks

Conversation

@yasinBursali
Copy link
Copy Markdown
Contributor

@yasinBursali yasinBursali commented Apr 11, 2026

Merge order: Merge before #905, #900, #908, #902, and #893 — all modify dream-host-agent.py.

What

_compose_restart_llama_server now checks return codes on every subprocess.run for docker commands and raises RuntimeError with the captured stderr on non-zero exit.

Why

All six docker calls in the function (compose stop, compose up, compose restart, docker restart, docker stop, docker start) previously ran without inspecting returncode. If any failed — daemon hiccup, permission error, missing compose file, disk full — the function returned silently. _do_model_activate then proceeded into a 5-minute health-check polling loop and eventually reported the generic "Health check failed — rolled back" with no indication of the real cause. Operators saw a slow failure with no actionable error.

How

Introduce a nested _run(argv, timeout) helper inside the function that calls subprocess.run with text=True and raises RuntimeError(f"{cmd} failed (exit N): {stderr[:300]}") on non-zero. Route all six docker calls through it. Update the docstring to document the raising contract.

The caller _do_model_activate already wraps its body in except Exception as exc and returns a 500 with the error message, so the real docker failure now surfaces immediately instead of the 5-minute health-check stall.

Testing

  • `python3 -m py_compile dream-server/bin/dream-host-agent.py` → clean
  • Manual test needed: trigger a model activate on a native-host install (Linux/macOS) while `docker compose up` is guaranteed to fail — e.g., temporarily `chmod 000 .env` or stop the docker daemon. Expected: immediate 500 with docker error in response, not a 5-minute wait before "Health check failed".
  • Rollback path (L1382) also calls this function — if rollback fails, the outer `except Exception` still catches the RuntimeError and returns 500. Acceptable per project's "Let It Crash" philosophy.

Platform Impact

  • macOS: affected — this is the native-host code path (launchd agent).
  • Linux: affected — native-host code path (systemd agent).
  • Windows/WSL2: not affected — WSL2 uses `_recreate_llama_server` which is gated on `_in_container` at `_do_model_activate:1322` and has its own returncode handling.

Review notes

  • Only the function body changes. No callers modified, no imports added.
  • No test coverage pre-existed for this function; a manual test is the only validation for the non-zero-return path.

…rver

_compose_restart_llama_server called subprocess.run four+ times for
docker commands (compose stop/up, docker restart/stop/start) without
inspecting returncode. Docker-layer failures (permission denied, missing
compose file, daemon errors) were silently swallowed: _do_model_activate
proceeded into the 5-minute health-check polling loop and only reported
a generic "Health check failed — rolled back" with no indication of the
real cause.

Route all docker calls through a nested _run helper that captures
stderr, checks returncode, and raises RuntimeError with the failing
command + stderr tail on non-zero. The caller at _do_model_activate
already wraps the path in `except Exception` and will now surface the
docker error immediately. Native-host path only — Windows/WSL2 uses
_recreate_llama_server which has its own returncode handling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@Lightheartdevs Lightheartdevs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aligns with CLAUDE.md's Let-It-Crash rule. Previously _compose_restart_llama_server used subprocess.run(...) with capture_output=True but discarded the result — a docker-layer failure (daemon down, image missing, compose flags stale) would silently succeed, and the caller only found out via the health-check timeout 30+ seconds later.

The _run() helper:

  • Checks returncode != 0 and raises RuntimeError with the first 300 chars of stderr.
  • Truncates stderr to a sane size so log lines stay readable.
  • Preserves timeout per subprocess call.

Downstream _do_model_activate now surfaces the error immediately instead of waiting for health checks to time out. Good UX improvement.

Minor: the ' '.join(argv[:3]) in the error message only shows the first three args (docker compose -f), which is fine for identification. Could include the operation (restart/up/stop) for clearer messages, but not blocking.

Ship.

@Lightheartdevs Lightheartdevs merged commit 7133a06 into Light-Heart-Labs:main Apr 18, 2026
33 of 34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants