Skip to content

bug: _compose_restart_llama_server ignores subprocess return codes — silent model switch failures on native Linux/macOS #314

@yasinBursali

Description

@yasinBursali

Bug Report: _compose_restart_llama_server ignores subprocess return codes — silent model switch failures on native Linux/macOS

Severity: Medium
Category: Error Handling
Platform: Linux (systemd agent), macOS (launchd agent)
Confidence: Confirmed

Description

_compose_restart_llama_server() in dream-host-agent.py calls subprocess.run for docker compose stop, docker compose up, docker restart, and docker start without checking any return codes. If any Docker operation fails (e.g., permission issue, missing compose file, daemon error), the function returns None silently. The caller _do_model_activate then proceeds to the health-check loop, which times out after 5 minutes before rolling back — a very slow silent failure path.

Affected File(s)

  • dream-server/bin/dream-host-agent.py (L1444–1478, _compose_restart_llama_server)

Root Cause

All subprocess.run calls in the function use capture_output=True but do not inspect .returncode or raise on non-zero:

# dream-host-agent.py L1467-1471 (non-AMD path, compose available):
subprocess.run(["docker", "compose"] + compose_flags + ["stop", "llama-server"],
               cwd=str(INSTALL_DIR), capture_output=True, timeout=120)
subprocess.run(["docker", "compose"] + compose_flags + ["up", "-d", "llama-server"],
               cwd=str(INSTALL_DIR), capture_output=True, timeout=300)
# No returncode check — function returns None regardless

If compose stop fails, compose up still runs. If compose up fails, the model activation flow proceeds to a 5-minute health check loop before declaring failure and rolling back.

Platform Analysis

  • macOS: Affected — this is the native-host path (not _in_container). macOS Docker Desktop edge cases (daemon restart, resource pressure) can cause compose failures.
  • Linux: Affected — primary path for systemd agent installs. Docker daemon errors (disk full, cgroup limits) would hit this path.
  • Windows/WSL2: Not affected — Windows uses the _in_container path (_recreate_llama_server), not this function.

Reproduction

  1. Run host agent natively on Linux (systemd) with a valid install.
  2. Temporarily make docker compose up fail (e.g., set INSTALL_DIR to a bad path or break .compose-flags).
  3. Activate a model via POST /v1/model/activate.
  4. Expected: immediate error response.
  5. Actual: agent waits full 5-minute health check loop, then rolls back with "Health check failed" — the actual Docker error is never surfaced in the response.

Impact

On native Linux/macOS installs, a Docker-layer failure during model switch produces a 5-minute hang instead of an immediate error. The original model may or may not be running after rollback. Operators see "health check failed" with no indication of the real cause (Docker error output is silently discarded).

Suggested Approach

Check .returncode after each subprocess.run call and return an error (or raise) immediately if non-zero. The stderr from the failed command should be included in the error detail passed back to _do_model_activate so the operator can diagnose the Docker-layer failure without reading logs.


Filed by automated Python auditor after full-sweep review of Python changes merged 2026-04-06 → 2026-04-11 on upstream/main @ c0600ca.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions