
bug: installer doesn't restart dream-host-agent.service after rewriting bin/dream-host-agent.py — leaves zombie reading deleted inode #334

@yasinBursali

Bug Report: installer doesn't restart dream-host-agent.service after rewriting bin/dream-host-agent.py — leaves zombie reading deleted inode

Severity: Medium
Category: Installer / Lifecycle
Platform: Linux (systemd user services), Windows/WSL2 (same systemd path); macOS likely affected via launchd (not directly tested here)
Confidence: Confirmed on Linux/WSL2

Description

After bash install.sh --all --non-interactive completes successfully, the running dream-host-agent.service is still the OLD process from before the install: same PID, same start time, with /proc/<pid>/cwd → /home/rosenrot/dream-server (deleted) confirming it's reading the unlinked previous-install inode. The newly-rewritten code on disk is unused until the user manually runs systemctl --user restart dream-host-agent.service.

This is not caused by any specific PR in the current open stack; it appears to be pre-existing behavior on Light-Heart-Labs/DreamServer@c0600ca. But it MASKS the runtime behavior of every PR that touches host-agent routes, so none of them can be validated at runtime without a manual restart. That currently includes Light-Heart-Labs#893, Light-Heart-Labs#900, Light-Heart-Labs#905, Light-Heart-Labs#906, Light-Heart-Labs#907, Light-Heart-Labs#908. Filing it as a standalone issue because it cost me ~30 minutes of confused QA before I caught it, and it will cost the same to anyone else doing runtime validation of host-agent PRs.

Affected File(s)

  • dream-server/installers/phases/11-services.sh — the phase that puts the host-agent binary in place; missing a service-restart step.
  • dream-server/installers/phases/12-finalize.sh — alternative location for the restart hook.
  • ~/.config/systemd/user/dream-host-agent.service — the systemd user unit (already in place from a previous install on test systems).

Root Cause

The installer copies the new dream-host-agent.py into ~/dream-server/bin/ during phase 11 (or wherever the service binary is laid down), but never invokes systemctl --user restart dream-host-agent.service. The service is registered as a long-lived daemon, so:

  1. sudo rm -rf ~/dream-server unlinks the directory inode but doesn't kill the process — the kernel keeps the inode alive as long as a process holds an open reference (the running host-agent's CWD).
  2. bash install.sh --all --non-interactive creates a NEW ~/dream-server directory (new inode) at the same path, populated with the new code.
  3. The systemd-managed host-agent process is unchanged. Its CWD still points to the OLD (deleted) inode, and the code it executes is the OLD bin/dream-host-agent.py, which the Python interpreter loaded into memory when the process started.
  4. New routes added by recent PRs are unreachable because the OLD process running OLD code doesn't know them.
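
The mechanism in steps 1 to 3 can be reproduced without DreamServer. The following is a minimal Linux-only sketch, assuming a /proc filesystem and GNU coreutils; the temporary directory, the app subdirectory, and the sleep process standing in for the host-agent are all illustrative, not part of the installer. It prints the stand-in daemon's CWD after a delete-and-recreate, which should end in " (deleted)" just like the host-agent's.

```shell
#!/usr/bin/env bash
# Linux-only demo: a process keeps its deleted working directory alive.
set -eu

demo_root=$(mktemp -d)
mkdir "$demo_root/app"

# Start a long-lived stand-in "daemon" whose CWD is the app directory.
( cd "$demo_root/app" && exec sleep 30 ) &
daemon_pid=$!
sleep 0.2   # give the subshell time to chdir and exec

# "Reinstall": delete the directory and recreate it at the same path.
rm -rf "$demo_root/app"
mkdir "$demo_root/app"

# The daemon's CWD still names the old, unlinked directory.
stale_cwd=$(readlink "/proc/$daemon_pid/cwd")
echo "$stale_cwd"

# Clean up the stand-in daemon and the scratch directory.
kill "$daemon_pid"
wait "$daemon_pid" 2>/dev/null || true
rm -rf "$demo_root"
```

The recreated directory at the same path is a different inode, which is why the running process never sees the new contents.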

Evidence

Immediately after a successful bash install.sh --all --non-interactive on a system with a previously-running host-agent:

$ systemctl --user status dream-host-agent.service
● dream-host-agent.service - DreamServer Host Agent
     Loaded: loaded (/home/rosenrot/.config/systemd/user/dream-host-agent.service; enabled; preset: enabled)
     Active: active (running) since Sat 2026-04-11 03:07:28 +03; 21h ago
   Main PID: 75759 (python3)

Note: 21h uptime — that's the PID from BEFORE the install, not a new one.

$ sudo ls -la /proc/75759/cwd
lrwxrwxrwx ... /proc/75759/cwd -> /home/rosenrot/dream-server (deleted)

The cwd link confirms the process is holding a deleted inode. Any new host-agent route added by a recent PR is unreachable. Example with the cancel route from PR Light-Heart-Labs#893:

$ curl -X POST -H "Authorization: Bearer $KEY" \
    http://127.0.0.1:3002/api/models/download/cancel
{"detail":"Not found"}

Host-agent journal shows the OLD process returning 404 because it doesn't know the route:

"POST /v1/model/download/cancel HTTP/1.1" 404 -

After systemctl --user restart dream-host-agent.service:

$ ps -o pid,lstart,cmd -p $(pgrep -f dream-host-agent.py)
    PID                  STARTED CMD
 353285 Sun Apr 12 00:11 2026   /usr/bin/python3 /home/rosenrot/dream-server/bin/dream-host-agent.py

$ ls -la /proc/353285/cwd
lrwxrwxrwx ... /proc/353285/cwd -> /home/rosenrot/DreamServer    # ← new, no "(deleted)"

$ curl -X POST -H "Authorization: Bearer $KEY" \
    http://127.0.0.1:3002/api/models/download/cancel
{"status":"no_download"}                                          # ← 200, works

$ journalctl --user -u dream-host-agent.service -n 5
"POST /v1/model/download/cancel HTTP/1.1" 200 -                   # ← new route reachable

Same exact dashboard-api call. Difference: the daemon was restarted to pick up the new binary.

Platform Analysis

  • Linux (systemd user services): Confirmed affected. This is the primary install path for native Linux installs.
  • Windows/WSL2: Confirmed affected — WSL2 inherits the systemd user-service path.
  • macOS (launchd): Likely affected by the same class of bug. macOS uses launchctl bootstrap/bootout to manage the host-agent plist. PR Light-Heart-Labs/DreamServer#899 (fix(macos-installer): dynamic launchd PATH + extensions-lib source) modifies installers/macos/install-macos.sh's launchctl handling, but I could not directly verify the restart-after-rewrite behavior on Apple hardware. Worth checking: if the macOS installer also fails to restart the launchd unit after rewriting the binary, the same zombie pattern applies there.

Reproduction

  1. Have dream-host-agent.service (systemd user unit) already running from a previous install: systemctl --user status dream-host-agent.service shows active (running).
  2. sudo rm -rf ~/dream-server
  3. bash install.sh --all --non-interactive — install completes, reaches phase 13.
  4. Check the host-agent service: systemctl --user status dream-host-agent.service. The PID is the SAME as before the install. Start time is the same.
  5. sudo ls -la /proc/$(pgrep -f dream-host-agent.py)/cwd shows ... -> /home/rosenrot/dream-server (deleted).
  6. Any new host-agent route is unreachable. Verify with curl -X POST -H "Authorization: Bearer $DASHBOARD_API_KEY" http://127.0.0.1:3002/api/models/download/cancel. With the OLD daemon, response is {"detail":"Not found"}. After systemctl --user restart dream-host-agent.service, response is {"status":"no_download"} (200).
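
Step 4 can be checked mechanically instead of by eye. This is a rough sketch, assuming systemd's MainPID and ActiveEnterTimestamp unit properties are enough to tell two process instances apart; snapshot_unit and unit_was_restarted are hypothetical helper names, not existing installer functions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: record a user unit's main PID and activation time.
snapshot_unit() {
    systemctl --user show "$1" --property=MainPID --property=ActiveEnterTimestamp
}

# Hypothetical helper: succeeds (exit 0) only if the two snapshots differ,
# i.e. the unit was restarted between them.
unit_was_restarted() {
    [ "$1" != "$2" ]
}

# Usage around an install run:
#   before=$(snapshot_unit dream-host-agent.service)
#   bash install.sh --all --non-interactive
#   after=$(snapshot_unit dream-host-agent.service)
#   unit_was_restarted "$before" "$after" \
#       || echo "host-agent was NOT restarted by the installer" >&2
```

With the bug present, both snapshots are identical and the message is printed.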

Impact

PRs Light-Heart-Labs#893 (cancel route), Light-Heart-Labs#905 (hook framework), Light-Heart-Labs#906 (compose-restart error surfacing), Light-Heart-Labs#907 (built-in extensions API), Light-Heart-Labs#908 (.env write via host agent), Light-Heart-Labs#900 (per-part SHA256 verify) all appear broken at runtime until the host-agent is manually restarted, even though their code is correctly merged on disk and the file at ~/dream-server/bin/dream-host-agent.py is the new version.

For end users: every host-agent enhancement that lands in a release will be effectively dead code until they restart the service or reboot. For QA: every reviewer who tries to validate a host-agent PR at runtime will see the OLD behavior unless they know to restart the daemon explicitly.

Suggested Approach

Add to dream-server/installers/phases/11-services.sh (after the binary is in place) or dream-server/installers/phases/12-finalize.sh:

# Restart the host-agent so it loads the new binary, not the inode of a deleted previous install.
if systemctl --user is-enabled dream-host-agent.service >/dev/null 2>&1; then
    log "Restarting dream-host-agent.service to load the new binary..."
    systemctl --user restart dream-host-agent.service \
        || warn "host-agent restart failed (non-fatal)"
elif command -v launchctl >/dev/null 2>&1 && [ -f "$HOME/Library/LaunchAgents/com.dreamserver.host-agent.plist" ]; then
    log "Reloading dream-host-agent launchd unit..."
    launchctl bootout "gui/$(id -u)" "$HOME/Library/LaunchAgents/com.dreamserver.host-agent.plist" 2>/dev/null || true
    launchctl bootstrap "gui/$(id -u)" "$HOME/Library/LaunchAgents/com.dreamserver.host-agent.plist" \
        || warn "host-agent launchd reload failed (non-fatal)"
fi

Make sure the restart happens after the binary is on disk, so the new process picks up the new code.
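
To confirm that the restart actually left a running unit, the installer could poll the unit state afterwards. A minimal sketch for the systemd path only; wait_for_active is a hypothetical helper, and the retry count and one-second interval are arbitrary assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll until a systemd user unit reports "active".
# Returns 0 once active, 1 if it is still not active after TRIES attempts.
wait_for_active() {
    local unit=$1 tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        if systemctl --user is-active --quiet "$unit"; then
            return 0
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Usage after the restart step (warn is the helper used in the snippet above):
#   wait_for_active dream-host-agent.service 10 \
#       || warn "host-agent did not become active after restart"
```

Keeping the failure non-fatal matches the snippet above, where a failed restart only produces a warning.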

Bonus observation

On host-agent restart the new process logs:

WARNING: Agent is listening on all interfaces. Set DREAM_AGENT_BIND=127.0.0.1 in .env to restrict.

i.e. the host-agent is on 0.0.0.0:7710 by default after a fresh install. The 0.0.0.0 bind is intentional (was deliberately moved off 127.0.0.1 to fix Light-Heart-Labs/DreamServer#752 — host agent unreachable from Docker containers), but the warning suggests the installer isn't setting DREAM_AGENT_BIND=127.0.0.1 in .env as the safer default for environments where the container reachability isn't required. Possibly worth filing as a separate "secure default" issue, or wiring up the install flow so users get a sane default and an opt-out for container reachability.
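
If a loopback default were adopted, the install flow could write it only when the user has not already chosen a value. A rough sketch under that assumption; ensure_env_default is a hypothetical helper, and the .env path in the usage comment is a guess at where the installer keeps it. Whether 127.0.0.1 is the right default is an open question, given the container-reachability fix mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append KEY=VALUE to an env file only if KEY is not
# already set there, so an existing user choice is never overwritten.
ensure_env_default() {
    local env_file=$1 key=$2 value=$3
    touch "$env_file"
    if grep -q "^${key}=" "$env_file"; then
        return 0
    fi
    # Add a newline first if the file does not already end with one.
    if [ -s "$env_file" ] && [ -n "$(tail -c1 "$env_file")" ]; then
        printf '\n' >> "$env_file"
    fi
    printf '%s=%s\n' "$key" "$value" >> "$env_file"
}

# Usage (the path is an assumption):
#   ensure_env_default "$HOME/dream-server/.env" DREAM_AGENT_BIND 127.0.0.1
```

Users who need the agent reachable from Docker containers would then opt out by setting DREAM_AGENT_BIND themselves before or after the install.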

Cross-references


Filed during full-stack integration test of open PR stack Light-Heart-Labs#893–909 on Light-Heart-Labs/DreamServer@c0600ca3. Environment: WSL2 / Ubuntu 24.04 / systemd user-mode services / NVIDIA RTX 3070 Laptop / Tier 1.


Labels: bug (Something isn't working), installer (Installer issues)
