bug: installer doesn't restart dream-host-agent.service after rewriting bin/dream-host-agent.py — leaves zombie reading deleted inode

## Bug Report: installer doesn't restart dream-host-agent.service after rewriting bin/dream-host-agent.py — leaves zombie reading deleted inode

**Severity:** Medium
**Category:** Installer / Lifecycle
**Platform:** Linux (systemd user services), Windows/WSL2 (same systemd path); macOS likely affected via launchd (not directly tested here)
**Confidence:** Confirmed on Linux/WSL2

### Description

After `bash install.sh --all --non-interactive` completes successfully, the running `dream-host-agent.service` is still the OLD process from before the install: same PID, same start time, with `/proc/<pid>/cwd → /home/rosenrot/dream-server (deleted)` confirming it's reading the unlinked previous-install inode. The newly-rewritten code on disk is unused until the user manually runs `systemctl --user restart dream-host-agent.service`.

This is **not caused by any specific PR in the current open stack** — it appears to be pre-existing behavior on `Light-Heart-Labs/DreamServer@c0600ca`. But it MASKS the runtime testability of every PR that touches host-agent routes, which currently includes #893, #900, #905, #906, #907, #908. Filing it as a standalone issue because it cost me ~30 minutes of confused QA before I caught it, and it will cost the same to anyone else doing runtime validation of host-agent PRs.

### Affected File(s)

- `dream-server/installers/phases/11-services.sh`: the phase that puts the host-agent binary in place; missing a service-restart step.
- `dream-server/installers/phases/12-finalize.sh`: alternative location for the restart hook.
- `~/.config/systemd/user/dream-host-agent.service`: the systemd user unit (already in place from a previous install on test systems).

### Root Cause

The installer copies the new `dream-host-agent.py` into `~/dream-server/bin/` during phase 11 (or wherever the service binary is laid down), but never invokes `systemctl --user restart dream-host-agent.service`. The service is registered as a long-lived daemon, so:

1. `sudo rm -rf ~/dream-server` unlinks the directory inode but doesn't kill the process — the kernel keeps the inode alive as long as a process holds an open reference (the running host-agent's CWD).
2. `bash install.sh --all --non-interactive` creates a NEW `~/dream-server` directory (new inode) at the same path, populated with the new code.
3. The systemd-managed host-agent process is unchanged. Its CWD still points to the OLD (deleted) inode, and it keeps executing the OLD code it loaded into memory at startup; nothing prompts it to re-read `bin/dream-host-agent.py` from the new directory.
4. New routes added by recent PRs are unreachable because the OLD process running OLD code doesn't know them.
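
The inode behavior in steps 1 to 3 can be reproduced without DreamServer at all. The following is a minimal Linux-only sketch, not installer code: `sleep` stands in for the host-agent and a temporary directory stands in for `~/dream-server`; both are stand-ins chosen for illustration.

```sh
#!/bin/sh
# Linux-only demo: a running process keeps its old working directory alive
# after the path is removed and recreated. Relies on /proc.
demo_root=$(mktemp -d)
mkdir "$demo_root/app"

# Start a long-lived child whose CWD is the soon-to-be-deleted directory.
# `exec` makes the sleep process reuse the subshell's PID, so $! is correct.
( cd "$demo_root/app" && exec sleep 30 ) &
child_pid=$!
sleep 1   # give the child time to chdir and exec

rm -rf "$demo_root/app"   # "uninstall": unlinks the directory
mkdir "$demo_root/app"    # "reinstall": same path, different inode

# The child still references the old, unlinked directory.
cwd_link=$(readlink "/proc/$child_pid/cwd")
echo "child cwd: $cwd_link"

# Clean up the stand-in process and the temporary directory.
kill "$child_pid" 2>/dev/null || true
wait "$child_pid" 2>/dev/null || true
rm -rf "$demo_root"
```

On a Linux system this should print a path ending in `(deleted)`, the same marker seen on the host-agent's `/proc/<pid>/cwd`.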

### Evidence

Immediately after a successful `bash install.sh --all --non-interactive` on a system with a previously-running host-agent:

```
$ systemctl --user status dream-host-agent.service
● dream-host-agent.service - DreamServer Host Agent
     Loaded: loaded (/home/rosenrot/.config/systemd/user/dream-host-agent.service; enabled; preset: enabled)
     Active: active (running) since Sat 2026-04-11 03:07:28 +03; 21h ago
   Main PID: 75759 (python3)
```

Note: 21h uptime — that's the PID from BEFORE the install, not a new one.

```
$ sudo ls -la /proc/75759/cwd
lrwxrwxrwx ... /proc/75759/cwd -> /home/rosenrot/dream-server (deleted)
```

The cwd link confirms it's holding a deleted inode. Any new host-agent route added by a recent PR is unreachable. Example with PR #893's cancel:

```
$ curl -X POST -H "Authorization: Bearer $KEY" \
    http://127.0.0.1:3002/api/models/download/cancel
{"detail":"Not found"}
```

Host-agent journal shows the OLD process returning 404 because it doesn't know the route:
```
"POST /v1/model/download/cancel HTTP/1.1" 404 -
```

After `systemctl --user restart dream-host-agent.service`:

```
$ ps -o pid,lstart,cmd -p $(pgrep -f dream-host-agent.py)
    PID                  STARTED CMD
 353285 Sun Apr 12 00:11 2026   /usr/bin/python3 /home/rosenrot/dream-server/bin/dream-host-agent.py

$ ls -la /proc/353285/cwd
lrwxrwxrwx ... /proc/353285/cwd -> /home/rosenrot/DreamServer    # ← new, no "(deleted)"

$ curl -X POST -H "Authorization: Bearer $KEY" \
    http://127.0.0.1:3002/api/models/download/cancel
{"status":"no_download"}                                          # ← 200, works

$ journalctl --user -u dream-host-agent.service -n 5
"POST /v1/model/download/cancel HTTP/1.1" 200 -                   # ← new route reachable
```

Same exact dashboard-api call. Difference: the daemon was restarted to pick up the new binary.

### Platform Analysis

- **Linux (systemd user services):** Confirmed affected. This is the primary install path for native Linux installs.
- **Windows/WSL2:** Confirmed affected — WSL2 inherits the systemd user-service path.
- **macOS (launchd):** Likely affected by the same class of bug. macOS uses `launchctl bootstrap`/`bootout` to manage the host-agent plist. PR #899 modifies the launchctl handling in `installers/macos/install-macos.sh`, but I could not directly verify the restart-after-rewrite behavior on Apple hardware. Worth checking: if the macOS installer also fails to restart the launchd unit after rewriting the binary, the same zombie pattern applies there.

### Reproduction

1. Have `dream-host-agent.service` (systemd user unit) already running from a previous install: `systemctl --user status dream-host-agent.service` shows active (running).
2. `sudo rm -rf ~/dream-server`
3. `bash install.sh --all --non-interactive` — install completes, reaches phase 13.
4. Check the host-agent service: `systemctl --user status dream-host-agent.service`. The PID is the SAME as before the install. Start time is the same.
5. `sudo ls -la /proc/$(pgrep -f dream-host-agent.py)/cwd` shows `... -> /home/rosenrot/dream-server (deleted)`.
6. Any new host-agent route is unreachable. Verify with `curl -X POST -H "Authorization: Bearer $DASHBOARD_API_KEY" http://127.0.0.1:3002/api/models/download/cancel`. With the OLD daemon, response is `{"detail":"Not found"}`. After `systemctl --user restart dream-host-agent.service`, response is `{"status":"no_download"}` (200).
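
Steps 4 and 5 can be folded into a quick pre-flight check for anyone doing runtime QA. This is a hedged sketch for Linux with systemd user services; the helper names `cwd_is_deleted` and `unit_is_stale` are hypothetical and do not exist in the DreamServer repo.

```sh
#!/bin/sh
# Returns 0 if the given PID's working directory has been unlinked.
# Linux-only: relies on the " (deleted)" suffix in /proc/<pid>/cwd.
cwd_is_deleted() {
    case "$(readlink "/proc/$1/cwd" 2>/dev/null)" in
        *" (deleted)") return 0 ;;
        *) return 1 ;;
    esac
}

# Returns 0 if the unit's main process is running from a deleted directory.
# Returns 1 if systemctl is unavailable or the unit has no main process.
unit_is_stale() {
    main_pid=$(systemctl --user show -p MainPID --value "$1" 2>/dev/null) || return 1
    [ -n "$main_pid" ] && [ "$main_pid" != "0" ] && cwd_is_deleted "$main_pid"
}

if unit_is_stale dream-host-agent.service; then
    echo "stale: run 'systemctl --user restart dream-host-agent.service'"
fi
```

The check only detects the deleted-directory case from this report; an in-place upgrade that overwrites files without removing `~/dream-server` would need a different signal, such as comparing the process start time with the file's modification time.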

### Impact

PRs #893 (cancel route), #905 (hook framework), #906 (compose-restart error surfacing), #907 (built-in extensions API), #908 (.env write via host agent), #900 (per-part SHA256 verify) all *appear* broken at runtime until the host-agent is manually restarted, even though their code is correctly merged on disk and the file at `~/dream-server/bin/dream-host-agent.py` is the new version.

For end users: every host-agent enhancement that lands in a release will be effectively dead code until they restart the service or reboot. For QA: every reviewer who tries to validate a host-agent PR at runtime will see the OLD behavior unless they know to restart the daemon explicitly.

### Suggested Approach

Add to `dream-server/installers/phases/11-services.sh` (after the binary is in place) or `dream-server/installers/phases/12-finalize.sh`:

```sh
# Restart the host-agent so it loads the new binary, not the inode of a deleted previous install.
if systemctl --user is-enabled dream-host-agent.service >/dev/null 2>&1; then
    log "Restarting dream-host-agent.service to load the new binary..."
    systemctl --user restart dream-host-agent.service \
        || warn "host-agent restart failed (non-fatal)"
elif command -v launchctl >/dev/null 2>&1 && [ -f "$HOME/Library/LaunchAgents/com.dreamserver.host-agent.plist" ]; then
    log "Reloading dream-host-agent launchd unit..."
    launchctl bootout "gui/$(id -u)" "$HOME/Library/LaunchAgents/com.dreamserver.host-agent.plist" 2>/dev/null || true
    launchctl bootstrap "gui/$(id -u)" "$HOME/Library/LaunchAgents/com.dreamserver.host-agent.plist" \
        || warn "host-agent launchd reload failed (non-fatal)"
fi
```

Make sure the restart happens **after** the binary is on disk, so the new process picks up the new code.

### Bonus observation

On host-agent restart the new process logs:

```
WARNING: Agent is listening on all interfaces. Set DREAM_AGENT_BIND=127.0.0.1 in .env to restrict.
```

i.e. the host-agent is on `0.0.0.0:7710` by default after a fresh install. The 0.0.0.0 bind is intentional (was deliberately moved off 127.0.0.1 to fix `Light-Heart-Labs/DreamServer#752` — host agent unreachable from Docker containers), but the warning suggests the installer isn't setting `DREAM_AGENT_BIND=127.0.0.1` in `.env` as the safer default for environments where the container reachability isn't required. Possibly worth filing as a separate "secure default" issue, or wiring up the install flow so users get a sane default and an opt-out for container reachability.
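
If the install flow were to write a default, one option is to set the key only when the user has not already chosen a value. The sketch below is an assumption about how that could look, not existing installer code: `set_env_default` is a hypothetical helper, and a temporary file stands in for the real `.env`. Note that defaulting to `127.0.0.1` would reintroduce the container-reachability problem from #752 for users who need it, so this would have to be paired with an opt-out.

```sh
#!/bin/sh
# Append KEY=VALUE to an env file only if KEY is not already set there,
# so an explicit user choice (e.g. 0.0.0.0 for container reachability) wins.
set_env_default() {
    env_file=$1
    key=$2
    value=$3
    touch "$env_file"
    if ! grep -q "^${key}=" "$env_file"; then
        printf '%s=%s\n' "$key" "$value" >> "$env_file"
    fi
}

# Demo against a temporary file standing in for the real .env.
demo_env=$(mktemp)
set_env_default "$demo_env" DREAM_AGENT_BIND 127.0.0.1
set_env_default "$demo_env" DREAM_AGENT_BIND 0.0.0.0   # no-op: key already present
cat "$demo_env"
```

The second call leaves the file unchanged, which is the property that lets a re-run of the installer preserve whatever bind address the user configured earlier.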

### Cross-references
- **#893**, **#900**, **#905**, **#906**, **#907**, **#908**: host-agent-touching PRs in the open stack, all of which appear broken at runtime until the manual restart workaround is applied.
- **`Light-Heart-Labs/DreamServer#752`** (closed): original reason the host-agent is on 0.0.0.0. Context for the bonus observation, not a regression.

---
*Filed during full-stack integration test of open PR stack #893–909 on `Light-Heart-Labs/DreamServer@c0600ca3`. Environment: WSL2 / Ubuntu 24.04 / systemd user-mode services / NVIDIA RTX 3070 Laptop / Tier 1.*

