fix(dream-cli): pipefail + surface LLM failures + exit-code contract#998
fix(dream-cli): pipefail + surface LLM failures + exit-code contract#998yasinBursali wants to merge 5 commits intoLight-Heart-Labs:mainfrom
Conversation
|
Audit follow-up: keep draft / rebase on current The focused pipefail guard from #1008 is now merged. This draft still needs to absorb or preserve that fix: |
Promote shell option from set -e to set -eo pipefail so failures in
upstream pipeline commands propagate and surface the real exit code
(not the status of the last stage). Aligns dream-cli with the project
standard (CLAUDE.md: 'set -euo pipefail everywhere').
Scope is deliberately narrow — nounset (-u) is intentionally not added
here because it requires a full bare-variable audit.
Side-effect audit:
- Two '| head -1' sites converted to '| sed -n 1p' (SIGPIPE-safe under
pipefail; upstream producers no longer die when head closes the pipe):
* _check_version_compat: grep DREAM_VERSION .env
* preset import: tar tzf archive listing
- '|| true' sites reviewed individually; all are intentional tolerance
of known-benign failures (missing optional env keys, idempotent stops,
arithmetic increment quirks) and were left in place.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…mark' Before: 'dream chat' against a dead backend printed '[dream] Sending to ...', emitted an empty line, and exited 0. 'dream benchmark' propagated that silent failure and happily reported sub-second 'Excellent' performance while the LLM was not even running. Root cause: curl was invoked with '-s' (silent, no error-show, no fail-on-HTTP-error) and the response was parsed with 'jq -r .choices[0].message.content // .error.message // "Error: no response"', which masks every failure path behind a human-friendly string on stdout. Fix: - Add a /v1/models pre-flight probe with --fail-with-body and --max-time 3, so an unreachable llama-server fails loudly before the user sees 'Sending to ...'. - Use --show-error --fail-with-body --max-time 30 on the completion call; non-2xx HTTP responses now fail the command and the body is shown. - Parse responses with 'jq -er' so missing '.choices[0].message.content' exits non-zero; on failure we re-parse '.error.message' and surface it through 'error' (exit 1). - cmd_benchmark: capture cmd_chat output via 'if ! response=...' so a failed chat aborts the benchmark instead of reporting a false-positive duration. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ssign
Three commands previously printed warnings or pseudo-success messages
while exiting 0, breaking scripted callers that rely on exit codes to
detect failure.
- cmd_disable: 'sr_resolve' returns its input unchanged for unknown
names, so 'dream disable bogus-service' used to silently stop a
non-existent container ('|| true') and print a success-shaped
message. Mirror the validation pattern already in cmd_enable: if
neither the built-in nor user-extension directory exists for the
resolved id, 'error "Unknown service: <input>"' (exit 1).
- cmd_agent logs: on a fresh install with no host-agent log file, the
logs subcommand warned and returned 0. Return 1 so 'dream agent
logs' fails fast when there is nothing to tail.
- _gpu_reassign: the NVIDIA-only guard and three prerequisite guards
(jq missing, topo lib missing, assign script missing) all exited 0
after warning, so automation could not tell whether reassignment
succeeded or was skipped. Each 'warn ...; return' becomes 'warn ...;
return 1'.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Maintainer audit on PR Light-Heart-Labs#998 (Lightheartdevs, 2026-04-28) flagged that under `set -euo pipefail`, `_check_version_compat` must tolerate a `.env` lacking `DREAM_VERSION` and fall back to `.version` / `manifest.json`. The audit-required form is the trailing `|| true` on the DREAM_VERSION grep pipeline: _COMPAT_INSTALLED_VER=$(grep '^DREAM_VERSION=' "$INSTALL_DIR/.env" 2>/dev/null \ | sed -n '1p' | cut -d= -f2 | tr -d '[:space:]' || true) The rebase onto current main (which carries Light-Heart-Labs#1008's tolerance fix) preserved this form via conflict resolution; this test locks it in. `tests/test-dream-cli-version-compat-pipefail.sh` (4 cases, wired into `make test`): 1. dream-cli runs under strict mode (set -e or stricter). 2. The DREAM_VERSION grep pipeline ends with `|| true` (matched against active code; comment lines stripped before grep so the rationale block can't satisfy the assertion on its own). 3. `.version` and `manifest.json` fallback branches still exist (they are the destinations the `|| true` enables reaching). 4. Anti-regression: the bare-close form (no trailing `|| true`) does NOT appear — catches a future "cleanup" PR removing the guard. Behavioural test was attempted but rejected: `_check_version_compat` depends on `warn`/`log`/`_manifest_field`/`_semver_*` and several file-scope color vars, so isolating it requires too much harness setup. The dream-cli BATS suite (Light-Heart-Labs#1018) carries the behavioural coverage at the right layer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ipelines under pipefail CG follow-up on PR Light-Heart-Labs#998: this branch's first commit (`f8395b9a`) flips `set -e` → `set -eo pipefail`. CG flagged that three split-local-assignment pipelines elsewhere in dream-cli become fragile under the new flag — they read external files whose fields may be absent, partial, or mid-write: 1. `cmd_status` bootstrap-status.json reader (lines 647/650/651) — `dream status` aborts mid-output if `bootstrap-status.json` is mid-write or lacks the `"status":` field. Repro: status check during model bootstrap, partial JSON. 2. `cmd_preset list` meta.txt reader (lines 1954/1955) — a single preset with a malformed `meta.txt` (missing `created=` or `gpu_backend=`) aborts the listing of all subsequent presets. 3. `cmd_chat` jq-fallback (line 1279) — newly added in this PR's `61be91b0` commit. If the LLM returns unparseable JSON, jq exits 5, pipefail propagates, and the function aborts before reaching the friendly `error "LLM error: ${_err:-unparseable response}"` message. Fix: append `|| true` to each of the three pipelines. Inline comments explain the tolerance rationale (one comment block per fix site, citing pipefail-introduction). Sites #1 and #2 are pre-existing code newly exposed by the pipefail flip; site #3 is new code from this PR. None block the audit resolution (the `_check_version_compat` fix is unaffected) — but fixing them in the same PR keeps the strict-mode rollout clean. `make test` green, `make lint` green, `bash -n dream-cli` clean. The 4-case regression test for `_check_version_compat` still passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10588c2 to
3ecbfe1
Compare
|
Pushed audit follow-up — rebased onto current Audit resolution: _COMPAT_INSTALLED_VER=$(grep '^DREAM_VERSION=' "$INSTALL_DIR/.env" 2>/dev/null \
| sed -n '1p' | cut -d= -f2 | tr -d '[:space:]' || true)The Regression test (
Sabotage-validated: stripping CG follow-up (
All 3 sites now have
Promoting from draft. |
What
Three independent error-discipline improvements to
dream-cli:set -etoset -eo pipefail, aligning with CLAUDE.md's project standard. Audit| head -1sites (converted to| sed -n 1pfor SIGPIPE-safety).dream chatanddream benchmarknow actually detect LLM failures instead of silently reporting "success" against a dead backend. Usescurl --fail --show-error --max-time+jq -erstrict parsing, plus a/v1/modelspreflight probe.dream disable,dream agent logs,dream gpu reassign) now exit non-zero on failure/skip conditions, fixing the exit-code contract for scripted callers.Why
Before this PR:
dream chatagainst a dead backend printed[dream] Sending to ..., emitted an empty line, and exited 0.dream benchmarkpropagated that silent failure and happily reported sub-second "Excellent" performance against a non-running LLM. Root cause: curl was invoked with-s(silent, no error-show, no fail-on-HTTP-error) and responses were parsed withjq -r .choices[0].message.content // .error.message // "Error: no response"— every failure path masked behind a human-friendly string on stdout.dream disable bogus-serviceused to silently stop a non-existent container (|| true) and print a success-shaped message. Several other commands exited 0 after warning about skip conditions, so automation couldn't distinguish "success" from "skipped."How
Three commits, each self-contained:
dfaa4c8f— adoptset -eo pipefail; convert two| head -1sites to| sed -n 1p.385372ec— rewrite chat/benchmark with preflight probe + strictcurl --fail+jq -er.cmd_benchmarkusesif ! response=$(cmd_chat ...)so chat failure aborts the benchmark. On the main chat call, the flag is bare--show-error(no--fail*) so HTTP 4xx/5xx responses flow through to the existingjqfallback which extracts.error.messagecleanly — also keeps compat withcurl < 7.76(Ubuntu 20.04, Debian 11, RHEL 8).10588c2a— exit-code contract forcmd_disable,cmd_agent logs,_gpu_reassign.Testing
dream chat "hi"against a stopped llama-server: now exits 1 with "llama-server not reachable..." (previously silent empty output, exit 0).dream benchmarkagainst a stopped backend: now aborts with "Benchmark failed: LLM unreachable..." (previously false-positive sub-second "Excellent").dream disable bogus-service: exits 1 with "Unknown service" (previously exit 0 after silent stop).--fail/ no-flag asymmetry on the two curl sites.Known Considerations
Nounset (
set -u) is not added in this PR — it requires a full bare-variable audit, which lands as the siblingrefactor(dream-cli): enable set -uPR on top of this one.Platform Impact
--fail-with-bodyflag (curl ≥ 7.76, Mar 2021) was deliberately avoided on the main chat call so these older-curl distros work too.Adopting
set -eo pipefailhere turns 7 existinggrep "^KEY=" .env | cut | trpipelines indream-cliinto hard exits when the key is absent. PR #1008 (refactor(dream-cli): guard .env grep pipelines against pipefail kill on missing keys) appends|| trueto those 7 sites so the surrounding defensive[[ -n ... ]]/${var:-default}checks keep working.If this PR merges alone,
dream update,dream rollback,dream enable,dream preset, anddream dry-runcan abort mid-flow on installs that predate (or have manually-edited-out) one of those .env keys. #1008 is preventive onmaintoday and load-bearing the moment this PR lands. Marked as Draft until #1008 is in to make this gate mechanical.📋 Chain status (operator's reminder — apr-25)
This PR sits in the middle of a
dream-clistrict-mode chain:Before this PR can be promoted:
set -eo pipefaillands here).After this PR merges, the next operator step is:
cmd_chat/cmd_benchmark/cmd_disable/cmd_agent/_gpu_reassign,| head -1→sed -n '1p'swaps) drop out automatically. What remains is refactor(dream-cli): enable set -u and add guards for conditional variables #1002's actual unique work: the-uflag and:-empty-string default guards.gh pr ready 1002).#1002 and #1018 are kept in DRAFT as a mechanical safeguard so they cannot merge before this one.