Skip to content

fix(dream-cli): pipefail + surface LLM failures + exit-code contract#998

Open
yasinBursali wants to merge 5 commits intoLight-Heart-Labs:mainfrom
yasinBursali:fix/dream-cli-pipefail-exit-codes
Open

fix(dream-cli): pipefail + surface LLM failures + exit-code contract#998
yasinBursali wants to merge 5 commits intoLight-Heart-Labs:mainfrom
yasinBursali:fix/dream-cli-pipefail-exit-codes

Conversation

@yasinBursali
Copy link
Copy Markdown
Contributor

@yasinBursali yasinBursali commented Apr 23, 2026

What

Three independent error-discipline improvements to dream-cli:

  1. Promote shell options from set -e to set -eo pipefail, aligning with CLAUDE.md's project standard. Audit | head -1 sites (converted to | sed -n 1p for SIGPIPE-safety).
  2. dream chat and dream benchmark now actually detect LLM failures instead of silently reporting "success" against a dead backend. Uses curl --fail --show-error --max-time + jq -er strict parsing, plus a /v1/models preflight probe.
  3. Three commands (dream disable, dream agent logs, dream gpu reassign) now exit non-zero on failure/skip conditions, fixing the exit-code contract for scripted callers.

Why

Before this PR: dream chat against a dead backend printed [dream] Sending to ..., emitted an empty line, and exited 0. dream benchmark propagated that silent failure and happily reported sub-second "Excellent" performance against a non-running LLM. Root cause: curl was invoked with -s (silent, no error-show, no fail-on-HTTP-error) and responses were parsed with jq -r .choices[0].message.content // .error.message // "Error: no response" — every failure path masked behind a human-friendly string on stdout.

dream disable bogus-service used to silently stop a non-existent container (|| true) and print a success-shaped message. Several other commands exited 0 after warning about skip conditions, so automation couldn't distinguish "success" from "skipped."

How

Three commits, each self-contained:

  • dfaa4c8f — adopt set -eo pipefail; convert two | head -1 sites to | sed -n 1p.
  • 385372ec — rewrite chat/benchmark with preflight probe + strict curl --fail + jq -er. cmd_benchmark uses if ! response=$(cmd_chat ...) so chat failure aborts the benchmark. On the main chat call, the flag is bare --show-error (no --fail*) so HTTP 4xx/5xx responses flow through to the existing jq fallback which extracts .error.message cleanly — also keeps compat with curl < 7.76 (Ubuntu 20.04, Debian 11, RHEL 8).
  • 10588c2a — exit-code contract for cmd_disable, cmd_agent logs, _gpu_reassign.

Testing

  • dream chat "hi" against a stopped llama-server: now exits 1 with "llama-server not reachable..." (previously silent empty output, exit 0).
  • dream benchmark against a stopped backend: now aborts with "Benchmark failed: LLM unreachable..." (previously false-positive sub-second "Excellent").
  • dream disable bogus-service: exits 1 with "Unknown service" (previously exit 0 after silent stop).
  • Round-1 review: Critique Guardian approved. Round-2 adversarial audit: verified across 5 failure paths including the --fail / no-flag asymmetry on the two curl sites.

Known Considerations

Nounset (set -u) is not added in this PR — it requires a full bare-variable audit, which lands as the sibling refactor(dream-cli): enable set -u PR on top of this one.

Platform Impact

  • macOS: chat/benchmark runs locally against llama-server; curl 8.x tested.
  • Linux Ubuntu 20.04 / Debian 11 / RHEL 8: the --fail-with-body flag (curl ≥ 7.76, Mar 2021) was deliberately avoided on the main chat call so these older-curl distros work too.
  • Windows (WSL2): same behavior as Linux.

⚠️ Merge order — must land #1008 first

Adopting set -eo pipefail here turns 7 existing grep "^KEY=" .env | cut | tr pipelines in dream-cli into hard exits when the key is absent. PR #1008 (refactor(dream-cli): guard .env grep pipelines against pipefail kill on missing keys) appends || true to those 7 sites so the surrounding defensive [[ -n ... ]] / ${var:-default} checks keep working.

If this PR merges alone, dream update, dream rollback, dream enable, dream preset, and dream dry-run can abort mid-flow on installs that predate (or have manually-edited-out) one of those .env keys. #1008 is preventive on main today and load-bearing the moment this PR lands. Marked as Draft until #1008 is in to make this gate mechanical.


📋 Chain status (operator's reminder — apr-25)

This PR sits in the middle of a dream-cli strict-mode chain:

#1008 (READY) ──▶ #998 (this — DRAFT) ──▶ #1002 (DRAFT)
                                     └──▶ #1018 (DRAFT)

Before this PR can be promoted:

After this PR merges, the next operator step is:

  1. Rebase refactor(dream-cli): enable set -u and add guards for conditional variables #1002 onto post-fix(dream-cli): pipefail + surface LLM failures + exit-code contract #998 main. The 6+ shared hunks (line 6 set-line, identical sub-hunks in cmd_chat/cmd_benchmark/cmd_disable/cmd_agent/_gpu_reassign, | head -1sed -n '1p' swaps) drop out automatically. What remains is refactor(dream-cli): enable set -u and add guards for conditional variables #1002's actual unique work: the -u flag and :- empty-string default guards.
  2. Promote refactor(dream-cli): enable set -u and add guards for conditional variables #1002 from DRAFT → ready (gh pr ready 1002).
  3. Same pattern for test(dream-cli): BATS regression shield for 5 dream-cli / supporting behaviors #1018 once its other deps (fix(dream-cli): schema-driven secret masking + macOS Bash 4 validation #994, fix(dashboard,dashboard-api): sentinel-based setup wizard success detection #1003, fix(dream-cli): Apple GPU output polish + compose wrapper SIGINT/zero-match #1016) have landed.

#1002 and #1018 are kept in DRAFT as a mechanical safeguard so they cannot merge before this one.

@Lightheartdevs
Copy link
Copy Markdown
Collaborator

Audit follow-up: keep draft / rebase on current main.

The focused pipefail guard from #1008 is now merged. This draft still needs to absorb or preserve that fix: _check_version_compat must tolerate a .env without DREAM_VERSION and fall back to .version/manifest.json instead of exiting under pipefail.

yasinBursali and others added 5 commits April 29, 2026 03:39
Promote shell option from set -e to set -eo pipefail so failures in
upstream pipeline commands propagate and surface the real exit code
(not the status of the last stage). Aligns dream-cli with the project
standard (CLAUDE.md: 'set -euo pipefail everywhere').

Scope is deliberately narrow — nounset (-u) is intentionally not added
here because it requires a full bare-variable audit.

Side-effect audit:
- Two '| head -1' sites converted to '| sed -n 1p' (SIGPIPE-safe under
  pipefail; upstream producers no longer die when head closes the pipe):
    * _check_version_compat: grep DREAM_VERSION .env
    * preset import:          tar tzf archive listing
- '|| true' sites reviewed individually; all are intentional tolerance
  of known-benign failures (missing optional env keys, idempotent stops,
  arithmetic increment quirks) and were left in place.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…mark'

Before: 'dream chat' against a dead backend printed '[dream] Sending to
...', emitted an empty line, and exited 0. 'dream benchmark' propagated
that silent failure and happily reported sub-second 'Excellent'
performance while the LLM was not even running.

Root cause: curl was invoked with '-s' (silent, no error-show, no
fail-on-HTTP-error) and the response was parsed with 'jq -r
.choices[0].message.content // .error.message // "Error: no response"',
which masks every failure path behind a human-friendly string on stdout.

Fix:
- Add a /v1/models pre-flight probe with --fail-with-body and
  --max-time 3, so an unreachable llama-server fails loudly before the
  user sees 'Sending to ...'.
- Use --show-error --fail-with-body --max-time 30 on the completion
  call; non-2xx HTTP responses now fail the command and the body is
  shown.
- Parse responses with 'jq -er' so missing '.choices[0].message.content'
  exits non-zero; on failure we re-parse '.error.message' and surface
  it through 'error' (exit 1).
- cmd_benchmark: capture cmd_chat output via 'if ! response=...' so a
  failed chat aborts the benchmark instead of reporting a false-positive
  duration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ssign

Three commands previously printed warnings or pseudo-success messages
while exiting 0, breaking scripted callers that rely on exit codes to
detect failure.

- cmd_disable: 'sr_resolve' returns its input unchanged for unknown
  names, so 'dream disable bogus-service' used to silently stop a
  non-existent container ('|| true') and print a success-shaped
  message. Mirror the validation pattern already in cmd_enable: if
  neither the built-in nor user-extension directory exists for the
  resolved id, 'error "Unknown service: <input>"' (exit 1).

- cmd_agent logs: on a fresh install with no host-agent log file, the
  logs subcommand warned and returned 0. Return 1 so 'dream agent
  logs' fails fast when there is nothing to tail.

- _gpu_reassign: the NVIDIA-only guard and three prerequisite guards
  (jq missing, topo lib missing, assign script missing) all exited 0
  after warning, so automation could not tell whether reassignment
  succeeded or was skipped. Each 'warn ...; return' becomes 'warn ...;
  return 1'.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Maintainer audit on PR Light-Heart-Labs#998 (Lightheartdevs, 2026-04-28) flagged that
under `set -euo pipefail`, `_check_version_compat` must tolerate a
`.env` lacking `DREAM_VERSION` and fall back to `.version` /
`manifest.json`. The audit-required form is the trailing `|| true`
on the DREAM_VERSION grep pipeline:

    _COMPAT_INSTALLED_VER=$(grep '^DREAM_VERSION=' "$INSTALL_DIR/.env" 2>/dev/null \
        | sed -n '1p' | cut -d= -f2 | tr -d '[:space:]' || true)

The rebase onto current main (which carries Light-Heart-Labs#1008's tolerance fix)
preserved this form via conflict resolution; this test locks it in.

`tests/test-dream-cli-version-compat-pipefail.sh` (4 cases, wired
into `make test`):

1. dream-cli runs under strict mode (set -e or stricter).
2. The DREAM_VERSION grep pipeline ends with `|| true` (matched against
   active code; comment lines stripped before grep so the rationale
   block can't satisfy the assertion on its own).
3. `.version` and `manifest.json` fallback branches still exist (they
   are the destinations the `|| true` enables reaching).
4. Anti-regression: the bare-close form (no trailing `|| true`)
   does NOT appear — catches a future "cleanup" PR removing the guard.

Behavioural test was attempted but rejected: `_check_version_compat`
depends on `warn`/`log`/`_manifest_field`/`_semver_*` and several
file-scope color vars, so isolating it requires too much harness
setup. The dream-cli BATS suite (Light-Heart-Labs#1018) carries the behavioural
coverage at the right layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ipelines under pipefail

CG follow-up on PR Light-Heart-Labs#998: this branch's first commit (`f8395b9a`)
flips `set -e` → `set -eo pipefail`. CG flagged that three
split-local-assignment pipelines elsewhere in dream-cli become
fragile under the new flag — they read external files whose fields
may be absent, partial, or mid-write:

1. `cmd_status` bootstrap-status.json reader (lines 647/650/651) —
   `dream status` aborts mid-output if `bootstrap-status.json` is
   mid-write or lacks the `"status":` field. Repro: status check
   during model bootstrap, partial JSON.

2. `cmd_preset list` meta.txt reader (lines 1954/1955) — a single
   preset with a malformed `meta.txt` (missing `created=` or
   `gpu_backend=`) aborts the listing of all subsequent presets.

3. `cmd_chat` jq-fallback (line 1279) — newly added in this PR's
   `61be91b0` commit. If the LLM returns unparseable JSON, jq
   exits 5, pipefail propagates, and the function aborts before
   reaching the friendly `error "LLM error: ${_err:-unparseable
   response}"` message.

Fix: append `|| true` to each of the three pipelines. Inline
comments explain the tolerance rationale (one comment block per
fix site, citing pipefail-introduction).

Sites #1 and #2 are pre-existing code newly exposed by the pipefail
flip; site #3 is new code from this PR. None block the audit
resolution (the `_check_version_compat` fix is unaffected) — but
fixing them in the same PR keeps the strict-mode rollout clean.

`make test` green, `make lint` green, `bash -n dream-cli` clean.
The 4-case regression test for `_check_version_compat` still passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yasinBursali yasinBursali force-pushed the fix/dream-cli-pipefail-exit-codes branch from 10588c2 to 3ecbfe1 Compare April 29, 2026 00:52
@yasinBursali
Copy link
Copy Markdown
Contributor Author

Pushed audit follow-up — rebased onto current upstream/main, conflict resolved, audit-required || true preserved, 4-case regression test added, plus 3 latent split-local pipefail bombs fixed (CG follow-up).

Audit resolution:
The conflict in _check_version_compat was exactly the audit's target — branch's commit dfaa4c8f had stripped the || true from the DREAM_VERSION grep pipeline; current upstream/main (post-#1008) carries the audit-required tolerance. Conflict resolution kept upstream's || true form with an inline rationale comment citing #1008 + this audit so a future "cleanup" reviewer cannot remove it without context.

_COMPAT_INSTALLED_VER=$(grep '^DREAM_VERSION=' "$INSTALL_DIR/.env" 2>/dev/null \
    | sed -n '1p' | cut -d= -f2 | tr -d '[:space:]' || true)

The .version and manifest.json fallback branches that follow are intact and reachable.

Regression test (9d7aa45e): tests/test-dream-cli-version-compat-pipefail.sh (4 cases, wired into make test):

  • dream-cli runs under strict mode (precondition).
  • DREAM_VERSION grep pipeline ends with || true (active code; comments stripped before grep so the rationale block alone can't satisfy the assertion).
  • .version and manifest.json fallback branches present.
  • Anti-regression: bare-close form (no trailing || true) does NOT appear.

Sabotage-validated: stripping || true from the pipeline makes cases 2 and 4 both fail.

CG follow-up (3ecbfe13): CG returned APPROVED WITH WARNINGS and flagged 3 latent split-local pipelines now fragile under the pipefail flip:

  • cmd_status reading bootstrap-status.json (lines 647/650/651) — could abort dream status mid-output on a partial/mid-write JSON.
  • cmd_preset list reading meta.txt (lines 1954/1955) — a single malformed preset would abort the listing of all subsequent ones.
  • cmd_chat jq-fallback (line 1279, newly added in 61be91b0) — unparseable LLM JSON would silently abort instead of reaching the friendly LLM error: ${_err:-unparseable response}.

All 3 sites now have || true with inline rationale comments. Sites #1 and #2 are pre-existing code newly exposed by this PR's pipefail flip; site #3 is new code from this PR.

make lint green, make test green (including the new pipefail-tolerance regression).

Promoting from draft.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants