Skip to content

feat(errors): clear, actionable error messages#28

Merged
anneschuth merged 4 commits into
mainfrom
feat/honest-error-reporting
Jun 19, 2026
Merged

feat(errors): clear, actionable error messages#28
anneschuth merged 4 commits into
mainfrom
feat/honest-error-reporting

Conversation

@uittenbroekrobbert

@uittenbroekrobbert uittenbroekrobbert commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

What this does

Makes the CLI's errors genuinely helpful: every failure says what's wrong, where to look, and what to do next — instead of a bare HTTP 500 / Task failed.

✗ Your application didn't start on the cluster (the deploy reached the cluster; the workload failed).
  Source: your application (cluster runtime)

  web (ImagePull): Back-off pulling image ghcr.io/org/app:bad
      Error: manifest unknown

  → Check the image tag exists and the registry is reachable / credentials are set.
  → Inspect `zad logs -d <deployment>` and `zad deployment describe <deployment>`.

The upstream API already provides the signal for this (ErrorCategory, ComponentFailureInfo with log tails, HTTPValidationError, error_type) — the CLI just wasn't using it. A new diagnosis layer (api/errors.py) turns those signals into a clear message with a neutral source label (so you know where to look, without finger-pointing) and an actionable next step.

Clearer messages everywhere

  • Bad input (422): validation arrays become readable field lines (components.0.image: field required) instead of a Python-dict dump.
  • Failed deploys: the failing component, its category, and a tail of its logs — plus the concrete fix.
  • Auth / not-found / conflict: plain language + the exact thing to set or check.
  • Platform/network: "usually transient — retry"; honest and short.
  • Succeeded-but-degraded: a deploy that applies cleanly but leaves a component crash-looping now shows a with the fix, instead of a green success line.
  • At a glance: an Issues indicator in deployment list and project status, so degraded deployments aren't silently green.

When the API doesn't give enough signal, the message says so and points at the logs — it never guesses.

CI/CD friendly

  • --output json emits a structured diagnosis on stdout; branch on .fault (UserInput / UserApp / UserConfig / Auth / Platform / Network / Unknown).
  • Exit codes: 0 success · 1 something to fix on your side · 2 platform/transient (safe to retry).
  • --strict turns degraded-but-successful into a non-zero exit, so pipelines fail on an unhealthy app.
  • Diagnostics → stderr, data → stdout; ascii-safe glyphs and no ANSI in non-TTY logs.

Spec-driven, low-maintenance

CATEGORY_FAULT / CATEGORY_HINT are keyed by ErrorCategory, and tests/test_spec_conformance.py asserts the enum matches the vendored api/upstream-openapi.json and that every category is mapped. If the API adds a category, the daily api-sync workflow surfaces it as a failing test → an actionable PR. Runtime parsing degrades gracefully on anything new (loose coupling).

Tests

118 pass (was 91): new test_errors.py and test_spec_conformance.py, plus extensions to test_client.py, test_output.py, and test_deployment.py. ruff check/format clean.

claude[bot]
claude Bot previously requested changes Jun 8, 2026

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

One significant bug (Rich markup leaking into JSON/YAML output), one documentation error, and two minor points.


🟠 Rich markup leaks into --output json / --output yaml

deployment list embeds Rich markup tags directly in the data dict (commands/deployment.py lines 52–53). formatter.render() in JSON/YAML mode calls json.dumps(data) without markup stripping, so consumers get literal strings like:

{"status": "[green]Active[/green]", "issues": "[yellow]1 ImagePull[/yellow]"}

This breaks the CLAUDE.md contract "All commands must respect --output" and would corrupt any CI pipeline using deployment list -o json | jq.

Fix: keep raw values in the data dict; apply markup only in the _table path. The _table method already renders Rich markup from cell strings, so the simplest fix is to only wrap with color markup when building the table display. There is no --output json test for deployment list, so this regression is untested.


🟠 README example redirects the wrong stream

The JSON-mode CI example captures stderr (2>err.json), but render_diagnosis in JSON mode emits to stdout (print(json.dumps(...)) with no file=sys.stderr). The error object is on stdout; err.json will be empty; jq .fault fails.

Correct form:

zad deployment create app -c web=img:tag -o json > out.json || jq -r .fault out.json

🟡 Fault.UNKNOWN mapped to exit code 1 ("your fault")

FAULT_EXIT_CODE[Fault.UNKNOWN] = 1 tells CI/CD scripts to treat unknown failures as user-fixable. The module docstring says "never claim more certainty than the data supports" — exit code 2, or a dedicated code, would be more honest: the caller can't know whether to retry or fix.


🟡 Conformance tests silently skip if spec is absent

test_spec_conformance.py uses pytest.skip when upstream-openapi.json is missing. The file exists today, but if it ever moves or the path changes, all three conformance tests pass silently as skipped — the spec-drift enforcement disappears without any CI signal.

@uittenbroekrobbert uittenbroekrobbert changed the title feat(errors): honest, source-labelled error reporting feat(errors): clear, actionable error messages Jun 8, 2026
@claude claude Bot dismissed their stale review June 8, 2026 13:13

Superseded by new review

claude[bot]
claude Bot previously requested changes Jun 8, 2026

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

One significant documentation bug that breaks the primary CI/CD example, plus two minor points. No backwards-incompatible changes, no design principle violations, no security issues.


🟠 README stream redirect is wrong (README.md line 74)

The example redirects stderr:

zad deployment create app -c web=img:tag -o json 2>err.json || jq -r .fault err.json

But render_diagnosis in json mode calls print(json.dumps(...)) — stdout, not stderr. err.json will be empty; jq fails. Correct form:

zad deployment create app -c web=img:tag -o json > out.json || jq -r .fault out.json

🟡 Fault.UNKNOWN mapped to exit code 1 ("your fault — fix it") — see inline

🟡 Spec conformance tests skip silently if spec file moves — see inline

@anneschuth

Copy link
Copy Markdown
Member

Review-opmerkingen verwerkt:

  • README regel 74: de json-diagnose gaat naar stdout (render_diagnosis print zonder file=), dus het voorbeeld is nu > out.json in plaats van 2>err.json. Anders bleef err.json leeg en faalde jq.
  • Fault.UNKNOWN zat op exit code 1 ('jouw fout'), wat botst met de honesty-lijn van de module: bij UNKNOWN weten we juist niet wiens fout het is. UNKNOWN krijgt nu een eigen exit code 3 (niet 'jouw fout', niet 'veilig retryen'). README- en CLAUDE-tabel zijn bijgewerkt, plus de test.
  • test_spec_conformance.py skipte stil als het specbestand ontbrak; dat liet de hele conformance-check ongemerkt groen worden als het bestand zou verplaatsen. Nu een harde fail met instructie om _SPEC_PATH bij te werken.

Daarnaast em-dashes uit de gebruikersgerichte foutmeldingen en docstrings gehaald. Branch gerebased op main, tests groen (118), lint en format schoon.

@claude claude Bot dismissed their stale review June 19, 2026 12:09

Superseded by new review

@anneschuth anneschuth force-pushed the feat/honest-error-reporting branch from c8bcb51 to 826e37e Compare June 19, 2026 12:12

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Two significant issues and two minor ones. No design-principle violations, no security issues. The diagnosis layer itself is well-structured.


🔴 Critical — Rich markup leaks into --output json / --output yaml for status and issues (deployment.py)

Before this PR status was a plain string ("Active"). After, it is "[green]Active[/green]". OutputFormatter._json calls json.dumps on the raw rows dict without stripping markup, so existing consumers of zad deployment list -o json now receive "[green]Active[/green]" instead of "Active". The new issues column likewise carries "[yellow]1 ImagePull[/yellow]" in json/yaml output. Both violate the additive-only change policy: the status value format changed, breaking downstream JSON consumers.

Fix: store the plain value in the dict for serialisation; apply markup only in table mode (e.g. via a Rich column formatter or a _table-only display pass).


🟠 Significant — --strict always exits 1, ignoring diagnosis.exit_code (helpers.py)

degraded_diagnoses can return a Fault.UNKNOWN diagnosis (for partial/degraded status), which has exit code 3. surface_warnings hard-codes raise typer.Exit(1) regardless. A CI consumer branching on exit code 3 would not see it from a degraded-status warning under --strict.

Fix:

worst = max(d.exit_code for d in diagnoses)
raise typer.Exit(worst)

🟠 Significant — exit code changed from always-1 to 1/2/3 (documented but behavioral breaking change)

The backwards-compat policy is silent on exit codes, and the PR documents the new scheme clearly. However, teams checking if [ $? -eq 1 ] to detect any error will silently miss platform errors (now exit 2) and unknown failures (now exit 3). Worth calling out explicitly in changelog/release notes as a known breaking change for scripts that check specific exit codes.


🟡 Minor — degraded_diagnoses mutates Diagnosis after construction (errors.py)

diag = diagnose_task_failure(None, result_dict)
diag.headline = "The operation succeeded, but some components are unhealthy."

The dataclass is not frozen so this works, but it is non-obvious. Prefer passing the headline as an argument or having diagnose_task_failure accept an optional override.


🟡 Minor — fallback ZadApiError(0, "Request failed") in _request has no diagnosis (client.py)

raise last_error or ZadApiError(0, "Request failed")

last_error is always set to self._http_error(response) (diagnosed), but the bare fallback has no .diagnosis, breaking the invariant that every ZadApiError carries one. Likely unreachable in practice, but worth fixing for consistency.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

One critical backwards-incompatible regression in JSON output, one significant gap in test coverage for it, and a few minor items.

🔴 Critical — Rich markup embedded in data dict corrupts JSON/YAML output

deployment list --output json now emits:
```json
{"status": "[green]Active[/green]", "issues": "[yellow]1 ImagePull[/yellow]", ...}
```
instead of clean strings. The status field change is strictly backwards-incompatible: any team doing `zad deployment list -o json | jq '.[].status'` gets markup tags instead of a clean string. The issues column is new so it isn't a compat break, but it's still wrong format for JSON consumers.

The fix is to keep raw values in the data dict and apply Rich markup only inside the formatter's table-rendering path (e.g., wrap the colorization in an if self.fmt == "table" guard, or pass a separate display-values dict to the formatter).

See inline comments on the specific lines.

🟠 Significant — No JSON-mode test for deployment list

test_list_shows_issues_column only invokes without --output json, so the markup leakage goes undetected in CI. Given the PR's explicit CI/CD framing (structured JSON, pipe-safe stdout), a test like CliRunner().invoke(app, ["deployment", "list", "--output", "json"]) + json.loads(result.output)[0]["status"] == "Degraded" is the natural companion.

🟠 Significant — surface_warnings not wired into component / service mutating commands

CLAUDE.md (in this very PR) states: "After any mutating op, call `surface_warnings`." Commands that modify a deployment's component set (component add, component assign) or attach services (service add) could surface a degraded component after the fact, but they don't call `surface_warnings`. Delete commands are probably safe to skip (no component health in the result), but add/assign operations seem like gaps.

🟡 Minor — Timeout uses Fault.UNKNOWN (exit 3) but the message says "may still be running"

```python
headline=f"Timed out after {self.task_timeout}s waiting for the task; it may still be running.",
next_steps=["This is a wait limit, not a failure. Check zad task status <id>."],
```
The headline describes a wait limit (transient, the task may succeed), which maps naturally to Fault.PLATFORM / exit 2 ("safe to retry"). UNKNOWN / exit 3 means "the API gave no signal to attribute the failure" — that's not the case here. A CI/CD pipeline should know it can retry a timeout, not just "check the logs."

🟡 Minor — degraded_diagnoses mutates the returned Diagnosis in place

```python
diag = diagnose_task_failure(None, result_dict)
diag.headline = "The operation succeeded, but some components are unhealthy."
```
Works fine (unfrozen dataclass), but constructing with the correct headline directly avoids the surprise. Pass headline= to diagnose_task_failure or construct the Diagnosis directly.

uittenbroekrobbert and others added 3 commits June 19, 2026 14:16
The CLI collapsed every failure into a bare `HTTP 500` / `Task failed`
string, making user-app and user-input problems look like the ZAD
platform was broken. The backend already attributes failures accurately
(ErrorCategory, ComponentFailureInfo with logs, HTTPValidationError,
error_type) — this surfaces that data honestly.

Add a diagnosis layer (api/errors.py): every failure maps to a Fault
(UserInput / UserApp / UserConfig / Auth / Platform / Network / Unknown)
with a neutral source label ("Source: your application (cluster
runtime)"), the concrete message, the backend's own explanation, and a
next step. Never claims more certainty than the data supports — no
category means Unknown + "see logs", not a guess.

- Parse FastAPI 422 validation arrays into readable field paths.
- Dig into failed-task component_failures (with log tails) and categories.
- Surface degraded-but-successful state (warnings, crash-looping
  components) instead of a green success line.
- Show an Issues indicator in `deployment list` and `project status`.

CI/CD:
- `--output json` emits a structured diagnosis; branch on `.fault`.
- Fault-mapped exit codes: 1 = your fault, 2 = platform/transient.
- `--strict` fails the build when an op succeeds but reports warnings.
- Diagnostics to stderr, data to stdout; ascii-safe glyphs in non-TTY.

Spec coupling: CATEGORY_FAULT/CATEGORY_HINT are keyed by ErrorCategory,
and test_spec_conformance.py asserts the enum matches the vendored
OpenAPI spec and that every category is mapped — drift fails CI, so the
daily api-sync workflow turns it into an actionable PR.
Drop the defensive "not your request" / "not you" asides from error
headlines and docs. The neutral Source: line already says where to look,
so headlines lead with what's wrong and what to do — not with deflecting
blame. Same behaviour, friendlier and more useful tone.
- README: json-diagnose gaat naar stdout, dus > out.json i.p.v. 2>err.json
- UNKNOWN krijgt eigen exit code 3 (niet 1 'jouw fout', niet 2 'veilig retryen')
  zodat de exit code de honesty-lijn volgt; README- en CLAUDE-tabel bijgewerkt
- spec-conformance: ontbrekend specbestand faalt hard i.p.v. stil te skippen
- verwijder em-dashes uit gebruikersgerichte foutmeldingen en docstrings
@anneschuth anneschuth force-pushed the feat/honest-error-reporting branch from 826e37e to 189e58f Compare June 19, 2026 12:16
claude[bot]
claude Bot previously requested changes Jun 19, 2026

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diagnosis layer (api/errors.py), the spec-conformance tests, and the --strict / exit-code design are all solid. One backwards-incompatible regression needs fixing before merge; one related design issue should be fixed in the same pass.


🔴 Critical — deployment list --output json now emits Rich markup in status

See inline comment on commands/deployment.py. On main, status is a plain string ("Active"). After this PR it is "[green]Active[/green]" in the serialised JSON, breaking any existing JSON consumer. The fix is a one-liner: keep the raw value in the rows dict and apply markup only inside the table renderer (the same way describe does it elsewhere in the codebase).

🟠 Significant — issues column also embeds Rich markup in JSON

See inline comment on helpers.py:issues_cell. Because this is a new column the strict backwards-compat bar doesn't apply, but it has the same design flaw: "[yellow]1 ImagePull[/yellow]" is not machine-readable. Fix alongside the status change — return a plain string and colour it only in the table path.

🟡 Minor — --strict exit code is hardcoded to 1

See inline comment on helpers.py:surface_warnings. Inconsistent with the documented exit code contract: a PLATFORM-fault warning (retryable) would exit 1 instead of 2 under --strict. Easy fix: use max(d.exit_code for d in diagnoses).


Everything else checks out:

  • No CLI commands, options, or positional arguments removed ✓
  • No public ZadClient method signatures broken — new diagnosis params are keyword-only with None defaults ✓
  • @handle_api_errors unchanged on existing commands ✓
  • --dry-run / --yes conventions unchanged ✓
  • Data → stdout, diagnostics → stderr separation is correct ✓
  • Spec-conformance test path (api/upstream-openapi.json) matches the actual file location ✓
  • 118 tests covering the new paths, including conformance and unit tests for every fault type ✓

Onder --strict gaf een degraded-but-successful warning altijd exit 1, ook voor
retryable PLATFORM/NETWORK-faults (contract: exit 2) en UNKNOWN (exit 3). Gebruik
nu max(d.exit_code) over de diagnoses zodat de exit code het gedocumenteerde
contract volgt.
@anneschuth

Copy link
Copy Markdown
Member

Bedankt voor de review. Eén punt verwerkt, twee weerlegd met bewijs:

🟡 --strict exit code (terecht, gefixed): surface_warnings hardcodede exit 1. Gebruikt nu max(d.exit_code for d in diagnoses) zodat een retryable PLATFORM/NETWORK-warning exit 2 geeft en UNKNOWN exit 3, conform het contract. Test toegevoegd (test_create_strict_exit_code_follows_fault_contract).

🔴 deployment list --output json zou Rich-markup lekken (false positive): de markup-rows worden pas gebouwd na een early-return voor json/yaml. Regel 41-43 van commands/deployment.py:

if formatter.fmt in ("json", "yaml"):
    formatter.render(deployments)   # ruwe data, geen markup
    return

Empirisch geverifieerd: deployment list --output json geeft status: "Degraded" (ruwe waarde), geen [green]/[red] in de output. Geen backwards-incompat regressie.

🟠 issues-kolom in JSON (false positive, zelfde reden): issues_cell wordt alleen aangeroepen in het table-pad. project status heeft dezelfde early-return voor json/yaml. De markup komt nooit in de geserialiseerde JSON terecht.

Tests groen (128), lint en format schoon. Branch is up-to-date met main.

@claude claude Bot dismissed their stale review June 19, 2026 12:26

Superseded by new review

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All issues from the previous round are resolved. Here is what I checked.


Previous findings — status

🔴 Critical (markup in JSON status) — Not actually a bug. The if formatter.fmt in ("json", "yaml"): formatter.render(deployments); return early exit already existed on main before this PR. Table-mode rows with Rich markup are never reached in JSON/YAML mode. The issues_cell() markup is likewise only used in the table path.

🟠 Significant (markup in JSON issues) — Same conclusion: issues is table-only; formatter.render(deployments) returns the raw API response in JSON mode, so neither status nor issues markup appears in JSON output.

🟡 Minor (--strict exit code hardcoded to 1) — Fixed in 8ffb8f4: surface_warnings now uses max(d.exit_code for d in diagnoses), honoring the per-fault contract. Covered by test_create_strict_exit_code_follows_fault_contract.


Backwards compatibility

  • No CLI commands, options, or positional arguments removed. ✓
  • No public ZadClient methods removed. ✓
  • New diagnosis params on all three exception constructors are keyword-only with None defaults — no required positional args reduced. test_backwards_compat.py baseline passes unchanged. ✓
  • --strict is a new additive global flag. ✓

Design principles

  • @handle_api_errors unchanged on all existing commands. ✓
  • --dry-run / --yes conventions unchanged. ✓
  • Data → stdout, diagnostics → stderr (and diagnosis JSON → stdout for failures, where there is no other stdout payload). ✓
  • surface_warnings follows the template: called after render_success, exits non-zero under --strict via typer.Exit. ✓
  • --strict registered in _GlobalOptionsGroup._FLAGS so it is hoisted correctly. ✓

API client patterns

  • _http_error is a private static helper — clean, no public API surface added. ✓
  • diagnosis param added to ZadApiError, TaskFailedError, TaskTimeoutError — all additive. ✓
  • New Pydantic models (ComponentFailureInfo, SubtaskStatus, ProcessingStatus) use model_validate with extra = 'ignore' semantics via Pydantic defaults — loose coupling preserved. ✓

Test coverage

118 tests (was 91). New test_errors.py and test_spec_conformance.py cover the diagnosis layer and spec drift detection. test_client.py exercises diagnosed exceptions end-to-end. test_output.py verifies JSON → stdout and warnings → stderr separation. ✓

One minor gap worth noting

🟡 There is no test for deployment list --output json in tests/commands/test_deployment.py. The JSON early-return path is correct by inspection, but a one-liner smoke test (assert "[green]" not in json.loads(...)["status"]) would make any future regression visible without reading the code. Not blocking.

Security

No hardcoded credentials, no shell injection paths. ✓

@anneschuth anneschuth merged commit 4fee19f into main Jun 19, 2026
9 checks passed
@anneschuth anneschuth deleted the feat/honest-error-reporting branch June 19, 2026 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants