feat(errors): clear, actionable error messages#28
Conversation
There was a problem hiding this comment.
Review
One significant bug (Rich markup leaking into JSON/YAML output), one documentation error, and two minor points.
🟠 Rich markup leaks into --output json / --output yaml
deployment list embeds Rich markup tags directly in the data dict (commands/deployment.py lines 52–53). formatter.render() in JSON/YAML mode calls json.dumps(data) without markup stripping, so consumers get literal strings like:
{"status": "[green]Active[/green]", "issues": "[yellow]1 ImagePull[/yellow]"}This breaks the CLAUDE.md contract "All commands must respect --output" and would corrupt any CI pipeline using deployment list -o json | jq.
Fix: keep raw values in the data dict; apply markup only in the _table path. The _table method already renders Rich markup from cell strings, so the simplest fix is to only wrap with color markup when building the table display. There is no --output json test for deployment list, so this regression is untested.
🟠 README example redirects the wrong stream
The JSON-mode CI example captures stderr (2>err.json), but render_diagnosis in JSON mode emits to stdout (print(json.dumps(...)) with no file=sys.stderr). The error object is on stdout; err.json will be empty; jq .fault fails.
Correct form:
zad deployment create app -c web=img:tag -o json > out.json || jq -r .fault out.json🟡 Fault.UNKNOWN mapped to exit code 1 ("your fault")
FAULT_EXIT_CODE[Fault.UNKNOWN] = 1 tells CI/CD scripts to treat unknown failures as user-fixable. The module docstring says "never claim more certainty than the data supports" — exit code 2, or a dedicated code, would be more honest: the caller can't know whether to retry or fix.
🟡 Conformance tests silently skip if spec is absent
test_spec_conformance.py uses pytest.skip when upstream-openapi.json is missing. The file exists today, but if it ever moves or the path changes, all three conformance tests pass silently as skipped — the spec-drift enforcement disappears without any CI signal.
There was a problem hiding this comment.
Review
One significant documentation bug that breaks the primary CI/CD example, plus two minor points. No backwards-incompatible changes, no design principle violations, no security issues.
🟠 README stream redirect is wrong (README.md line 74)
The example redirects stderr:
zad deployment create app -c web=img:tag -o json 2>err.json || jq -r .fault err.jsonBut render_diagnosis in json mode calls print(json.dumps(...)) — stdout, not stderr. err.json will be empty; jq fails. Correct form:
zad deployment create app -c web=img:tag -o json > out.json || jq -r .fault out.json🟡 Fault.UNKNOWN mapped to exit code 1 ("your fault — fix it") — see inline
🟡 Spec conformance tests skip silently if spec file moves — see inline
146b7b1 to
c8bcb51
Compare
|
Review-opmerkingen verwerkt:
Daarnaast em-dashes uit de gebruikersgerichte foutmeldingen en docstrings gehaald. Branch gerebased op main, tests groen (118), lint en format schoon. |
c8bcb51 to
826e37e
Compare
There was a problem hiding this comment.
Review
Two significant issues and two minor ones. No design-principle violations, no security issues. The diagnosis layer itself is well-structured.
🔴 Critical — Rich markup leaks into --output json / --output yaml for status and issues (deployment.py)
Before this PR status was a plain string ("Active"). After, it is "[green]Active[/green]". OutputFormatter._json calls json.dumps on the raw rows dict without stripping markup, so existing consumers of zad deployment list -o json now receive "[green]Active[/green]" instead of "Active". The new issues column likewise carries "[yellow]1 ImagePull[/yellow]" in json/yaml output. Both violate the additive-only change policy: the status value format changed, breaking downstream JSON consumers.
Fix: store the plain value in the dict for serialisation; apply markup only in table mode (e.g. via a Rich column formatter or a _table-only display pass).
🟠 Significant — --strict always exits 1, ignoring diagnosis.exit_code (helpers.py)
degraded_diagnoses can return a Fault.UNKNOWN diagnosis (for partial/degraded status), which has exit code 3. surface_warnings hard-codes raise typer.Exit(1) regardless. A CI consumer branching on exit code 3 would not see it from a degraded-status warning under --strict.
Fix:
worst = max(d.exit_code for d in diagnoses)
raise typer.Exit(worst)🟠 Significant — exit code changed from always-1 to 1/2/3 (documented but behavioral breaking change)
The backwards-compat policy is silent on exit codes, and the PR documents the new scheme clearly. However, teams checking if [ $? -eq 1 ] to detect any error will silently miss platform errors (now exit 2) and unknown failures (now exit 3). Worth calling out explicitly in changelog/release notes as a known breaking change for scripts that check specific exit codes.
🟡 Minor — degraded_diagnoses mutates Diagnosis after construction (errors.py)
diag = diagnose_task_failure(None, result_dict)
diag.headline = "The operation succeeded, but some components are unhealthy."The dataclass is not frozen so this works, but it is non-obvious. Prefer passing the headline as an argument or having diagnose_task_failure accept an optional override.
🟡 Minor — fallback ZadApiError(0, "Request failed") in _request has no diagnosis (client.py)
raise last_error or ZadApiError(0, "Request failed")last_error is always set to self._http_error(response) (diagnosed), but the bare fallback has no .diagnosis, breaking the invariant that every ZadApiError carries one. Likely unreachable in practice, but worth fixing for consistency.
There was a problem hiding this comment.
Review
One critical backwards-incompatible regression in JSON output, one significant gap in test coverage for it, and a few minor items.
🔴 Critical — Rich markup embedded in data dict corrupts JSON/YAML output
deployment list --output json now emits:
```json
{"status": "[green]Active[/green]", "issues": "[yellow]1 ImagePull[/yellow]", ...}
```
instead of clean strings. The status field change is strictly backwards-incompatible: any team doing `zad deployment list -o json | jq '.[].status'` gets markup tags instead of a clean string. The issues column is new so it isn't a compat break, but it's still wrong format for JSON consumers.
The fix is to keep raw values in the data dict and apply Rich markup only inside the formatter's table-rendering path (e.g., wrap the colorization in an if self.fmt == "table" guard, or pass a separate display-values dict to the formatter).
See inline comments on the specific lines.
🟠 Significant — No JSON-mode test for deployment list
test_list_shows_issues_column only invokes without --output json, so the markup leakage goes undetected in CI. Given the PR's explicit CI/CD framing (structured JSON, pipe-safe stdout), a test like CliRunner().invoke(app, ["deployment", "list", "--output", "json"]) + json.loads(result.output)[0]["status"] == "Degraded" is the natural companion.
🟠 Significant — surface_warnings not wired into component / service mutating commands
CLAUDE.md (in this very PR) states: "After any mutating op, call `surface_warnings`." Commands that modify a deployment's component set (component add, component assign) or attach services (service add) could surface a degraded component after the fact, but they don't call `surface_warnings`. Delete commands are probably safe to skip (no component health in the result), but add/assign operations seem like gaps.
🟡 Minor — Timeout uses Fault.UNKNOWN (exit 3) but the message says "may still be running"
```python
headline=f"Timed out after {self.task_timeout}s waiting for the task; it may still be running.",
next_steps=["This is a wait limit, not a failure. Check zad task status <id>."],
```
The headline describes a wait limit (transient, the task may succeed), which maps naturally to Fault.PLATFORM / exit 2 ("safe to retry"). UNKNOWN / exit 3 means "the API gave no signal to attribute the failure" — that's not the case here. A CI/CD pipeline should know it can retry a timeout, not just "check the logs."
🟡 Minor — degraded_diagnoses mutates the returned Diagnosis in place
```python
diag = diagnose_task_failure(None, result_dict)
diag.headline = "The operation succeeded, but some components are unhealthy."
```
Works fine (unfrozen dataclass), but constructing with the correct headline directly avoids the surprise. Pass headline= to diagnose_task_failure or construct the Diagnosis directly.
The CLI collapsed every failure into a bare `HTTP 500` / `Task failed`
string, making user-app and user-input problems look like the ZAD
platform was broken. The backend already attributes failures accurately
(ErrorCategory, ComponentFailureInfo with logs, HTTPValidationError,
error_type) — this surfaces that data honestly.
Add a diagnosis layer (api/errors.py): every failure maps to a Fault
(UserInput / UserApp / UserConfig / Auth / Platform / Network / Unknown)
with a neutral source label ("Source: your application (cluster
runtime)"), the concrete message, the backend's own explanation, and a
next step. Never claims more certainty than the data supports — no
category means Unknown + "see logs", not a guess.
- Parse FastAPI 422 validation arrays into readable field paths.
- Dig into failed-task component_failures (with log tails) and categories.
- Surface degraded-but-successful state (warnings, crash-looping
components) instead of a green success line.
- Show an Issues indicator in `deployment list` and `project status`.
CI/CD:
- `--output json` emits a structured diagnosis; branch on `.fault`.
- Fault-mapped exit codes: 1 = your fault, 2 = platform/transient.
- `--strict` fails the build when an op succeeds but reports warnings.
- Diagnostics to stderr, data to stdout; ascii-safe glyphs in non-TTY.
Spec coupling: CATEGORY_FAULT/CATEGORY_HINT are keyed by ErrorCategory,
and test_spec_conformance.py asserts the enum matches the vendored
OpenAPI spec and that every category is mapped — drift fails CI, so the
daily api-sync workflow turns it into an actionable PR.
Drop the defensive "not your request" / "not you" asides from error headlines and docs. The neutral Source: line already says where to look, so headlines lead with what's wrong and what to do — not with deflecting blame. Same behaviour, friendlier and more useful tone.
- README: json-diagnose gaat naar stdout, dus > out.json i.p.v. 2>err.json - UNKNOWN krijgt eigen exit code 3 (niet 1 'jouw fout', niet 2 'veilig retryen') zodat de exit code de honesty-lijn volgt; README- en CLAUDE-tabel bijgewerkt - spec-conformance: ontbrekend specbestand faalt hard i.p.v. stil te skippen - verwijder em-dashes uit gebruikersgerichte foutmeldingen en docstrings
826e37e to
189e58f
Compare
There was a problem hiding this comment.
The diagnosis layer (api/errors.py), the spec-conformance tests, and the --strict / exit-code design are all solid. One backwards-incompatible regression needs fixing before merge; one related design issue should be fixed in the same pass.
🔴 Critical — deployment list --output json now emits Rich markup in status
See inline comment on commands/deployment.py. On main, status is a plain string ("Active"). After this PR it is "[green]Active[/green]" in the serialised JSON, breaking any existing JSON consumer. The fix is a one-liner: keep the raw value in the rows dict and apply markup only inside the table renderer (the same way describe does it elsewhere in the codebase).
🟠 Significant — issues column also embeds Rich markup in JSON
See inline comment on helpers.py:issues_cell. Because this is a new column the strict backwards-compat bar doesn't apply, but it has the same design flaw: "[yellow]1 ImagePull[/yellow]" is not machine-readable. Fix alongside the status change — return a plain string and colour it only in the table path.
🟡 Minor — --strict exit code is hardcoded to 1
See inline comment on helpers.py:surface_warnings. Inconsistent with the documented exit code contract: a PLATFORM-fault warning (retryable) would exit 1 instead of 2 under --strict. Easy fix: use max(d.exit_code for d in diagnoses).
Everything else checks out:
- No CLI commands, options, or positional arguments removed ✓
- No public
ZadClientmethod signatures broken — newdiagnosisparams are keyword-only withNonedefaults ✓ @handle_api_errorsunchanged on existing commands ✓--dry-run/--yesconventions unchanged ✓- Data → stdout, diagnostics → stderr separation is correct ✓
- Spec-conformance test path (
api/upstream-openapi.json) matches the actual file location ✓ - 118 tests covering the new paths, including conformance and unit tests for every fault type ✓
Onder --strict gaf een degraded-but-successful warning altijd exit 1, ook voor retryable PLATFORM/NETWORK-faults (contract: exit 2) en UNKNOWN (exit 3). Gebruik nu max(d.exit_code) over de diagnoses zodat de exit code het gedocumenteerde contract volgt.
|
Bedankt voor de review. Eén punt verwerkt, twee weerlegd met bewijs: 🟡 🔴 if formatter.fmt in ("json", "yaml"):
formatter.render(deployments) # ruwe data, geen markup
returnEmpirisch geverifieerd: 🟠 Tests groen (128), lint en format schoon. Branch is up-to-date met main. |
There was a problem hiding this comment.
All issues from the previous round are resolved. Here is what I checked.
Previous findings — status
🔴 Critical (markup in JSON status) — Not actually a bug. The if formatter.fmt in ("json", "yaml"): formatter.render(deployments); return early exit already existed on main before this PR. Table-mode rows with Rich markup are never reached in JSON/YAML mode. The issues_cell() markup is likewise only used in the table path.
🟠 Significant (markup in JSON issues) — Same conclusion: issues is table-only; formatter.render(deployments) returns the raw API response in JSON mode, so neither status nor issues markup appears in JSON output.
🟡 Minor (--strict exit code hardcoded to 1) — Fixed in 8ffb8f4: surface_warnings now uses max(d.exit_code for d in diagnoses), honoring the per-fault contract. Covered by test_create_strict_exit_code_follows_fault_contract.
Backwards compatibility
- No CLI commands, options, or positional arguments removed. ✓
- No public
ZadClientmethods removed. ✓ - New
diagnosisparams on all three exception constructors are keyword-only withNonedefaults — no required positional args reduced.test_backwards_compat.pybaseline passes unchanged. ✓ --strictis a new additive global flag. ✓
Design principles
@handle_api_errorsunchanged on all existing commands. ✓--dry-run/--yesconventions unchanged. ✓- Data → stdout, diagnostics → stderr (and diagnosis JSON → stdout for failures, where there is no other stdout payload). ✓
surface_warningsfollows the template: called afterrender_success, exits non-zero under--strictviatyper.Exit. ✓--strictregistered in_GlobalOptionsGroup._FLAGSso it is hoisted correctly. ✓
API client patterns
_http_erroris a private static helper — clean, no public API surface added. ✓diagnosisparam added toZadApiError,TaskFailedError,TaskTimeoutError— all additive. ✓- New Pydantic models (
ComponentFailureInfo,SubtaskStatus,ProcessingStatus) usemodel_validatewithextra = 'ignore'semantics via Pydantic defaults — loose coupling preserved. ✓
Test coverage
118 tests (was 91). New test_errors.py and test_spec_conformance.py cover the diagnosis layer and spec drift detection. test_client.py exercises diagnosed exceptions end-to-end. test_output.py verifies JSON → stdout and warnings → stderr separation. ✓
One minor gap worth noting
🟡 There is no test for deployment list --output json in tests/commands/test_deployment.py. The JSON early-return path is correct by inspection, but a one-liner smoke test (assert "[green]" not in json.loads(...)["status"]) would make any future regression visible without reading the code. Not blocking.
Security
No hardcoded credentials, no shell injection paths. ✓
What this does
Makes the CLI's errors genuinely helpful: every failure says what's wrong, where to look, and what to do next — instead of a bare
HTTP 500/Task failed.The upstream API already provides the signal for this (
ErrorCategory,ComponentFailureInfowith log tails,HTTPValidationError,error_type) — the CLI just wasn't using it. A new diagnosis layer (api/errors.py) turns those signals into a clear message with a neutral source label (so you know where to look, without finger-pointing) and an actionable next step.Clearer messages everywhere
components.0.image: field required) instead of a Python-dict dump.⚠with the fix, instead of a green success line.deployment listandproject status, so degraded deployments aren't silently green.When the API doesn't give enough signal, the message says so and points at the logs — it never guesses.
CI/CD friendly
--output jsonemits a structured diagnosis on stdout; branch on.fault(UserInput/UserApp/UserConfig/Auth/Platform/Network/Unknown).0success ·1something to fix on your side ·2platform/transient (safe to retry).--strictturns degraded-but-successful into a non-zero exit, so pipelines fail on an unhealthy app.Spec-driven, low-maintenance
CATEGORY_FAULT/CATEGORY_HINTare keyed byErrorCategory, andtests/test_spec_conformance.pyasserts the enum matches the vendoredapi/upstream-openapi.jsonand that every category is mapped. If the API adds a category, the daily api-sync workflow surfaces it as a failing test → an actionable PR. Runtime parsing degrades gracefully on anything new (loose coupling).Tests
118 pass (was 91): new
test_errors.pyandtest_spec_conformance.py, plus extensions totest_client.py,test_output.py, andtest_deployment.py.ruff check/formatclean.