33 changes: 33 additions & 0 deletions CLAUDE.md
@@ -153,8 +153,41 @@ def verb(
    result = client.method(project, ...)
    formatter.render(result)
    formatter.render_success(f"Target '{target}' verbed.")
    surface_warnings(ctx, formatter, result)  # mutating ops: flag degraded-but-successful state
```

### Error reporting (the diagnosis layer)

Errors must give **clarity and a next step**: say what's wrong, point neutrally at
where to look, and suggest the fix. The machinery lives in `api/errors.py`:

- **`Fault`** (StrEnum): `USER_INPUT`, `USER_APP`, `USER_CONFIG`, `AUTH`, `PLATFORM`,
`NETWORK`, `UNKNOWN`. Drives a neutral source label (`FAULT_SOURCE`), color
(`FAULT_COLOR`), and CI/CD exit code (`FAULT_EXIT_CODE`: 1 = your fault, 2 =
platform/transient, 3 = unknown/unattributable).
- **`Diagnosis`** (dataclass): `fault`, `headline`, `summary`, `details`, `next_steps`,
`status_code`. `to_dict()` is the json contract (CI branches on `fault`).
- **`diagnose_http_error`** parses status codes and FastAPI `422` validation arrays;
**`diagnose_task_failure`** digs into `processing.component_failures`, `error_type`,
and `ErrorCategory`; **`degraded_diagnoses`** catches "succeeded but unhealthy".
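
A minimal sketch of how these pieces could fit together, so the exit-code mapping is concrete. The enum values follow the names the README prints for `.fault`; the field types, the defaults, and the use of `str` + `Enum` in place of `StrEnum` are assumptions for illustration, not the contents of `api/errors.py`.

```python
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import List, Optional


class Fault(str, Enum):
    # The source uses StrEnum (Python 3.11+); str + Enum is used here for portability.
    USER_INPUT = "UserInput"
    USER_APP = "UserApp"
    USER_CONFIG = "UserConfig"
    AUTH = "Auth"
    PLATFORM = "Platform"
    NETWORK = "Network"
    UNKNOWN = "Unknown"


# 1 = caller's side, 2 = platform/transient, 3 = unknown/unattributable.
FAULT_EXIT_CODE = {
    Fault.USER_INPUT: 1,
    Fault.USER_APP: 1,
    Fault.USER_CONFIG: 1,
    Fault.AUTH: 1,
    Fault.PLATFORM: 2,
    Fault.NETWORK: 2,
    Fault.UNKNOWN: 3,
}


@dataclass
class Diagnosis:
    fault: Fault
    headline: str
    summary: str = ""  # assumed default
    details: List[str] = field(default_factory=list)  # assumed type
    next_steps: List[str] = field(default_factory=list)
    status_code: Optional[int] = None

    @property
    def exit_code(self) -> int:
        return FAULT_EXIT_CODE[self.fault]

    def to_dict(self) -> dict:
        # Emit the fault as its plain string value so CI can branch on it.
        data = asdict(self)
        data["fault"] = self.fault.value
        return data
```

With this shape, `Diagnosis(fault=Fault.PLATFORM, headline="...").exit_code` is `2`, and `to_dict()["fault"]` is the string `"Platform"`.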

Rules for new code:
- The client raises `ZadApiError` / `TaskFailedError` / `TaskTimeoutError` with a
`.diagnosis` attached (build it at the raise site via the `diagnose_*` helpers or
`_http_error`). `handle_api_errors` renders it and exits with `diagnosis.exit_code`.
- Render failures with `formatter.render_diagnosis(d)`, degraded-success with
`formatter.render_warnings(diags)`. Diagnostics go to **stderr**; json error objects
go to stdout. Never hardcode an error string where a `Diagnosis` belongs.
- After any mutating op, call `surface_warnings(ctx, formatter, result)` so warnings /
unhealthy components are surfaced (and `--strict` can fail CI).
- **Honesty:** when the API gives no category, the fault is `UNKNOWN` and we point at
the logs (exit code 3); don't guess whose fault it is.
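
The first rule can be sketched as a small decorator. This is an illustration under stated assumptions: the real `handle_api_errors` may not be a decorator, and the `Diagnosis` and `ZadApiError` classes below are trimmed stand-ins, not the project's definitions.

```python
import functools
import sys
from dataclasses import dataclass, field


@dataclass
class Diagnosis:  # trimmed stand-in for the Diagnosis in api/errors.py
    headline: str
    exit_code: int
    next_steps: list = field(default_factory=list)


class ZadApiError(Exception):  # trimmed stand-in for the client exception
    def __init__(self, message, diagnosis=None):
        super().__init__(message)
        self.diagnosis = diagnosis


def handle_api_errors(func):
    """Render the attached diagnosis on stderr and exit with its code."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ZadApiError as exc:
            diag = exc.diagnosis
            if diag is None:
                # Nothing attached: do not guess a fault, report unknown (exit 3).
                print(f"Error: {exc}", file=sys.stderr)
                raise SystemExit(3) from exc
            print(f"Error: {diag.headline}", file=sys.stderr)
            for step in diag.next_steps:
                print(f"  - {step}", file=sys.stderr)
            raise SystemExit(diag.exit_code) from exc

    return wrapper
```

Keeping the rendering in one wrapper means command functions only raise; none of them needs its own error string or exit code.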

**Spec coupling:** `CATEGORY_FAULT` / `CATEGORY_HINT` are keyed by `ErrorCategory`, and
`tests/test_spec_conformance.py` asserts the enum matches `api/upstream-openapi.json` and
that every category is mapped. When the api-sync workflow surfaces a new `ErrorCategory`,
add it to `models.ErrorCategory` **and** both maps; the conformance test tells you.
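
A sketch of the kind of check the conformance test performs. The category names, the map values, and the location of the enum inside the OpenAPI document (`components.schemas.ErrorCategory.enum`) are all hypothetical; only the idea of comparing the spec, the enum, and both maps comes from the text above.

```python
from enum import Enum


class ErrorCategory(str, Enum):
    # Hypothetical members; the real list comes from the upstream spec.
    IMAGE_PULL_FAILED = "image_pull_failed"
    QUOTA_EXCEEDED = "quota_exceeded"


# Both maps must cover every category (values here are illustrative).
CATEGORY_FAULT = {
    ErrorCategory.IMAGE_PULL_FAILED: "UserApp",
    ErrorCategory.QUOTA_EXCEEDED: "UserConfig",
}
CATEGORY_HINT = {
    ErrorCategory.IMAGE_PULL_FAILED: "Check the image name, tag, and registry credentials.",
    ErrorCategory.QUOTA_EXCEEDED: "Lower the requested resources or raise the quota.",
}


def conformance_problems(spec: dict) -> list:
    """Return readable mismatches between the spec, the enum, and the maps."""
    # Assumed location of the enum inside the OpenAPI document.
    spec_values = set(spec["components"]["schemas"]["ErrorCategory"]["enum"])
    enum_values = {category.value for category in ErrorCategory}
    problems = []
    for value in sorted(spec_values - enum_values):
        problems.append(f"spec category {value!r} is missing from ErrorCategory")
    for value in sorted(enum_values - spec_values):
        problems.append(f"ErrorCategory {value!r} is not in the spec")
    for name, mapping in (("CATEGORY_FAULT", CATEGORY_FAULT), ("CATEGORY_HINT", CATEGORY_HINT)):
        for category in ErrorCategory:
            if category not in mapping:
                problems.append(f"{category.value!r} is missing from {name}")
    return problems
```

A test would assert that the returned list is empty, so a category added upstream fails loudly with the name of the missing piece.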

### Client method conventions

- One public method per API endpoint on `ZadClient`
30 changes: 30 additions & 0 deletions README.md
@@ -52,6 +52,7 @@ zad backup create production
| API URL | `--api-url` | `ZAD_API_URL` | `api_url` | production URL |
| Output | `-o` | `ZAD_OUTPUT_FORMAT` | - | `table` |
| No wait | `--no-wait` | - | - | wait |
| Strict | `--strict` | - | - | off |

Precedence: **flags > env vars / `.env` > config file > defaults**

@@ -71,6 +72,35 @@ Every command supports `--output` / `-o`: `table` (default), `json`, `yaml`.
zad metrics overview --output json | jq '.cpu_usage'
```

## Errors & exit codes

Errors tell you **what's wrong and what to do next**, with a neutral label for where
to look (your request, your application, your configuration, your credentials, or the
ZAD platform) instead of a bare HTTP code. A failed image pull points you straight at
the image and registry (`Source: your application (cluster runtime)`) with the fix.

Each error carries a structured diagnosis. In `--output json` it's a single object
on stdout you can branch on in CI/CD:

```bash
zad deployment create app -c web=img:tag -o json > out.json || jq -r .fault out.json
# UserInput | UserApp | UserConfig | Auth | Platform | Network | Unknown
```

Exit codes:

| Code | Meaning |
|------|---------|
| `0` | success |
| `1` | your fault, fix it (bad input, app/config failure, auth) |
| `2` | platform/network, transient and safe to retry |
| `3` | unknown, the API gave no signal to attribute the failure (check the logs) |

`--strict` makes a command that *succeeds but reports warnings* (e.g. the deploy
applied but a component is crash-looping) exit non-zero, so a pipeline fails the
build instead of going green on an unhealthy app. Diagnostics go to **stderr**;
data (and the json error object) go to **stdout**, so pipes stay clean.
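
For pipelines driven from Python rather than shell, the exit-code table can be turned into a small decision helper. This is a sketch, not part of the CLI: the action names and the `next_action` / `run` helpers are hypothetical, and any exit code outside the table (including whatever `--strict` returns for warnings) is treated as "investigate" here by assumption.

```python
import json
import subprocess

# Exit code -> what the pipeline should do next (action names are illustrative).
ACTIONS = {0: "ok", 1: "fix", 2: "retry", 3: "investigate"}


def next_action(returncode: int, stdout: str) -> tuple:
    """Map an exit code and captured stdout to (action, fault)."""
    action = ACTIONS.get(returncode, "investigate")
    fault = None
    if returncode != 0:
        try:
            payload = json.loads(stdout)
        except json.JSONDecodeError:
            payload = None  # stdout held no json error object
        if isinstance(payload, dict):
            fault = payload.get("fault")
    return action, fault


def run(argv: list) -> tuple:
    """Run a zad command with `-o json` and classify the outcome."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return next_action(proc.returncode, proc.stdout)
```

For example, `run(["zad", "deployment", "create", "app", "-c", "web=img:tag", "-o", "json"])` returns a pair such as `("retry", "Platform")` when the command exits with code 2 and prints a diagnosis whose `fault` is `Platform`.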

## Commands

```
84 changes: 67 additions & 17 deletions src/zad_cli/api/client.py
@@ -10,33 +10,41 @@
 import httpx
 from pydantic import ValidationError

+from zad_cli.api.errors import Diagnosis, Fault, diagnose_http_error, diagnose_task_failure
 from zad_cli.api.models import DeploymentDetail, DeploymentListResponse, TaskStatus


 class ZadApiError(Exception):
-    """Raised when the ZAD API returns an error."""
+    """Raised when the ZAD API returns an error.

-    def __init__(self, status_code: int, message: str, details: dict | None = None):
+    Carries a :class:`~zad_cli.api.errors.Diagnosis` so the CLI can render an
+    honest, source-labelled message instead of a bare ``HTTP <code>``.
+    """
+
+    def __init__(self, status_code: int, message: str, details: dict | None = None, diagnosis: Diagnosis | None = None):
         self.status_code = status_code
         self.message = message
         self.details = details or {}
+        self.diagnosis = diagnosis
         super().__init__(f"HTTP {status_code}: {message}")


 class TaskTimeoutError(Exception):
     """Raised when task polling exceeds the timeout."""

-    def __init__(self, message: str, task_id: str | None = None):
+    def __init__(self, message: str, task_id: str | None = None, diagnosis: Diagnosis | None = None):
         self.task_id = task_id
+        self.diagnosis = diagnosis
         super().__init__(message)


 class TaskFailedError(Exception):
     """Raised when a polled task reports failure."""

-    def __init__(self, message: str, details: dict | None = None):
+    def __init__(self, message: str, details: dict | None = None, diagnosis: Diagnosis | None = None):
         self.message = message
         self.details = details or {}
+        self.diagnosis = diagnosis
         super().__init__(message)


@@ -48,7 +56,20 @@ def _parse_v2_response(model_cls: type, payload: Any) -> dict:
     try:
         return model_cls.model_validate(payload).model_dump(mode="json")
     except ValidationError as e:
-        raise ZadApiError(502, f"Unexpected API response shape for {model_cls.__name__}: {e}") from e
+        raise ZadApiError(
+            502,
+            f"Unexpected API response shape for {model_cls.__name__}: {e}",
+            diagnosis=Diagnosis(
+                fault=Fault.PLATFORM,
+                headline="ZAD returned a response this CLI couldn't read; likely a CLI/API version mismatch.",
+                summary=f"Schema {model_cls.__name__} failed to validate.",
+                next_steps=[
+                    "Retry shortly (exit code 2 = transient).",
+                    "If it persists, the CLI may be out of date; update it or report the mismatch.",
+                ],
+                status_code=502,
+            ),
+        ) from e


 class ZadClient:
@@ -113,22 +134,17 @@ def _request(self, method: str, path: str, **kwargs: Any) -> httpx.Response:
                     time.sleep(delay)
                     delay *= 2
                     continue
-                raise ZadApiError(0, f"Connection failed: {e}") from e
+                raise ZadApiError(0, f"Connection failed: {e}", diagnosis=diagnose_http_error(0, str(e))) from e

             if response.status_code in _RETRYABLE_CODES and attempt < self.max_retries:
                 print(f"HTTP {response.status_code}, retrying in {delay}s...", file=sys.stderr)
                 time.sleep(delay)
                 delay *= 2
-                last_error = ZadApiError(response.status_code, response.text)
+                last_error = self._http_error(response)
                 continue

             if response.status_code >= 400:
-                try:
-                    body = response.json()
-                    message = body.get("message", body.get("detail", response.text))
-                except Exception:
-                    message = response.text
-                raise ZadApiError(response.status_code, message)
+                raise self._http_error(response)

             if self.verbose:
                 print(f"<-- {response.status_code} ({response.elapsed.total_seconds():.2f}s)", file=sys.stderr)
@@ -137,6 +153,21 @@ def _request(self, method: str, path: str, **kwargs: Any) -> httpx.Response:

         raise last_error or ZadApiError(0, "Request failed")

+    @staticmethod
+    def _http_error(response: httpx.Response) -> ZadApiError:
+        """Build a diagnosed ZadApiError from a >=400 response."""
+        try:
+            body: Any = response.json()
+        except Exception:
+            body = response.text
+        if isinstance(body, dict):
+            message = body.get("message") or body.get("detail") or response.text
+        else:
+            message = response.text or str(body)
+        if not isinstance(message, str):
+            message = str(message)
+        return ZadApiError(response.status_code, message, diagnosis=diagnose_http_error(response.status_code, body))
+
     def _async_request(self, method: str, path: str, **kwargs: Any) -> dict:
         """Make a v2 async request. Polls for result unless self.wait is False."""
         response = self._request(method, path, **kwargs)
@@ -183,7 +214,7 @@ def _poll_task(self, poll_url: str) -> dict:
                 continue

             if response.status_code >= 400:
-                raise ZadApiError(response.status_code, data.get("detail", data.get("message", str(data))))
+                raise self._http_error(response)

             status = TaskStatus(**data) if isinstance(data, dict) else TaskStatus(status="unknown")
             task_id = task_id or data.get("task_id")
@@ -195,13 +226,32 @@ def _poll_task(self, poll_url: str) -> dict:
             if status.status == "completed":
                 return status.result or data
             if status.status == "failed":
-                raise TaskFailedError(status.error_message or "Task failed", details=status.result)
+                raise TaskFailedError(
+                    status.error_message or "Task failed",
+                    details=status.result,
+                    diagnosis=diagnose_task_failure(status.error_message, status.result),
+                )
             if status.status == "cancelled":
-                raise TaskFailedError("Task was cancelled")
+                raise TaskFailedError(
+                    "Task was cancelled",
+                    diagnosis=Diagnosis(
+                        fault=Fault.UNKNOWN,
+                        headline="The task was cancelled before it finished.",
+                        next_steps=["Re-run the command, or check `zad task list` for details."],
+                    ),
+                )

             time.sleep(self.task_poll_interval)

-        raise TaskTimeoutError(f"Task did not complete within {self.task_timeout}s", task_id=task_id)
+        raise TaskTimeoutError(
+            f"Task did not complete within {self.task_timeout}s",
+            task_id=task_id,
+            diagnosis=Diagnosis(
+                fault=Fault.UNKNOWN,
+                headline=f"Timed out after {self.task_timeout}s waiting for the task; it may still be running.",
+                next_steps=["This is a wait limit, not a failure. Check `zad task status <id>`."],
+            ),
+        )

     # --- V2 project/deployment operations (async, poll for result) ---
