168 changes: 168 additions & 0 deletions docs/controllers/errors.md
@@ -0,0 +1,168 @@
# Error Handling

Errors in kube originate from multiple layers. Understanding where each error comes from and how to handle it is key to building resilient controllers.

## Error Layers

```mermaid
graph TD
A["Client::send()"] -->|"network / TLS / timeout"| E1["kube::Error::HyperError\nkube::Error::HttpError"]
B["Api::list() / get() / patch()"] -->|"4xx / 5xx"| E2["kube::Error::Api"]
B -->|"deserialization failure"| E3["kube::Error::SerdeError"]
C["watcher()"] -->|"initial LIST failed"| E4["watcher::Error::InitialListFailed"]
C -->|"WATCH connect failed"| E5["watcher::Error::WatchFailed"]
C -->|"server error during WATCH"| E6["watcher::Error::WatchError"]
D["Controller::run()"] -->|"trigger stream"| C
D -->|"user code"| E7["reconciler Error"]

style E1 fill:#ffebee
style E2 fill:#ffebee
style E7 fill:#fff3e0
```

| Layer | Error type | Typical cause |
|-------|-----------|---------------|
| Client | `HyperError`, `HttpError` | Network, TLS, timeout |
| [Api] | `Error::Api` | Kubernetes 4xx/5xx response |
| [Api] | `SerdeError` | JSON deserialization failure |
| [watcher] | `InitialListFailed` | Initial LIST call failed |
| [watcher] | `WatchFailed` | WATCH connection failed |
| [watcher] | `WatchError` | Server error during WATCH (e.g. 410 Gone) |
| [Controller] | reconciler Error | Error from user code |

## Watcher Errors and Backoff

Watcher errors are **soft errors** — the [watcher] retries on all failures (including 403s, network issues) because external circumstances may improve. They should never be **silently** discarded. See the [troubleshooting page](../troubleshooting.md#watcher-errors) for diagnostic examples.

The critical requirement is attaching a backoff to the watcher stream:

```rust
// ✗ Without backoff, errors cause a tight retry loop
let stream = watcher(api, wc);

// ✓ Exponential backoff with automatic retry
let stream = watcher(api, wc).default_backoff();
```

### default_backoff

Applies an `ExponentialBackoff`: 800ms → 1.6s → 3.2s → ... → 30s (max). The backoff resets whenever a successful event is received.
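
The schedule above can be reproduced with plain arithmetic. The following is a minimal sketch, not kube's implementation: `backoff_schedule` is a hypothetical helper that doubles the delay each step and caps it, ignoring any jitter a real backoff may add.

```rust
use std::time::Duration;

/// Hypothetical helper: the first `steps` delays of a doubling backoff
/// that starts at `min` and never exceeds `max`.
fn backoff_schedule(min: Duration, max: Duration, steps: usize) -> Vec<Duration> {
    let mut delays = Vec::with_capacity(steps);
    let mut current = min;
    for _ in 0..steps {
        // Cap each delay at the configured maximum.
        delays.push(current.min(max));
        // Double for the next attempt; saturate instead of overflowing.
        current = current.saturating_mul(2);
    }
    delays
}
```

Calling `backoff_schedule(Duration::from_millis(800), Duration::from_secs(30), 7)` yields 800ms, 1.6s, 3.2s, 6.4s, 12.8s, 25.6s and then the 30s cap.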

### Custom backoff

```rust
use std::time::Duration;

use backon::ExponentialBuilder;

let stream = watcher(api, wc).backoff(
    ExponentialBuilder::default()
        .with_min_delay(Duration::from_millis(500))
        .with_max_delay(Duration::from_secs(30)),
);
```

## Reconciler Errors

### Defining error types

[Controller::run] requires the reconciler's error type to implement `std::error::Error + Send + 'static`. `anyhow::Error` does not implement `std::error::Error`, so it cannot be used directly. Define a concrete error type with [thiserror]:

```rust
#[derive(Debug, thiserror::Error)]
enum Error {
    #[error("Kubernetes API error: {0}")]
    KubeApi(#[from] kube::Error),

    #[error("Missing spec field: {0}")]
    MissingField(String),

    #[error("External service error: {0}")]
    External(String),
}
```

### error_policy

When the reconciler returns `Err`, the `error_policy` function decides what happens next:

```rust
fn error_policy(obj: Arc<MyResource>, err: &Error, ctx: Arc<Context>) -> Action {
    tracing::error!(?err, "reconcile failed");
    Action::requeue(Duration::from_secs(5))
}
```

You can distinguish transient from permanent errors:

| Type | Examples | Handling |
|------|----------|---------|
| Transient | Network error, timeout, 429 | Requeue via `error_policy` |
| Permanent | Invalid spec, bad config | Record condition on status + `Action::await_change()` |

```rust
fn error_policy(obj: Arc<MyResource>, err: &Error, ctx: Arc<Context>) -> Action {
    match err {
        // Transient: retry
        Error::KubeApi(_) | Error::External(_) => {
            Action::requeue(Duration::from_secs(5))
        }
        // Permanent: don't retry until the object changes
        Error::MissingField(_) => Action::await_change(),
    }
}
```

!!! note "Current limitations"

    `error_policy` is a **synchronous** function. You cannot perform async operations (sending metrics, updating status) inside it. For per-key exponential backoff, wrap the reconciler itself with a middleware that tracks per-object retry state.

Comment on lines +114 to +117
Member

yeah, this is a great callout.

Member

actually looking at this some more, there's no such pattern described in the reconciler documentation.
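
One way the per-object retry state mentioned in the note could be tracked is sketched below. This is an assumption-laden illustration, not an API that kube provides: `RetryTracker`, its constants, and the string key (for example `namespace/name`) are all hypothetical.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical helper (not part of kube): counts consecutive failures per
/// object key and derives an exponential delay from that count.
#[derive(Default)]
struct RetryTracker {
    failures: HashMap<String, u32>,
}

impl RetryTracker {
    const BASE: Duration = Duration::from_secs(1);
    const MAX: Duration = Duration::from_secs(300);

    /// Record a failure for `key` and return the delay before the next attempt.
    fn on_failure(&mut self, key: &str) -> Duration {
        let count = self.failures.entry(key.to_string()).or_insert(0);
        *count += 1;
        // 1s, 2s, 4s, ... capped at MAX; the exponent is clamped to avoid overflow.
        let exponent = (*count - 1).min(16);
        (Self::BASE * 2u32.pow(exponent)).min(Self::MAX)
    }

    /// Forget the failure history for `key` after a successful reconcile.
    fn on_success(&mut self, key: &str) {
        self.failures.remove(key);
    }
}
```

In a controller, such a tracker could live behind a `std::sync::Mutex` in the shared context. A wrapper around the reconciler would call `on_success` or `on_failure` depending on the result, and the returned delay could be passed to `Action::requeue`.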

## Client-level Retry

By default, kube-client does not retry regular API calls. If a `create()`, `patch()`, or `get()` fails, the error is returned as-is.

Since version 3, kube provides a built-in [`RetryPolicy`](https://docs.rs/kube/latest/kube/client/retry/struct.RetryPolicy.html) that implements [tower]'s retry middleware. It retries on 429, 503, and 504 with exponential backoff:

```rust
use kube::client::retry::RetryPolicy;
use tower::{ServiceBuilder, retry::RetryLayer, buffer::BufferLayer};

let service = ServiceBuilder::new()
    .layer(config.base_uri_layer())
    .option_layer(config.auth_layer()?)
    .layer(BufferLayer::new(1024))
    .layer(RetryLayer::new(RetryPolicy::default()))
    // ...
```

`RetryPolicy` retries only **429**, **503**, and **504** responses. Network errors and other 5xx codes are returned to the caller without a retry.

For broader retry guidance when designing your own error handling:

| Error | Retryable | Where to handle |
|-------|-----------|-----------------|
| 429, 503, 504 | Yes | `RetryPolicy` handles automatically |
| Other 5xx | Depends | `error_policy` or custom Tower middleware |
| Timeout / Network | Yes | `error_policy` requeue, or watcher backoff |
| 4xx (400, 403, 404) | No | Fix the request or RBAC |
| 409 Conflict | No | SSA ownership conflict — fix field managers |
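
The status-code rows of the table can be expressed as a small classification function. This is a sketch of the guidance above, not code from kube: `RetryAdvice` and `advise` are hypothetical names, and timeouts or network failures (which carry no status code) are out of scope.

```rust
/// Hypothetical summary of the table's advice for an HTTP status code.
#[derive(Debug, PartialEq)]
enum RetryAdvice {
    /// 429, 503, 504: covered by `RetryPolicy` when it is installed.
    HandledByRetryPolicy,
    /// Other 5xx: decide in `error_policy` or in custom middleware.
    Depends,
    /// Remaining 4xx, including 409: fix the request, RBAC, or field managers.
    DoNotRetry,
}

/// Returns `None` for status codes that are not errors.
fn advise(status: u16) -> Option<RetryAdvice> {
    match status {
        // Listed first so that 429 is not swallowed by the generic 4xx arm.
        429 | 503 | 504 => Some(RetryAdvice::HandledByRetryPolicy),
        500..=599 => Some(RetryAdvice::Depends),
        400..=499 => Some(RetryAdvice::DoNotRetry),
        _ => None,
    }
}
```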

## Timeout Strategy

If you need to guard against slow API calls in your reconciler, you can wrap individual calls with `tokio::time::timeout`:

```rust
// First ? unwraps the timeout Result<T, Elapsed>
// Second ? unwraps the API Result<Pod, kube::Error>
let pod = tokio::time::timeout(
    Duration::from_secs(10),
    api.get("my-pod"),
).await??;
```

Member

double questionmark

In a [Controller] context, stream timeouts rely internally on watcher timeouts and can be configured via stream backoff parameters and [watcher::Config]. Only individual API calls inside your reconciler typically need shorter timeouts.
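
The `??` in the timeout example only compiles when the reconciler's error type can be built from both the timeout error and the API error. A minimal, dependency-free sketch of that requirement follows. `Elapsed` and `ApiError` are hypothetical stand-ins for `tokio::time::error::Elapsed` and `kube::Error`, and `flatten_timeout` is an illustrative name.

```rust
/// Stand-in for `tokio::time::error::Elapsed` (assumption for this sketch).
#[derive(Debug, PartialEq)]
struct Elapsed;

/// Stand-in for `kube::Error`, carrying only a status code (assumption).
#[derive(Debug, PartialEq)]
struct ApiError(u16);

#[derive(Debug, PartialEq)]
enum Error {
    Timeout,
    Api(u16),
}

impl From<Elapsed> for Error {
    fn from(_: Elapsed) -> Self {
        Error::Timeout
    }
}

impl From<ApiError> for Error {
    fn from(e: ApiError) -> Self {
        Error::Api(e.0)
    }
}

/// The outer `?` converts `Elapsed`, the inner `?` converts `ApiError`.
fn flatten_timeout(outcome: Result<Result<String, ApiError>, Elapsed>) -> Result<String, Error> {
    Ok(outcome??)
}
```

With the `thiserror` enum shown earlier, the same effect would need a variant that converts from the timeout error in addition to the existing `KubeApi` variant.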

--8<-- "includes/abbreviations.md"
--8<-- "includes/links.md"

[//begin]: # "Autogenerated link references for markdown compatibility"
[reconciler]: reconciler "The Reconciler"
[//end]: # "Autogenerated link references"
181 changes: 181 additions & 0 deletions docs/controllers/ssa.md
@@ -0,0 +1,181 @@
# Server-Side Apply

[Server-Side Apply] is a Kubernetes patch strategy based on field ownership. It allows multiple controllers to safely modify the same resource by tracking which controller owns which fields.

This page covers practical patterns, common pitfalls, and status patching with SSA in kube.

!!! note "SSA and Reconciler Idempotency"

    SSA naturally fits the [[reconciler]]'s idempotent pattern: you declare "these fields should have these values", and the server handles the rest. See [[reconciler#in-depth-solution]] for how SSA simplifies reconciler logic.

## Why SSA

The traditional patch strategies each have limitations:

| Strategy | Limitation |
|----------|-----------|
| Merge patch | Overwrites entire arrays. Field deletion is not explicit |
| Strategic merge patch | Only supported for built-in Kubernetes types. Not available for custom resources |
| JSON patch | Requires exact paths. Susceptible to race conditions |

SSA addresses these:

- **Field ownership**: the server records "this controller owns this field"
- **Conflict detection**: touching another owner's field produces a `409 Conflict`
- **Declarative**: you declare which fields should have which values; everything else is left untouched

## Basic Pattern

```rust
use kube::api::{Patch, PatchParams};

let patch = Patch::Apply(serde_json::json!({
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": { "name": "my-cm" },
    "data": { "key": "value" }
}));
let pp = PatchParams::apply("my-controller"); // field manager name
api.patch("my-cm", &pp, &patch).await?;
```

The `"my-controller"` string in `PatchParams::apply` is the **field manager** name. Ownership is tracked under this name. Applying again with the same field manager updates owned fields; fields owned by other managers are left alone.
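
To make the ownership rules concrete, here is a deliberately simplified, dependency-free model of how field managers and conflicts interact. It is a sketch for intuition only: `Object` and `Conflict` are hypothetical types, fields are flat string paths, and real SSA has further rules (for example around fields a manager stops sending) that are not modelled.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, flattened view of an object: field path -> value and owners.
#[derive(Default)]
struct Object {
    values: BTreeMap<String, String>,
    owners: BTreeMap<String, BTreeSet<String>>,
}

#[derive(Debug, PartialEq)]
struct Conflict {
    field: String,
    owner: String,
}

impl Object {
    /// Apply `fields` as `manager`. Changing a value owned by someone else
    /// is a conflict unless `force` is set.
    fn apply(&mut self, manager: &str, fields: &[(&str, &str)], force: bool) -> Result<(), Conflict> {
        // First pass: detect conflicts before mutating anything.
        for (field, value) in fields {
            let changed = self.values.get(*field).map_or(false, |v| v.as_str() != *value);
            let other = self
                .owners
                .get(*field)
                .and_then(|o| o.iter().find(|m| m.as_str() != manager));
            if let (true, false, Some(owner)) = (changed, force, other) {
                return Err(Conflict { field: field.to_string(), owner: owner.clone() });
            }
        }
        // Second pass: write values and record ownership.
        for (field, value) in fields {
            let changed = self.values.get(*field).map_or(false, |v| v.as_str() != *value);
            let owners = self.owners.entry(field.to_string()).or_default();
            if changed {
                // A changed value leaves the applier as the sole owner.
                owners.clear();
            }
            owners.insert(manager.to_string());
            self.values.insert(field.to_string(), value.to_string());
        }
        Ok(())
    }
}
```

In this model, a second manager applying the same value becomes a co-owner, applying a different value fails with a `Conflict`, and applying with `force` takes the field over.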

## Common Pitfalls

### Missing apiVersion and kind

```rust
// ✗ 400 Bad Request
let patch = Patch::Apply(serde_json::json!({
    "data": { "key": "value" }
}));

// ✓ apiVersion and kind are required
let patch = Patch::Apply(serde_json::json!({
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": { "name": "my-cm" },
    "data": { "key": "value" }
}));
```

Unlike merge patch, SSA requires `apiVersion` and `kind` in every request.

### Missing field manager

```rust
// ✗ field_manager is None → API server rejects the request
let pp = PatchParams::default();

// ✓ Explicit field manager
let pp = PatchParams::apply("my-controller");
```
Comment on lines +67 to +73
Member

field managers are required for serverside apply so using PatchParams::default with apply should probably be validated as an error in PatchParams rather than documented here as an eternal footgun.

Member Author

Agreed — this should be a client-side validation rather than a doc-only warning. PatchParams::validate() already rejects force with non-Apply patches, but doesn't check field_manager: None with Patch::Apply. I'll open an issue on kube-rs/kube for adding this check.


A field manager is **required** for SSA. When `field_manager` is `None` (the default), the API server returns an error. Always use `PatchParams::apply("my-controller")` for SSA operations.
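
Until such a check exists in the library, a controller could guard its own calls. The sketch below shows the shape of that validation using stand-in types. `ApplyParams`, `PatchKind`, `ValidationError`, and `validate` are hypothetical and are not kube's `PatchParams` API.

```rust
/// Hypothetical stand-in for the parameters of a patch request.
#[derive(Debug, Default)]
struct ApplyParams {
    field_manager: Option<String>,
    force: bool,
}

/// Hypothetical stand-in for the patch strategy being sent.
#[derive(Debug, PartialEq)]
enum PatchKind {
    Apply,
    Merge,
}

#[derive(Debug, PartialEq)]
enum ValidationError {
    MissingFieldManager,
    ForceWithoutApply,
}

/// Reject combinations the API server would refuse anyway.
fn validate(kind: &PatchKind, params: &ApplyParams) -> Result<(), ValidationError> {
    // `force` only has meaning for server-side apply.
    if params.force && *kind != PatchKind::Apply {
        return Err(ValidationError::ForceWithoutApply);
    }
    // Server-side apply always needs a field manager.
    if *kind == PatchKind::Apply && params.field_manager.is_none() {
        return Err(ValidationError::MissingFieldManager);
    }
    Ok(())
}
```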

### Overusing force

```rust
// Caution: forcibly takes ownership of fields from other managers
let pp = PatchParams::apply("my-controller").force();
```

`force: true` takes ownership of fields from other controllers. Only use this in single-owner situations such as CRD registration.

### Including unnecessary fields

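Serializing an entire Rust struct sends every populated field, including non-optional fields that only hold their `Default` value. SSA assigns ownership of every field in the request to your field manager, so other controllers get conflicts when they try to modify those fields.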

```rust
// ✗ Serializes all Default fields → unnecessary ownership
let full_deployment = Deployment { ..Default::default() };

// ✓ Only include fields you actually manage
let patch = serde_json::json!({
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": { "name": "my-deploy" },
    "spec": {
        "replicas": 3
    }
});
```

Member

part of this problem tangled up in kubernetes objects not being fully optional e.g. some replicas ints. i believe this is what drove them to later make ApplyConfigurations variants where everything is truly optional. rust unfortunately does not have an equivalent of applyconfigurations kube-rs/kube#649 so serializing parts of structs can on some occasions be annoying if you want to use a typed interface for partial SSA.

possibly this is worth a current limitations callout somewhere 🤔

!!! note "Current limitation: no ApplyConfigurations in Rust"

    Go's client-go provides [ApplyConfigurations](https://pkg.go.dev/k8s.io/client-go/applyconfigurations), fully optional builder types designed specifically for SSA. Rust does not have an equivalent yet ([kube#649](https://github.com/kube-rs/kube/issues/649)). Some [k8s-openapi] fields are not fully optional (e.g. certain integer fields like `maxReplicas`), which can make typed partial SSA awkward. Using `serde_json::json!()` for partial patches works around this issue.

## Status Patching

Status is modified through the `/status` subresource:

```rust
let status_patch = serde_json::json!({
    "apiVersion": "example.com/v1",
    "kind": "MyResource",
    "status": {
        "phase": "Ready",
        "conditions": [{
            "type": "Available",
            "status": "True",
            "lastTransitionTime": "2024-01-01T00:00:00Z",
        }]
    }
});
let pp = PatchParams::apply("my-controller");
api.patch_status("name", &pp, &Patch::Apply(status_patch)).await?;
```

!!! warning "Wrap status in the full object structure"

    ```rust
    // ✗ Sending just the status fields will fail
    serde_json::json!({ "phase": "Ready" })

    // ✓ Must include apiVersion, kind, and wrap under "status"
    serde_json::json!({
        "apiVersion": "example.com/v1",
        "kind": "MyResource",
        "status": { "phase": "Ready" }
    })
    ```

    The Kubernetes API expects the full object structure even on the `/status` endpoint.

Member

yeah. good callout.

this is definitely still a footgun. i wish this could be typed better using default None and builders;

let updated = instance.status(MyCrdStatus::default().phase("Ready".into());
client.patch_status<MyCrd>(updated, ssa).await?

and have the patch_status strip the spec for users. but it relies on builders (e.g. k8s-pb, kube-rs/k8s-pb#9) and full optionality like applyconfigurations.


## Typed SSA

Instead of `serde_json::json!()`, you can use Rust types for type safety and IDE autocompletion:

```rust
let cm = ConfigMap {
    metadata: ObjectMeta {
        name: Some("my-cm".into()),
        ..Default::default()
    },
    data: Some(BTreeMap::from([("key".into(), "value".into())])),
    ..Default::default()
};
let pp = PatchParams::apply("my-controller");
api.patch("my-cm", &pp, &Patch::Apply(cm)).await?;
```

[k8s-openapi] types already have `#[serde(skip_serializing_if = "Option::is_none")]` applied, so `None` fields are omitted from serialization. For your own types, you need to add this explicitly:

```rust
#[derive(Serialize)]
struct MyStatus {
    phase: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    message: Option<String>,
}
```

Without `skip_serializing_if`, `None` fields serialize as `null` and SSA takes ownership of them.
Comment thread
clux marked this conversation as resolved.

--8<-- "includes/abbreviations.md"
--8<-- "includes/links.md"

[//begin]: # "Autogenerated link references for markdown compatibility"
[reconciler]: reconciler "The Reconciler"
[//end]: # "Autogenerated link references"