Commit ba34a80

add error handling documentation and common troubleshooting patterns
1 parent 2213b87 commit ba34a80

4 files changed

Lines changed: 515 additions & 1 deletion

docs/controllers/errors.md

Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
1+
# Error Handling
2+
3+
Errors in kube originate from multiple layers. Understanding where each error comes from and how to handle it is key to building resilient controllers.
4+
5+
## Error Layers
6+
7+
```mermaid
graph TD
    A["Client::send()"] -->|"network / TLS / timeout"| E1["kube::Error::HyperError\nkube::Error::HttpError"]
    B["Api::list() / get() / patch()"] -->|"4xx / 5xx"| E2["kube::Error::Api"]
    B -->|"deserialization failure"| E3["kube::Error::SerdeError"]
    C["watcher()"] -->|"initial LIST failed"| E4["watcher::Error::InitialListFailed"]
    C -->|"WATCH failed to start"| E5["watcher::Error::WatchStartFailed"]
    C -->|"WATCH stream failed"| E8["watcher::Error::WatchFailed"]
    C -->|"server error during WATCH"| E6["watcher::Error::WatchError"]
    D["Controller::run()"] -->|"trigger stream"| C
    D -->|"user code"| E7["reconciler Error"]

    style E1 fill:#ffebee
    style E2 fill:#ffebee
    style E7 fill:#fff3e0
```

| Layer | Error type | Typical cause |
|-------|-----------|---------------|
| Client | `HyperError`, `HttpError` | Network, TLS, timeout |
| [Api] | `Error::Api { status }` | Kubernetes 4xx/5xx response |
| [Api] | `SerdeError` | JSON deserialization failure |
| [watcher] | `InitialListFailed` | Initial LIST call failed |
| [watcher] | `WatchStartFailed` | WATCH request could not be started |
| [watcher] | `WatchFailed` | WATCH stream failed after it was established |
| [watcher] | `WatchError` | Server error during WATCH (e.g. 410 Gone) |
| [Controller] | reconciler Error | Error from user code |
32+
33+
## Watcher Errors and Backoff
34+
35+
Watcher errors are **soft errors** — the [watcher] retries on all failures (including 403s, network issues) because external circumstances may improve. They should never be silently discarded. See the [troubleshooting page](../troubleshooting.md#watcher-errors) for diagnostic examples.
36+
37+
The critical requirement is attaching a backoff to the watcher stream:
38+
39+
```rust
use kube::runtime::{watcher, WatchStreamExt};

// ✗ No backoff: after an error the watcher retries immediately,
//   which can flood a struggling API server
let stream = watcher(api, wc);

// ✓ Exponential backoff between retries
let stream = watcher(api, wc).default_backoff();
```
46+
47+
### default_backoff
48+
49+
Applies an `ExponentialBackoff`: 800ms → 1.6s → 3.2s → ... → 30s (max). The backoff resets whenever a successful event is received.
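
To make the schedule concrete, here is a minimal std-only sketch that reproduces the sequence quoted above. The function names `backoff_delay` and `default_schedule` are hypothetical, and the sketch assumes a plain doubling factor with no jitter; it is not kube's implementation.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): doubles from `initial`
/// and is capped at `max`. Jitter is deliberately left out of this sketch.
fn backoff_delay(attempt: u32, initial: Duration, max: Duration) -> Duration {
    // 2^attempt, falling back to u32::MAX so large attempt counts cannot overflow.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    initial.saturating_mul(factor).min(max)
}

/// First `n` delays for the parameters described above (800ms start, 30s cap).
fn default_schedule(n: u32) -> Vec<Duration> {
    (0..n)
        .map(|i| backoff_delay(i, Duration::from_millis(800), Duration::from_secs(30)))
        .collect()
}
```

Under these assumptions the cap is reached on the seventh retry: the sixth delay is 25.6s, and doubling it again would exceed 30s.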
50+
51+
### Custom backoff
52+
53+
```rust
use std::time::Duration;

use backon::ExponentialBuilder;
use kube::runtime::{watcher, WatchStreamExt};

let stream = watcher(api, wc).backoff(
    ExponentialBuilder::default()
        .with_min_delay(Duration::from_millis(500))
        .with_max_delay(Duration::from_secs(30)),
);
```
62+
63+
## Reconciler Errors
64+
65+
### Defining error types
66+
67+
[Controller::run] requires the reconciler's error type to implement `std::error::Error + Send + 'static`. `anyhow::Error` does not implement `std::error::Error`, so it cannot be used directly. Define a concrete error type with [thiserror]:
68+
69+
```rust
#[derive(Debug, thiserror::Error)]
enum Error {
    #[error("Kubernetes API error: {0}")]
    KubeApi(#[from] kube::Error),

    #[error("Missing spec field: {0}")]
    MissingField(String),

    #[error("External service error: {0}")]
    External(String),
}
```
82+
83+
### error_policy
84+
85+
When the reconciler returns `Err`, the `error_policy` function decides what happens next:
86+
87+
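```rust
use std::{sync::Arc, time::Duration};

use kube::runtime::controller::Action;

fn error_policy(_obj: Arc<MyResource>, err: &Error, _ctx: Arc<Context>) -> Action {
    // Log every failure, then retry after a fixed delay.
    tracing::error!(?err, "reconcile failed");
    Action::requeue(Duration::from_secs(5))
}
```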
93+
94+
You can distinguish transient from permanent errors:
95+
96+
| Type | Examples | Handling |
|------|----------|---------|
| Transient | Network error, timeout, 429 | Requeue via `error_policy` |
| Permanent | Invalid spec, bad config | Record condition on status + `Action::await_change()` |
100+
101+
```rust
fn error_policy(_obj: Arc<MyResource>, err: &Error, _ctx: Arc<Context>) -> Action {
    match err {
        // Transient: retry
        Error::KubeApi(_) | Error::External(_) => {
            Action::requeue(Duration::from_secs(5))
        }
        // Permanent: don't retry until the object changes
        Error::MissingField(_) => Action::await_change(),
    }
}
```
113+
114+
!!! note "Current limitations"

    `error_policy` is a **synchronous** function. You cannot perform async operations (sending metrics, updating status) inside it. For per-key exponential backoff, wrap the reconciler itself; see the pattern described in the [[reconciler]] documentation.
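
As a rough illustration of what per-key backoff involves, the std-only sketch below tracks consecutive failures per object key and derives a delay from the count. `FailureTracker`, its method names, and the 1s/5min bounds are all hypothetical choices for this example; it is not necessarily the pattern from the reconciler documentation.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical helper: counts consecutive reconcile failures per object key.
#[derive(Default)]
struct FailureTracker {
    failures: HashMap<String, u32>,
}

impl FailureTracker {
    /// Record a failed reconcile and return the delay before the next attempt.
    fn on_failure(&mut self, key: &str) -> Duration {
        let count = self.failures.entry(key.to_owned()).or_insert(0);
        *count = (*count).saturating_add(1);
        // 1s, 2s, 4s, ... capped at 5 minutes (illustrative numbers).
        // The exponent is clamped so the shift can never overflow.
        let exp = (*count - 1).min(16);
        Duration::from_secs(1u64 << exp).min(Duration::from_secs(300))
    }

    /// Record a successful reconcile, clearing the key's failure history.
    fn on_success(&mut self, key: &str) {
        self.failures.remove(key);
    }
}
```

In a controller, a tracker like this would presumably live in the shared context behind a lock. A wrapper around the reconciler could then call `on_failure` or `on_success` and turn the returned delay into `Action::requeue(delay)`.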
117+
118+
## Client-level Retry
119+
120+
kube-client does not include built-in retry for regular API calls. If a `create()`, `patch()`, or `get()` fails, the error is returned as-is.
121+
122+
For automatic retry, you can use [tower]'s retry middleware. However, not all errors are retryable:
123+
124+
| Error | Retryable | Reason |
|-------|-----------|--------|
| 5xx | Yes | Server-side transient failure |
| Timeout | Yes | Temporary network issue |
| 429 Too Many Requests | Yes | Rate limit: wait, then retry |
| Network error | Yes | Temporary connectivity failure |
| Other 4xx (400, 403, 404) | No | The request itself is wrong |
| 409 Conflict | No | SSA ownership conflict: fix the logic |
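
The table above can be encoded as a small predicate. This is a std-only sketch: the `Failure` enum and `is_retryable` are hypothetical names introduced here, and mapping a real `kube::Error` onto them is left out.

```rust
/// Outcome of a failed request, reduced to what the retry decision needs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Failure {
    /// The API server answered with this HTTP status code.
    Status(u16),
    /// The request timed out before a response arrived.
    Timeout,
    /// The connection itself failed (DNS, TCP, TLS, ...).
    Network,
}

/// Returns true when the table marks the failure as retryable.
fn is_retryable(failure: Failure) -> bool {
    match failure {
        Failure::Timeout | Failure::Network => true,
        // 429 is the one 4xx code worth retrying, after waiting.
        Failure::Status(429) => true,
        // Any other status is retryable only if it is a 5xx.
        Failure::Status(code) => (500..=599).contains(&code),
    }
}
```

A retry middleware would consult a predicate like this before scheduling another attempt, so that 400, 403, 404 and 409 responses are returned to the caller immediately.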
132+
133+
## Timeout Strategy
134+
135+
The default `read_timeout` on [Client] is 295 seconds (matching the Kubernetes server-side watch timeout). This means a regular [Api] call could block for nearly 5 minutes if the server is unresponsive.
136+
137+
### Separate clients
138+
139+
```rust
use std::time::Duration;

use kube::{Client, Config};

// Watcher client (default 295s timeout, needed for watch)
let watcher_client = Client::try_default().await?;

// API call client (short timeout)
let mut config = Config::infer().await?;
config.read_timeout = Some(Duration::from_secs(15));
let api_client = Client::try_from(config)?;
```
148+
149+
### Wrapping individual calls
150+
151+
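```rust
use std::time::Duration;

// The first `?` handles the timeout (`Elapsed`), the second handles `kube::Error`;
// the enclosing function's error type must implement `From` for both.
let pod = tokio::time::timeout(
    Duration::from_secs(10),
    api.get("my-pod"),
).await??;
```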
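
The double `?` in the snippet above only compiles when the surrounding error type can be built from both the timeout error and the API error. The std-only sketch below shows that shape. `Elapsed` and `ApiError` are local stand-ins for `tokio::time::error::Elapsed` and `kube::Error`, and `unwrap_both` is a hypothetical name.

```rust
/// Stand-in for `tokio::time::error::Elapsed` (assumption for this sketch).
#[derive(Debug)]
struct Elapsed;

/// Stand-in for `kube::Error` (assumption for this sketch).
#[derive(Debug)]
struct ApiError(String);

/// Application error that can represent both failure modes.
#[derive(Debug, PartialEq)]
enum Error {
    Timeout,
    Api(String),
}

impl From<Elapsed> for Error {
    fn from(_: Elapsed) -> Self {
        Error::Timeout
    }
}

impl From<ApiError> for Error {
    fn from(e: ApiError) -> Self {
        Error::Api(e.0)
    }
}

/// Mirrors the nested result of a timed-out call: the outer `Result` says
/// whether the timer fired, the inner one whether the call itself succeeded.
fn unwrap_both(res: Result<Result<String, ApiError>, Elapsed>) -> Result<String, Error> {
    // First `?` converts `Elapsed`, second `?` converts `ApiError`.
    Ok(res??)
}
```

With the `thiserror` enum from the reconciler section, the equivalent would be an extra variant carrying `#[from] tokio::time::error::Elapsed`.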
157+
158+
### Controllers
159+
160+
In a [Controller] context, the watcher needs the long timeout. Only the API calls inside your reconciler need shorter timeouts. Wrapping individual reconciler calls with `tokio::time::timeout` is usually sufficient.
161+
162+
--8<-- "includes/abbreviations.md"
163+
--8<-- "includes/links.md"
164+
165+
[//begin]: # "Autogenerated link references for markdown compatibility"
166+
[reconciler]: reconciler "The Reconciler"
167+
[//end]: # "Autogenerated link references"

docs/controllers/ssa.md

Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
1+
# Server-Side Apply
2+
3+
[Server-Side Apply] is a Kubernetes patch strategy based on field ownership. It allows multiple controllers to safely modify the same resource by tracking which controller owns which fields.
4+
5+
This page covers practical patterns, common pitfalls, and status patching with SSA in kube.
6+
7+
!!! note "SSA and Reconciler Idempotency"
8+
9+
SSA naturally fits the [[reconciler]]'s idempotent pattern: you declare "these fields should have these values", and the server handles the rest. See [[reconciler#in-depth-solution]] for how SSA simplifies reconciler logic.
10+
11+
## Why SSA
12+
13+
The traditional patch strategies each have limitations:
14+
15+
| Strategy | Limitation |
|----------|-----------|
| Merge patch | Overwrites entire arrays. Field deletion is not explicit |
| Strategic merge patch | Only works with k8s-openapi types. Incomplete for CRDs |
| JSON patch | Requires exact paths. Susceptible to race conditions |
20+
21+
SSA addresses these:
22+
23+
- **Field ownership**: the server records "this controller owns this field"
24+
- **Conflict detection**: touching another owner's field produces a `409 Conflict`
25+
- **Declarative**: you declare which fields should have which values; everything else is left untouched
26+
27+
## Basic Pattern
28+
29+
```rust
use kube::api::{Patch, PatchParams};

let patch = Patch::Apply(serde_json::json!({
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": { "name": "my-cm" },
    "data": { "key": "value" }
}));
let pp = PatchParams::apply("my-controller"); // field manager name
api.patch("my-cm", &pp, &patch).await?;
```
41+
42+
The `"my-controller"` string in `PatchParams::apply` is the **field manager** name. Ownership is tracked under this name. Applying again with the same field manager updates owned fields; fields owned by other managers are left alone.
43+
44+
## Common Pitfalls
45+
46+
### Missing apiVersion and kind
47+
48+
```rust
// ✗ 400 Bad Request
let patch = Patch::Apply(serde_json::json!({
    "data": { "key": "value" }
}));

// ✓ apiVersion and kind are required
let patch = Patch::Apply(serde_json::json!({
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": { "name": "my-cm" },
    "data": { "key": "value" }
}));
```
62+
63+
Unlike merge patch, SSA requires `apiVersion` and `kind` in every request.
64+
65+
### Missing field manager
66+
67+
```rust
// ✗ No field manager set: the API server rejects apply patches without one
let pp = PatchParams::default();

// ✓ Explicit field manager
let pp = PatchParams::apply("my-controller");
```

Always specify an explicit, stable field manager name that is unique to your controller. The API server requires one for apply patches, and reusing a name that another controller or a kubectl user already applies with makes the server treat both as the same owner, which hides real ownership conflicts.
76+
77+
### Overusing force
78+
79+
```rust
// Caution: forcibly takes ownership of fields from other managers
let pp = PatchParams::apply("my-controller").force();
```
83+
84+
`force: true` takes ownership of fields from other controllers. Only use this in single-owner situations such as CRD registration.
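
The interplay of field managers, conflicts, and `force` can be pictured with a toy model. The std-only sketch below is only an illustration: `Object`, `Conflict`, and `apply` are hypothetical, and the model ignores shared ownership, nested fields, and everything else the real `managedFields` machinery tracks.

```rust
use std::collections::HashMap;

/// Toy object: maps a field path to (owning manager, value).
#[derive(Default)]
struct Object {
    fields: HashMap<String, (String, String)>,
}

/// Returned when an apply would change fields owned by another manager.
#[derive(Debug, PartialEq)]
struct Conflict {
    fields: Vec<String>,
}

impl Object {
    /// Apply `patch` as `manager`. Without `force`, changing a field that
    /// another manager owns is rejected, loosely mirroring a 409 Conflict.
    fn apply(
        &mut self,
        manager: &str,
        patch: &[(&str, &str)],
        force: bool,
    ) -> Result<(), Conflict> {
        // Collect fields owned by someone else whose value would change.
        let mut conflicts: Vec<String> = patch
            .iter()
            .filter(|(path, value)| match self.fields.get(*path) {
                Some((owner, current)) => {
                    owner.as_str() != manager && current.as_str() != *value
                }
                None => false,
            })
            .map(|(path, _)| path.to_string())
            .collect();
        conflicts.sort();
        if !conflicts.is_empty() && !force {
            return Err(Conflict { fields: conflicts });
        }
        // Either there was no conflict or `force` was set: take ownership.
        for (path, value) in patch {
            self.fields
                .insert(path.to_string(), (manager.to_string(), value.to_string()));
        }
        Ok(())
    }
}
```

In this model a second manager can freely set fields nobody owns, is rejected when it changes an owned field, and succeeds with `force` at the cost of displacing the previous owner. That displacement is why `force` is best limited to single-owner resources.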
85+
86+
### Including unnecessary fields
87+
88+
Serializing an entire Rust struct includes `Default` value fields. SSA takes ownership of those fields, causing conflicts when another controller tries to modify them.
89+
90+
```rust
// ✗ Serializes all Default fields → unnecessary ownership
let full_deployment = Deployment { ..Default::default() };

// ✓ Only include fields you actually manage
let patch = serde_json::json!({
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": { "name": "my-deploy" },
    "spec": {
        "replicas": 3
    }
});
```
104+
105+
## Status Patching
106+
107+
Status is modified through the `/status` subresource:
108+
109+
```rust
let status_patch = serde_json::json!({
    "apiVersion": "example.com/v1",
    "kind": "MyResource",
    "status": {
        "phase": "Ready",
        "conditions": [{
            "type": "Available",
            "status": "True",
            "lastTransitionTime": "2024-01-01T00:00:00Z",
        }]
    }
});
let pp = PatchParams::apply("my-controller");
api.patch_status("name", &pp, &Patch::Apply(status_patch)).await?;
```
125+
126+
!!! warning "Wrap status in the full object structure"

    ```rust
    // ✗ Sending just the status fields will fail
    serde_json::json!({ "phase": "Ready" })

    // ✓ Must include apiVersion, kind, and wrap under "status"
    serde_json::json!({
        "apiVersion": "example.com/v1",
        "kind": "MyResource",
        "status": { "phase": "Ready" }
    })
    ```
139+
140+
The Kubernetes API expects the full object structure even on the `/status` endpoint.
141+
142+
## Typed SSA
143+
144+
Instead of `serde_json::json!()`, you can use Rust types for type safety and IDE autocompletion:
145+
146+
```rust
use std::collections::BTreeMap;

use k8s_openapi::api::core::v1::ConfigMap;
use kube::api::{ObjectMeta, Patch, PatchParams};

let cm = ConfigMap {
    metadata: ObjectMeta {
        name: Some("my-cm".into()),
        ..Default::default()
    },
    data: Some(BTreeMap::from([("key".into(), "value".into())])),
    ..Default::default()
};
let pp = PatchParams::apply("my-controller");
api.patch("my-cm", &pp, &Patch::Apply(cm)).await?;
```
158+
159+
[k8s-openapi] types omit `None` fields when they are serialized, so unset optional fields never reach the server. For your own types, you need to add `#[serde(skip_serializing_if = "Option::is_none")]` explicitly:
160+
161+
```rust
use serde::Serialize;

#[derive(Serialize)]
struct MyStatus {
    phase: String,
    // Omit the key entirely when there is no message.
    #[serde(skip_serializing_if = "Option::is_none")]
    message: Option<String>,
}
```
169+
170+
Without `skip_serializing_if`, `None` fields serialize as `null` and SSA takes ownership of them.
171+
172+
--8<-- "includes/abbreviations.md"
173+
--8<-- "includes/links.md"
174+
175+
[//begin]: # "Autogenerated link references for markdown compatibility"
176+
[reconciler]: reconciler "The Reconciler"
177+
[//end]: # "Autogenerated link references"
