@claudiubelu claudiubelu commented May 6, 2025

Currently, the cluster-api-k8s controllers are flooded with reconciliation errors because the k8sd-proxy Pod does not yet exist on a node, or is not yet Ready.

On error, the reconciliation request is put on an exponential backoff queue, which means those requests are retried later and later. This can delay various CAPI-related operations, such as scaling the number of nodes (requesting join tokens), certificate refreshes, and so on.

In case of a k8sd-proxy-related error (the Pod does not yet exist, or it is not yet Ready), we now defer the request by requeueing it after a fixed delay instead of returning the error.
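
In code, the change boils down to roughly the following pattern (a minimal, self-contained sketch, not the exact diff from this PR; the error type names `K8sdProxyNotFound`/`K8sdProxyNotReady` and the helper `deferOnProxyError` are illustrative):

```go
package controllers

import (
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Hypothetical stand-ins for the k8sd-proxy error types; the real
// definitions live in this repository's proxy code.
type K8sdProxyNotFound struct{ Node string }

func (e *K8sdProxyNotFound) Error() string {
	return "missing k8sd proxy pod for node " + e.Node
}

type K8sdProxyNotReady struct{ Pod string }

func (e *K8sdProxyNotReady) Error() string {
	return "pod '" + e.Pod + "' is not Ready"
}

// deferOnProxyError turns a "proxy not found"/"proxy not ready" error into a
// flat requeue instead of an error return, so controller-runtime does not put
// the request on its exponential-backoff queue.
func deferOnProxyError(err error) (ctrl.Result, error) {
	var (
		notFoundErr *K8sdProxyNotFound
		notReadyErr *K8sdProxyNotReady
	)
	if errors.As(err, &notFoundErr) || errors.As(err, &notReadyErr) {
		// The proxy pod is expected to appear / become Ready shortly,
		// so retry with a flat delay.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	// Any other error still goes through the normal backoff path.
	return ctrl.Result{}, err
}
```

The fixed RequeueAfter keeps the retry cadence flat while the proxy Pod is starting up, instead of letting the backoff delay outgrow the Pod's startup time.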

@claudiubelu claudiubelu requested a review from a team as a code owner May 6, 2025 09:51
@claudiubelu claudiubelu force-pushed the reconciliation branch 3 times, most recently from f7fadeb to fd24930 on May 9, 2025 11:03
@claudiubelu claudiubelu changed the title from "WIP: Reconciliation fix" to "Resolve k8sd-proxy-related reconciliation errors" May 9, 2025
@louiseschmidtgen louiseschmidtgen left a comment

Overall, lgtm but I'll leave approval up to our CAPI experts. One question:

```go
	notReadyErr *K8sdProxyNotReady
)
if errors.As(err, &notFoundErr) || errors.As(err, &notReadyErr) {
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
```
Why choose 30 seconds? It seems quite long. How long does the service usually take to become ready?

@claudiubelu claudiubelu May 9, 2025

To be fair, the value is a bit arbitrary. We could lower it to something like 15 seconds.

The k8sd-proxy pods can take quite a while to spawn. Based on a test run that didn't have this PR, we can see in the Certificates Controller Reconciler how many k8sd-proxy-related Reconciler errors occur, and how long it takes until they're no longer an issue (from: https://github.com/canonical/cluster-api-k8s/actions/runs/14885411990/job/41834498342):

```
2025-05-07T22:45:51Z	ERROR	Reconciler error	{"controller": "ck8sconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "CK8sConfig", "CK8sConfig": {"name":"worker-md-0-cld5g-7dvb9","namespace":"workload-cluster-certificate-refresh-u270u9"}, "namespace": "workload-cluster-certificate-refresh-u270u9", "name": "worker-md-0-cld5g-7dvb9", "reconcileID": "59ed7295-ad38-44ee-9119-1268fc2955b4", "error": "failed to request join token: failed to create k8sd proxy: failed to get proxy pods: there isn't any k8sd-proxy pods in target cluster"}
...
2025-05-07T22:46:02Z	ERROR	Reconciler error	{"controller": "ck8sconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "CK8sConfig", "CK8sConfig": {"name":"worker-md-0-cld5g-7dvb9","namespace":"workload-cluster-certificate-refresh-u270u9"}, "namespace": "workload-cluster-certificate-refresh-u270u9", "name": "worker-md-0-cld5g-7dvb9", "reconcileID": "1e4eade8-44be-41bd-967d-b57e581e9ea1", "error": "failed to request join token: failed to create k8sd proxy: failed to get k8sd proxy for control plane, previous errors: pod 'k8sd-proxy-w626x' is not Ready"}
...
2025-05-07T22:46:34Z	ERROR	Reconciler error	{"controller": "machine", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Machine", "Machine": {"name":"worker-md-0-cld5g-7dvb9","namespace":"workload-cluster-certificate-refresh-u270u9"}, "namespace": "workload-cluster-certificate-refresh-u270u9", "name": "worker-md-0-cld5g-7dvb9", "reconcileID": "a78fbd92-b84c-400a-b3c2-d25e5ebd4207", "error": "failed to get certificates expiry date: failed to create k8sd proxy: missing k8sd proxy pod for node capick8s-certificate-refresh-opv64c-worker-md-0-cld5g-7dvb9"}
...
2025-05-07T22:47:37Z	ERROR	Reconciler error	{"controller": "machine", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Machine", "Machine": {"name":"worker-md-0-cld5g-7dvb9","namespace":"workload-cluster-certificate-refresh-u270u9"}, "namespace": "workload-cluster-certificate-refresh-u270u9", "name": "worker-md-0-cld5g-7dvb9", "reconcileID": "ddb63da9-22b4-4183-a7e3-c20a47fc0f05", "error": "failed to get certificates expiry date: failed to create k8sd proxy: missing k8sd proxy pod for node capick8s-certificate-refresh-opv64c-worker-md-0-cld5g-7dvb9"}
...
2025-05-07T22:47:51Z	DEBUG	events	Certificates refresh in progress. TTL: 1y	{"type": "Normal", "object": {"kind":"Machine","namespace":"workload-cluster-certificate-refresh-u270u9","name":"worker-md-0-cld5g-7dvb9","uid":"b96ba028-7986-41da-9f52-9b9df1b6f7cc","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"1879"}, "reason": "CertificatesRefreshInProgress"}
2025-05-07T22:47:51Z	INFO	controllers.Certificates	Certificates refreshed	{"namespace": "workload-cluster-certificate-refresh-u270u9", "machine": "worker-md-0-cld5g-7dvb9", "cluster": "capick8s-certificate-refresh-opv64c", "machine": "worker-md-0-cld5g-7dvb9", "expiry": "2026-05-07T22:47:51Z"}
...
```

So, 2 minutes, more or less.
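
For a rough sense of why the exponential backoff makes this worse, here is a minimal sketch (assuming controller-runtime's default per-item rate limiter: 5 ms base delay, doubled on each failure, capped at 1000 s; those defaults are an assumption, not taken from this PR) that prints how the per-retry delay grows across a roughly 2-minute window:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed defaults of controller-runtime's per-item exponential
	// failure rate limiter; not taken from this PR.
	baseDelay := 5 * time.Millisecond
	maxDelay := 1000 * time.Second

	delay := baseDelay
	var elapsed time.Duration
	for failures := 1; elapsed < 2*time.Minute; failures++ {
		fmt.Printf("failure %2d: next retry in %v (elapsed so far %v)\n", failures, delay, elapsed)
		elapsed += delay
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	// By contrast, a flat RequeueAfter of 30s bounds the gap between
	// retries at 30s, no matter how many times the proxy was not ready.
}
```

The point is just that once a few failures have accumulated, the backoff delay alone can exceed the time the proxy actually needs to become Ready.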

@claudiubelu

Bootstrap controller manager logs without this change: https://paste.ubuntu.com/p/C4rftSV4mS/
Bootstrap controller manager logs with this change: https://paste.ubuntu.com/p/KSt5bYtB6C/

@claudiubelu claudiubelu force-pushed the reconciliation branch 2 times, most recently from 4cdd44f to c9617c8 on May 12, 2025 08:58