@claudiubelu claudiubelu commented May 6, 2025

Currently, the cluster-api-k8s controllers are flooded with reconciliation errors because the k8sd-proxy Pod does not yet exist on a node, or is not yet Ready.

On error, the reconciliation request is put on an exponential backoff queue, which means those requests are retried later and later. This can delay various CAPI-related operations, such as scaling the number of nodes (requesting join tokens), certificate refreshes, and so on.

In case of a k8sd-proxy-related error (the Pod does not yet exist, or it is not yet Ready), we now defer the request by requeueing it after a fixed delay instead of returning the error.
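
In code, the change boils down to roughly the following pattern (a minimal, self-contained sketch, not the exact diff from this PR; the error type names `K8sdProxyNotFound`/`K8sdProxyNotReady` and the helper `deferOnProxyError` are illustrative):

```go
package controllers

import (
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Hypothetical stand-ins for the k8sd-proxy error types; the real
// definitions live in this repository's proxy code.
type K8sdProxyNotFound struct{ Node string }

func (e *K8sdProxyNotFound) Error() string {
	return "missing k8sd proxy pod for node " + e.Node
}

type K8sdProxyNotReady struct{ Pod string }

func (e *K8sdProxyNotReady) Error() string {
	return "pod '" + e.Pod + "' is not Ready"
}

// deferOnProxyError turns a "proxy not found"/"proxy not ready" error into a
// flat requeue instead of an error return, so controller-runtime does not put
// the request on its exponential-backoff queue.
func deferOnProxyError(err error) (ctrl.Result, error) {
	var (
		notFoundErr *K8sdProxyNotFound
		notReadyErr *K8sdProxyNotReady
	)
	if errors.As(err, &notFoundErr) || errors.As(err, &notReadyErr) {
		// The proxy pod is expected to appear / become Ready shortly,
		// so retry with a flat delay.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	// Any other error still goes through the normal backoff path.
	return ctrl.Result{}, err
}
```

The fixed RequeueAfter keeps the retry cadence flat while the proxy Pod is starting up, instead of letting the backoff delay outgrow the Pod's startup time.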

@claudiubelu claudiubelu requested a review from a team as a code owner May 6, 2025 09:51
@claudiubelu claudiubelu force-pushed the reconciliation branch 3 times, most recently from f7fadeb to fd24930 on May 9, 2025 11:03
@claudiubelu claudiubelu changed the title from "WIP: Reconciliation fix" to "Resolve k8sd-proxy-related reconciliation errors" May 9, 2025
@louiseschmidtgen louiseschmidtgen left a comment

Overall, lgtm but I'll leave approval up to our CAPI experts. One question:

```go
	notReadyErr *K8sdProxyNotReady
)
if errors.As(err, &notFoundErr) || errors.As(err, &notReadyErr) {
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
```
Why choose 30 seconds? It seems quite long. How long does the service usually take to become ready?

@claudiubelu claudiubelu May 9, 2025

To be fair, the value is a bit arbitrary. We could lower it to something like 15 seconds.

The k8sd-proxy pods can take quite a while to spawn. Based on a test run that didn't have this PR, we can see in the Certificates Controller Reconciler how many k8sd-proxy-related Reconciler errors occur, and how long it takes until they're no longer an issue (from: https://github.com/canonical/cluster-api-k8s/actions/runs/14885411990/job/41834498342):

```
2025-05-07T22:45:51Z	ERROR	Reconciler error	{"controller": "ck8sconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "CK8sConfig", "CK8sConfig": {"name":"worker-md-0-cld5g-7dvb9","namespace":"workload-cluster-certificate-refresh-u270u9"}, "namespace": "workload-cluster-certificate-refresh-u270u9", "name": "worker-md-0-cld5g-7dvb9", "reconcileID": "59ed7295-ad38-44ee-9119-1268fc2955b4", "error": "failed to request join token: failed to create k8sd proxy: failed to get proxy pods: there isn't any k8sd-proxy pods in target cluster"}
...
2025-05-07T22:46:02Z	ERROR	Reconciler error	{"controller": "ck8sconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "CK8sConfig", "CK8sConfig": {"name":"worker-md-0-cld5g-7dvb9","namespace":"workload-cluster-certificate-refresh-u270u9"}, "namespace": "workload-cluster-certificate-refresh-u270u9", "name": "worker-md-0-cld5g-7dvb9", "reconcileID": "1e4eade8-44be-41bd-967d-b57e581e9ea1", "error": "failed to request join token: failed to create k8sd proxy: failed to get k8sd proxy for control plane, previous errors: pod 'k8sd-proxy-w626x' is not Ready"}
...
2025-05-07T22:46:34Z	ERROR	Reconciler error	{"controller": "machine", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Machine", "Machine": {"name":"worker-md-0-cld5g-7dvb9","namespace":"workload-cluster-certificate-refresh-u270u9"}, "namespace": "workload-cluster-certificate-refresh-u270u9", "name": "worker-md-0-cld5g-7dvb9", "reconcileID": "a78fbd92-b84c-400a-b3c2-d25e5ebd4207", "error": "failed to get certificates expiry date: failed to create k8sd proxy: missing k8sd proxy pod for node capick8s-certificate-refresh-opv64c-worker-md-0-cld5g-7dvb9"}
...
2025-05-07T22:47:37Z	ERROR	Reconciler error	{"controller": "machine", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Machine", "Machine": {"name":"worker-md-0-cld5g-7dvb9","namespace":"workload-cluster-certificate-refresh-u270u9"}, "namespace": "workload-cluster-certificate-refresh-u270u9", "name": "worker-md-0-cld5g-7dvb9", "reconcileID": "ddb63da9-22b4-4183-a7e3-c20a47fc0f05", "error": "failed to get certificates expiry date: failed to create k8sd proxy: missing k8sd proxy pod for node capick8s-certificate-refresh-opv64c-worker-md-0-cld5g-7dvb9"}
...
2025-05-07T22:47:51Z	DEBUG	events	Certificates refresh in progress. TTL: 1y	{"type": "Normal", "object": {"kind":"Machine","namespace":"workload-cluster-certificate-refresh-u270u9","name":"worker-md-0-cld5g-7dvb9","uid":"b96ba028-7986-41da-9f52-9b9df1b6f7cc","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"1879"}, "reason": "CertificatesRefreshInProgress"}
2025-05-07T22:47:51Z	INFO	controllers.Certificates	Certificates refreshed	{"namespace": "workload-cluster-certificate-refresh-u270u9", "machine": "worker-md-0-cld5g-7dvb9", "cluster": "capick8s-certificate-refresh-opv64c", "machine": "worker-md-0-cld5g-7dvb9", "expiry": "2026-05-07T22:47:51Z"}
...
```

So, 2 minutes, more or less.
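
For a rough sense of why the exponential backoff makes this worse, here is a minimal sketch (assuming controller-runtime's default per-item rate limiter: 5 ms base delay, doubled on each failure, capped at 1000 s; those defaults are an assumption, not taken from this PR) that prints how the per-retry delay grows across a roughly 2-minute window:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed defaults of controller-runtime's per-item exponential
	// failure rate limiter; not taken from this PR.
	baseDelay := 5 * time.Millisecond
	maxDelay := 1000 * time.Second

	delay := baseDelay
	var elapsed time.Duration
	for failures := 1; elapsed < 2*time.Minute; failures++ {
		fmt.Printf("failure %2d: next retry in %v (elapsed so far %v)\n", failures, delay, elapsed)
		elapsed += delay
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	// By contrast, a flat RequeueAfter of 30s bounds the gap between
	// retries at 30s, no matter how many times the proxy was not ready.
}
```

The point is just that once a few failures have accumulated, the backoff delay alone can exceed the time the proxy actually needs to become Ready.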

@claudiubelu

Bootstrap controller manager logs without this change: https://paste.ubuntu.com/p/C4rftSV4mS/
Bootstrap controller manager logs with this change: https://paste.ubuntu.com/p/KSt5bYtB6C/

@claudiubelu claudiubelu force-pushed the reconciliation branch 2 times, most recently from 4cdd44f to c9617c8 on May 12, 2025 08:58