fix: Resolve k8sd-proxy-related reconciliation errors #146
base: main
Conversation
f7fadeb to fd24930 (Compare)
Overall, lgtm but I'll leave approval up to our CAPI experts. One question:
	notReadyErr *K8sdProxyNotReady
)
if errors.As(err, &notFoundErr) || errors.As(err, &notReadyErr) {
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
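For reference, a minimal sketch of what the two error types checked above could look like. The type names come from the snippet, but the fields and messages are assumptions (loosely modelled on the log lines quoted further down), not the PR's actual definitions:

```go
import "fmt"

// K8sdProxyNotFound: no k8sd-proxy pod exists (yet) for the target node.
// Hypothetical definition, for illustration only.
type K8sdProxyNotFound struct {
	NodeName string
}

func (e *K8sdProxyNotFound) Error() string {
	return fmt.Sprintf("missing k8sd proxy pod for node %s", e.NodeName)
}

// K8sdProxyNotReady: the k8sd-proxy pod exists but is not Ready yet.
// Hypothetical definition, for illustration only.
type K8sdProxyNotReady struct {
	PodName string
}

func (e *K8sdProxyNotReady) Error() string {
	return fmt.Sprintf("pod %q is not Ready", e.PodName)
}
```

Because both `Error()` methods are on pointer receivers, `errors.As(err, &notFoundErr)` with a `*K8sdProxyNotFound` target matches wrapped errors of that type, which is what the check above relies on.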
Why choose 30 seconds? It seems quite long. How long does the service usually take to become ready?
To be fair, the value is a bit arbitrary. We could lower it to something like 15 seconds.
The k8sd-proxy pods can take quite a while to spawn. Based on a test run that didn't include this PR, we can see in the Certificates Controller reconciler how many k8sd-proxy-related Reconciler errors occur, and how long it can take until they're no longer an issue (from: https://github.com/canonical/cluster-api-k8s/actions/runs/14885411990/job/41834498342):
2025-05-07T22:45:51Z ERROR Reconciler error {"controller": "ck8sconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "CK8sConfig", "CK8sConfig": {"name":"worker-md-0-cld5g-7dvb9","namespace":"workload-cluster-certificate-refresh-u270u9"}, "namespace": "workload-cluster-certificate-refresh-u270u9", "name": "worker-md-0-cld5g-7dvb9", "reconcileID": "59ed7295-ad38-44ee-9119-1268fc2955b4", "error": "failed to request join token: failed to create k8sd proxy: failed to get proxy pods: there isn't any k8sd-proxy pods in target cluster"}
.
.
.
2025-05-07T22:46:02Z ERROR Reconciler error {"controller": "ck8sconfig", "controllerGroup": "bootstrap.cluster.x-k8s.io", "controllerKind": "CK8sConfig", "CK8sConfig": {"name":"worker-md-0-cld5g-7dvb9","namespace":"workload-cluster-certificate-refresh-u270u9"}, "namespace": "workload-cluster-certificate-refresh-u270u9", "name": "worker-md-0-cld5g-7dvb9", "reconcileID": "1e4eade8-44be-41bd-967d-b57e581e9ea1", "error": "failed to request join token: failed to create k8sd proxy: failed to get k8sd proxy for control plane, previous errors: pod 'k8sd-proxy-w626x' is not Ready"}
.
.
.
2025-05-07T22:46:34Z ERROR Reconciler error {"controller": "machine", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Machine", "Machine": {"name":"worker-md-0-cld5g-7dvb9","namespace":"workload-cluster-certificate-refresh-u270u9"}, "namespace": "workload-cluster-certificate-refresh-u270u9", "name": "worker-md-0-cld5g-7dvb9", "reconcileID": "a78fbd92-b84c-400a-b3c2-d25e5ebd4207", "error": "failed to get certificates expiry date: failed to create k8sd proxy: missing k8sd proxy pod for node capick8s-certificate-refresh-opv64c-worker-md-0-cld5g-7dvb9"}
.
.
.
2025-05-07T22:47:37Z ERROR Reconciler error {"controller": "machine", "controllerGroup": "cluster.x-k8s.io", "controllerKind": "Machine", "Machine": {"name":"worker-md-0-cld5g-7dvb9","namespace":"workload-cluster-certificate-refresh-u270u9"}, "namespace": "workload-cluster-certificate-refresh-u270u9", "name": "worker-md-0-cld5g-7dvb9", "reconcileID": "ddb63da9-22b4-4183-a7e3-c20a47fc0f05", "error": "failed to get certificates expiry date: failed to create k8sd proxy: missing k8sd proxy pod for node capick8s-certificate-refresh-opv64c-worker-md-0-cld5g-7dvb9"}
.
.
.
2025-05-07T22:47:51Z DEBUG events Certificates refresh in progress. TTL: 1y {"type": "Normal", "object": {"kind":"Machine","namespace":"workload-cluster-certificate-refresh-u270u9","name":"worker-md-0-cld5g-7dvb9","uid":"b96ba028-7986-41da-9f52-9b9df1b6f7cc","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"1879"}, "reason": "CertificatesRefreshInProgress"}
2025-05-07T22:47:51Z INFO controllers.Certificates Certificates refreshed {"namespace": "workload-cluster-certificate-refresh-u270u9", "machine": "worker-md-0-cld5g-7dvb9", "cluster": "capick8s-certificate-refresh-opv64c", "machine": "worker-md-0-cld5g-7dvb9", "expiry": "2026-05-07T22:47:51Z"}
...
So, 2 minutes, more or less.
Bootstrap controller manager logs without this change: https://paste.ubuntu.com/p/C4rftSV4mS/
4cdd44f to c9617c8 (Compare)
Currently, the cluster-api-k8s controllers are being flooded with reconciliation errors because the k8sd-proxy does not yet exist on a node, or it's not yet Ready. On error, the reconciliation request is put on an exponential backoff queue, which results in those requests being retried later and later. This can cause delays in various CAPI-related operations, such as scaling the number of nodes (requesting join tokens), certificate refreshes, and so on. In case of a k8sd-proxy-related error (the Pod does not yet exist, or it's not Ready), we now requeue the request after a fixed delay instead of returning an error.
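As a sketch of the mechanism (not necessarily the exact code in this PR): returning an error from Reconcile puts the request on controller-runtime's rate-limited (exponential backoff) queue, whereas returning a Result with RequeueAfter re-enqueues it after a fixed delay. The helper name below is hypothetical:

```go
import (
	"context"
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func (r *CK8sConfigReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... fetch the CK8sConfig, owning Machine, workload-cluster client, etc.

	if err := r.doK8sdProxyWork(ctx, req); err != nil { // hypothetical helper
		var (
			notFoundErr *K8sdProxyNotFound
			notReadyErr *K8sdProxyNotReady
		)
		// Transient condition: the k8sd-proxy pod doesn't exist yet or isn't Ready.
		// Requeue after a fixed delay instead of returning the error, which would
		// push the request onto the exponential backoff queue.
		if errors.As(err, &notFoundErr) || errors.As(err, &notReadyErr) {
			return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
		}
		// Any other error still goes through the normal backoff path.
		return ctrl.Result{}, err
	}

	return ctrl.Result{}, nil
}
```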
c9617c8 to 8bbcf07 (Compare)