Description
What steps did you take and what happened?
After upgrading CAPI to v1.9 we observed an issue with the CAPRKE2 provider.
RKE2 uses kubelet local mode by default, so the etcd membership management logic needs to behave the same way it does for kubeadm with k/k 1.32.
The problem is that the node loses API server access as soon as its etcd member is removed, which makes it impossible to proceed with infrastructure machine deletion.
In RKE2 deployments the kubelet is configured to use the local API server (127.0.0.1:443), which in turn relies on the local etcd instance. Once the node is removed from the etcd cluster, the kubelet can no longer reach the API, so the node cannot be drained properly: all pods remain stuck in Terminating state from the Kubernetes perspective.
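For illustration only, here is a minimal Go sketch (not the actual CAPRKE2 or KCP code) of one way a control-plane provider can avoid this ordering problem: register a pre-terminate hook on the Machine, since the CAPI Machine controller drains the node and waits for volume detach before it waits for pre-terminate hooks, and only remove the etcd member at that point. The hook suffix "/rke2-cleanup" and the removeEtcdMember callback are assumptions made for the sketch.

```go
// Illustrative sketch only. Pre-terminate hook annotations use the documented
// CAPI prefix; the suffix and the etcd removal callback are hypothetical.
package controlplane

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const preTerminateHook = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/rke2-cleanup"

// markForDeferredEtcdRemoval adds the hook so the Machine controller will not
// terminate the machine (and its infrastructure) until the hook is removed.
func markForDeferredEtcdRemoval(ctx context.Context, c client.Client, m *clusterv1.Machine) error {
	patch := client.MergeFrom(m.DeepCopy())
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	m.Annotations[preTerminateHook] = ""
	return c.Patch(ctx, m, patch)
}

// finishDeferredEtcdRemoval runs once drain and volume detach have completed:
// it removes the machine's etcd member and then lifts the hook so deletion can
// proceed. removeEtcdMember is an assumed callback into the provider's etcd client.
func finishDeferredEtcdRemoval(ctx context.Context, c client.Client, m *clusterv1.Machine,
	removeEtcdMember func(context.Context, *clusterv1.Machine) error) error {
	if err := removeEtcdMember(ctx, m); err != nil {
		return err
	}
	patch := client.MergeFrom(m.DeepCopy())
	delete(m.Annotations, preTerminateHook)
	return c.Patch(ctx, m, patch)
}
```

With this ordering the kubelet still has a working local API server while pods are drained; only after drain and volume detach does the node drop out of the etcd cluster.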
Logs from the cluster:
12:21:45.068153 1 recorder.go:104] "success waiting for node volumes detaching Machine's node \"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q\"" logger="events" type="Normal" object={"kind":"Machine","namespace":"create-workload-cluster-s51eu2","name":"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q","uid":"825607f8-f44e-465b-a954-ce3de1eb291c","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"2116"} reason="NodeVolumesDetached"
12:21:56.066942 1 recorder.go:104] "error waiting for node volumes detaching, Machine's node \"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q\": failed to list VolumeAttachments: failed to list VolumeAttachments: Get \"https://172.18.0.3:6443/apis/storage.k8s.io/v1/volumeattachments?limit=100&timeout=10s\": context deadline exceeded" logger="events" type="Warning" object={"kind":"Machine","namespace":"create-workload-cluster-s51eu2","name":"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q","uid":"825607f8-f44e-465b-a954-ce3de1eb291c","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"2162"} reason="FailedWaitForVolumeDetach"
12:21:56.087814 1 controller.go:316] "Reconciler error" err="failed to list VolumeAttachments: failed to list VolumeAttachments: Get \"https://172.18.0.3:6443/apis/storage.k8s.io/v1/volumeattachments?limit=100&timeout=10s\": context deadline exceeded" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="create-workload-cluster-s51eu2/caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q" namespace="create-workload-cluster-s51eu2" name="caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q" reconcileID="960a3889-d9b7-41a3-92c4-63f438b0c980"
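For context, the following minimal client-go snippet issues the same VolumeAttachments list call that is timing out in the logs above; running it against the workload cluster kubeconfig (the path below is a placeholder) is a quick way to confirm whether the API server is still reachable after the etcd member has been removed.

```go
// Small reachability check mirroring the Machine controller's volume-detach wait.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder path to the workload cluster kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", "/tmp/workload.kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Same request the controller makes: list VolumeAttachments with a limit of 100.
	list, err := clientset.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{Limit: 100})
	if err != nil {
		fmt.Println("API server unreachable or timed out:", err)
		return
	}
	fmt.Printf("API reachable, %d VolumeAttachments found\n", len(list.Items))
}
```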
What did you expect to happen?
Draining and volume detachment to succeed, and the machine to be deleted without issues.
Cluster API version
v1.9.0
Kubernetes version
v1.29.2 - management
v1.31.0 - workload
Anything else you would like to add?
Logs from CI run with all details: https://github.com/rancher/cluster-api-provider-rke2/actions/runs/12372669685/artifacts/2332172988
Label(s) to be applied
/kind bug
One or more /area labels. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.