
Machine fails to finish draining/volume detachment after successful completion #11591

@Danil-Grigorev

What steps did you take and what happened?

After upgrading CAPI to v1.9 we observed an issue with the CAPRKE2 provider.

RKE2 uses kubelet local mode by default, so the etcd membership management logic behaves the same way it does for kubeadm on k/k 1.32.
The problem is a loss of API server access after the etcd member is removed, which makes it impossible to proceed with deleting the infrastructure machine.

The issue is that in RKE2 deployments the kubelet is configured to use the local API server (127.0.0.1:443), which in turn relies on the local etcd pod. Once the node is removed from the etcd cluster, the kubelet can no longer reach the API, and the drain never finishes properly: from the Kubernetes perspective, all pods on the node remain stuck in the Terminating state.
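To make the symptom concrete, here is a minimal client-go sketch (the kubeconfig path is a placeholder; the node name is copied from the logs below) that lists the pods left stuck in Terminating on the drained node. A pod that carries a deletion timestamp but never disappears is exactly this failure mode: the API server accepted the delete, but the node's kubelet can no longer reach the API to confirm termination.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig for the workload cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/workload.kubeconfig")
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// List the pods scheduled on the node being drained.
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q",
	})
	if err != nil {
		panic(err)
	}
	// A non-nil DeletionTimestamp on a pod that never goes away means the
	// drain is stuck: the kubelet can no longer ack the termination.
	for _, p := range pods.Items {
		if p.DeletionTimestamp != nil {
			fmt.Printf("stuck Terminating: %s/%s since %s\n",
				p.Namespace, p.Name, p.DeletionTimestamp.Time)
		}
	}
}
```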

Logs from the cluster:

```
12:21:45.068153       1 recorder.go:104] "success waiting for node volumes detaching Machine's node \"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q\"" logger="events" type="Normal" object={"kind":"Machine","namespace":"create-workload-cluster-s51eu2","name":"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q","uid":"825607f8-f44e-465b-a954-ce3de1eb291c","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"2116"} reason="NodeVolumesDetached"
12:21:56.066942       1 recorder.go:104] "error waiting for node volumes detaching, Machine's node \"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q\": failed to list VolumeAttachments: failed to list VolumeAttachments: Get \"https://172.18.0.3:6443/apis/storage.k8s.io/v1/volumeattachments?limit=100&timeout=10s\": context deadline exceeded" logger="events" type="Warning" object={"kind":"Machine","namespace":"create-workload-cluster-s51eu2","name":"caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q","uid":"825607f8-f44e-465b-a954-ce3de1eb291c","apiVersion":"cluster.x-k8s.io/v1beta1","resourceVersion":"2162"} reason="FailedWaitForVolumeDetach"
12:21:56.087814       1 controller.go:316] "Reconciler error" err="failed to list VolumeAttachments: failed to list VolumeAttachments: Get \"https://172.18.0.3:6443/apis/storage.k8s.io/v1/volumeattachments?limit=100&timeout=10s\": context deadline exceeded" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="create-workload-cluster-s51eu2/caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q" namespace="create-workload-cluster-s51eu2" name="caprke2-e2e-fukv6s-upgrade-control-plane-kzt8q" reconcileID="960a3889-d9b7-41a3-92c4-63f438b0c980"
```
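For reference, the failing request in these logs is a plain VolumeAttachment list against the workload cluster. A minimal client-go sketch of the same call (the endpoint is copied from the logs; credentials and TLS handling are elided, so treat it as an illustration only) times out the same way once the API server has lost its local etcd member:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Endpoint taken from the logs above; auth is elided for the sketch,
	// and Insecure skips server certificate verification.
	cfg := &rest.Config{
		Host:            "https://172.18.0.3:6443",
		Timeout:         10 * time.Second,
		TLSClientConfig: rest.TLSClientConfig{Insecure: true},
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// The same request the Machine controller issues while waiting for
	// volumes to detach. Once the API server has lost its etcd member, it
	// fails with "context deadline exceeded" instead of returning a list.
	vas, err := cs.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{Limit: 100})
	if err != nil {
		fmt.Println("failed to list VolumeAttachments:", err)
		return
	}
	fmt.Println("VolumeAttachments still present:", len(vas.Items))
}
```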

What did you expect to happen?

Draining and volume detachment to succeed, and the machine to be deleted without issues.

Cluster API version

v1.9.0

Kubernetes version

v1.29.2 - management
v1.31.0 - workload

Anything else you would like to add?

Logs from CI run with all details: https://github.com/rancher/cluster-api-provider-rke2/actions/runs/12372669685/artifacts/2332172988

Label(s) to be applied

/kind bug


Labels

kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
