Replies: 1 comment
You should not have an even number of etcd nodes; see https://docs.rke2.io/install/ha
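For context: etcd needs a quorum of floor(n/2) + 1 members, so a 4-node cluster requires 3 healthy members and tolerates only one failure, exactly the same as a 3-node cluster, while adding one more member that can fail. As a minimal sketch (the paths are RKE2 defaults and the etcd pod name depends on your hostname; adjust if your install differs), you can check member health from a working server node:

```sh
# Run from a healthy etcd/control-plane node. RKE2 keeps etcd client certs
# under this directory by default (assumption: default data-dir).
CERTS=/var/lib/rancher/rke2/server/tls/etcd

# etcdctl ships inside the etcd static pod, which is named etcd-<node-name>.
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
  -n kube-system exec etcd-<node-name> -- etcdctl \
  --cert "$CERTS/server-client.crt" \
  --key "$CERTS/server-client.key" \
  --cacert "$CERTS/server-ca.crt" \
  member list --write-out=table
```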
That said, this appears to be a Rancher provisioning issue, not an RKE2 one. You can check the Rancher logs to see what the CAPI controllers are waiting for before they begin installing and configuring RKE2 on this node. Also, Kubernetes 1.24 has been end-of-life since August 2023; you should upgrade this cluster, or build a new cluster on a non-EOL version and migrate your workloads over.
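A minimal sketch of where to look, assuming a standard Rancher v2 install (the label selector and the fleet-default namespace are the common defaults; substitute your own cluster and machine names):

```sh
# On the Rancher management (local) cluster: tail provisioning/CAPI messages
# that mention the downstream cluster.
kubectl -n cattle-system logs -l app=rancher --tail=200 | grep -i <cluster-name>

# Custom clusters get CAPI Machine objects in fleet-default; their conditions
# usually state exactly what the controller is still waiting for.
kubectl -n fleet-default get machines.cluster.x-k8s.io
kubectl -n fleet-default describe machines.cluster.x-k8s.io <machine-name>
```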
Environmental Info:
RKE2 Version: v1.24.17+rke2r1
Node(s) CPU architecture, OS, and Version: Linux xxx 5.15.0-112-generic #122-Ubuntu SMP Thu May 23 07:48:21 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration: 4 etcd/control plane + 5 workers
Describe the bug:
I have a custom RKE2 cluster with 4 etcd/control plane nodes and 5 worker nodes. We recently experienced a crash, and one of the etcd/control plane nodes is no longer able to rejoin the cluster.
The crash occurred over a year after the cluster was created, and we suspect it might be related to certificate expiration, as the documentation indicates certificates expire after one year.
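To test that theory, the certificate dates can be inspected directly on a server node (a sketch; these are the default RKE2 paths and may differ on other installs):

```sh
# Print the expiry date of every RKE2-issued server certificate.
for crt in /var/lib/rancher/rke2/server/tls/*.crt; do
  echo "$crt: $(openssl x509 -enddate -noout -in "$crt")"
done
```

As I understand the docs, RKE2 rotates certificates that are expired or within 90 days of expiring when the rke2-server service restarts, so rebooting the healthy nodes may already have refreshed theirs.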
We have rebooted the entire platform, and all nodes are working correctly except for the fourth control plane node. I even tried provisioning a brand new server to join as the fourth node, and the issue persists.
To join the fourth node, I use the command found under Registration (Cluster Management) in the Rancher GUI. I’ve also made sure to fully clean the node each time before attempting to join it, following the steps outlined in this Rancher guide:
https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/manage-clusters/clean-cluster-nodes
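Concretely, the cleanup boils down to roughly the following (a condensed sketch of that guide; the uninstall script locations assume a default install, and the scripts only exist if the corresponding component was installed):

```sh
# Remove RKE2 and the Rancher system agent so the node can register as new.
/usr/local/bin/rke2-uninstall.sh
systemctl stop rancher-system-agent 2>/dev/null || true
/usr/local/bin/rancher-system-agent-uninstall.sh

# Clear any leftover state.
rm -rf /etc/rancher /var/lib/rancher
```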
The node consistently gets stuck with the error: "Waiting for node ref".
The rke2-server service never starts at all — there is no output related to RKE2.
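To double-check that nothing ever reaches the node, I verify the service and the agent's state directly (a sketch; /var/lib/rancher/agent is the system agent's default working directory on my install):

```sh
# rke2-server is never installed or started: no unit, no logs.
systemctl status rke2-server
journalctl -u rke2-server --no-pager

# Check whether rancher-system-agent ever received an install plan.
journalctl -u rancher-system-agent --no-pager | grep -iE 'plan|error'
ls -la /var/lib/rancher/agent/
```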
Steps To Reproduce:
Fully clean the node following the guide above, then run the registration command from Registration (Cluster Management) in the Rancher GUI on the node.
Expected behavior:
What I expect is for this fourth node to be able to join the cluster.
Actual behavior:
The node gets stuck on "Waiting for node ref" and the rke2-server service never starts.
Additional context / logs:
The rancher-system-agent service does start correctly. In its logs (`journalctl -u rancher-system-agent`), I can see messages like:

```
Apr 29 10:54:30 xxx rancher-system-agent[31303]: E0429 10:54:30.799603 31303 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
Apr 29 10:54:30 xxx rancher-system-agent[31303]: time="2025-04-29T10:54:30+01:00" level=info msg="Starting /v1, Kind=Secret controller"
```
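From the Rancher management cluster, the stuck machine's plan can also be inspected (a sketch assuming the usual fleet-default namespace and plan-secret naming; <machine-name> is the machine stuck on "Waiting for node ref"):

```sh
# Each provisioned machine has a plan secret that the system agent polls;
# an empty or missing plan would explain why rke2-server is never installed.
kubectl -n fleet-default get secrets | grep machine-plan
kubectl -n fleet-default get secret <machine-name>-machine-plan -o yaml
```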