Replies: 1 comment
You should not have an even number of etcd nodes; see https://docs.rke2.io/install/ha
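For context: etcd needs a quorum of floor(n/2) + 1 members, so a 4-node cluster requires 3 healthy members and tolerates only one failure, exactly the same as a 3-node cluster, while adding one more member that can fail. As a minimal sketch (the paths are RKE2 defaults and the etcd pod name depends on your hostname; adjust if your install differs), you can check member health from a working server node:

```sh
# Run from a healthy etcd/control-plane node. RKE2 keeps etcd client certs
# under this directory by default (assumption: default data-dir).
CERTS=/var/lib/rancher/rke2/server/tls/etcd

# etcdctl ships inside the etcd static pod, which is named etcd-<node-name>.
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
  -n kube-system exec etcd-<node-name> -- etcdctl \
  --cert "$CERTS/server-client.crt" \
  --key "$CERTS/server-client.key" \
  --cacert "$CERTS/server-ca.crt" \
  member list --write-out=table
```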
That said, this appears to be a Rancher provisioning issue, not an RKE2 one. You can check the Rancher logs to see what the CAPI controllers are waiting for before they begin installing and configuring RKE2 on this node. Also, Kubernetes 1.24 has been end-of-life since August 2023; you should upgrade this cluster, or build a new cluster on a non-EOL version and migrate your workloads over.
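A minimal sketch of where to look, assuming a standard Rancher v2 install (the label selector and the fleet-default namespace are the common defaults; substitute your own cluster and machine names):

```sh
# On the Rancher management (local) cluster: tail provisioning/CAPI messages
# that mention the downstream cluster.
kubectl -n cattle-system logs -l app=rancher --tail=200 | grep -i <cluster-name>

# Custom clusters get CAPI Machine objects in fleet-default; their conditions
# usually state exactly what the controller is still waiting for.
kubectl -n fleet-default get machines.cluster.x-k8s.io
kubectl -n fleet-default describe machines.cluster.x-k8s.io <machine-name>
```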
Environmental Info:
RKE2 Version: v1.24.17+rke2r1
Node(s) CPU architecture, OS, and Version: Linux xxx 5.15.0-112-generic #122-Ubuntu SMP Thu May 23 07:48:21 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration: 4 etcd/control plane + 5 workers
Describe the bug:
I have a custom RKE2 cluster with 4 etcd/control plane nodes and 5 worker nodes. We recently experienced a crash, and one of the etcd/control plane nodes is no longer able to rejoin the cluster.
The crash occurred over a year after the cluster was created, and we suspect it might be related to certificate expiration, as the documentation indicates certificates expire after one year.
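To test that theory, the certificate dates can be inspected directly on a server node (a sketch; these are the default RKE2 paths and may differ on other installs):

```sh
# Print the expiry date of every RKE2-issued server certificate.
for crt in /var/lib/rancher/rke2/server/tls/*.crt; do
  echo "$crt: $(openssl x509 -enddate -noout -in "$crt")"
done
```

As I understand the docs, RKE2 rotates certificates that are expired or within 90 days of expiring when the rke2-server service restarts, so rebooting the healthy nodes may already have refreshed theirs.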
We have rebooted the entire platform, and all nodes are working correctly except for the fourth control plane node. I even tried provisioning a brand new server to join as the fourth node, and the issue persists.
To join the fourth node, I use the command found under Registration (Cluster Management) in the Rancher GUI. I’ve also made sure to fully clean the node each time before attempting to join it, following the steps outlined in this Rancher guide:
https://ranchermanager.docs.rancher.com/how-to-guides/new-user-guides/manage-clusters/clean-cluster-nodes
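Concretely, the cleanup boils down to roughly the following (a condensed sketch of that guide; the uninstall script locations assume a default install, and the scripts only exist if the corresponding component was installed):

```sh
# Remove RKE2 and the Rancher system agent so the node can register as new.
/usr/local/bin/rke2-uninstall.sh
systemctl stop rancher-system-agent 2>/dev/null || true
/usr/local/bin/rancher-system-agent-uninstall.sh

# Clear any leftover state.
rm -rf /etc/rancher /var/lib/rancher
```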
The node consistently gets stuck with the error: "Waiting for node ref".
The rke2-server service never starts at all — there is no output related to RKE2.
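To double-check that nothing ever reaches the node, I verify the service and the agent's state directly (a sketch; /var/lib/rancher/agent is the system agent's default working directory on my install):

```sh
# rke2-server is never installed or started: no unit, no logs.
systemctl status rke2-server
journalctl -u rke2-server --no-pager

# Check whether rancher-system-agent ever received an install plan.
journalctl -u rancher-system-agent --no-pager | grep -iE 'plan|error'
ls -la /var/lib/rancher/agent/
```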
Steps To Reproduce:
Fully clean the node following the guide above, then run the registration command from Registration (Cluster Management) in the Rancher GUI on the node.
Expected behavior:
What I expect is for this fourth node to be able to join the cluster.
Actual behavior:
The node gets stuck on "Waiting for node ref" and the rke2-server service never starts.
Additional context / logs:
The rancher-system-agent service does start correctly. In its logs (`journalctl -u rancher-system-agent`), I can see messages like:

```
Apr 29 10:54:30 xxx rancher-system-agent[31303]: E0429 10:54:30.799603 31303 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
Apr 29 10:54:30 xxx rancher-system-agent[31303]: time="2025-04-29T10:54:30+01:00" level=info msg="Starting /v1, Kind=Secret controller"
```
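From the Rancher management cluster, the stuck machine's plan can also be inspected (a sketch assuming the usual fleet-default namespace and plan-secret naming; <machine-name> is the machine stuck on "Waiting for node ref"):

```sh
# Each provisioned machine has a plan secret that the system agent polls;
# an empty or missing plan would explain why rke2-server is never installed.
kubectl -n fleet-default get secrets | grep machine-plan
kubectl -n fleet-default get secret <machine-name>-machine-plan -o yaml
```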