
Resilience issue when losing a node in a P2P HA setup #3834

@jnamdar

Description

Kairos version:

PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
KAIROS_ARCH="arm64"
KAIROS_BUG_REPORT_URL="https://github.com/kairos-io/kairos/issues"
KAIROS_FAMILY="debian"
KAIROS_FIPS="false"
KAIROS_FLAVOR="ubuntu"
KAIROS_FLAVOR_RELEASE="22.04"
KAIROS_HOME_URL="https://github.com/kairos-io/kairos"
KAIROS_ID="kairos"
KAIROS_ID_LIKE="kairos-standard-ubuntu-22.04"
KAIROS_INIT_VERSION="v0.5.7"
KAIROS_MODEL="rpi4"
KAIROS_NAME="kairos-standard-ubuntu-22.04"
KAIROS_RELEASE="v3.5.0"
KAIROS_SOFTWARE_VERSION="v1.33.2+k3s1"
KAIROS_SOFTWARE_VERSION_PREFIX="k3s"
KAIROS_TARGETARCH="arm64"
KAIROS_TRUSTED_BOOT="false"
KAIROS_VARIANT="standard"
KAIROS_VERSION="v3.5.0"

CPU architecture, OS, and Version:

Linux kairos-be24 5.15.0-1080-raspi #83-Ubuntu SMP PREEMPT Fri May 30 13:44:46 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
Describe the bug

I wanted to test the overall resiliency of a multi-node Kairos setup, as advertised at https://kairos.io/docs/architecture/network/: "Resilient: Kairos ensures that the cluster remains resilient, even in the face of network disruptions or failures. By using VirtualIPs, nodes can communicate with each other without the need for static IPs, and the cluster's etcd database remains unaffected by any disruptions."

I set up a 4-node cluster using four identical Raspberry Pi 4Bs and configured my p2p section (following this example: https://kairos.io/docs/examples/multi-node-p2p-ha/) as follows:

p2p:
  # Enforce the discovery of nodes on local network
  disable_dht: true
  network_token: "<TOKEN_HERE>"
  # Automatic cluster deployment configuration
  auto:
    # Enables Automatic node configuration (self-coordination)
    # for role assignment
    enable: true
    # HA enables automatic HA roles assignment.
    # A master cluster init is always required,
    # Any additional master_node is configured as part of the
    # HA control plane.
    # If auto is disabled, HA has no effect.
    ha:
      # Enables HA control-plane
      enable: true
      # Number of HA additional master nodes.
      # A master node is always required for creating the cluster and is implied.
      # The setting below adds 2 additional master nodes, for a total of 3.
      master_nodes: 2

At first, everything goes according to plan. I get an HA control plane with 3 nodes elected as masters (including one as clusterinit):

kairos@kairos-be24:~$ sudo kairos role list
Node                                             Role                            IP             
-----------------------------------------------  ------------------------------  ---------------
690b186899e44cf3b7139659271d9d5f-kairos-690b     worker                          10.1.0.4       
0e34593a16cc4eab96b149dfa653ac79-kairos-0e34     master/ha                       10.1.0.2       
be249c08f7f043f0a16243985875afc2-kairos-be24     master/clusterinit              10.1.0.1       
dd24ec0920764fec8748213c161b9079-kairos-dd24     master/ha                       10.1.0.3       
kairos@kairos-be24:~$ sudo kubectl get node -o wide
NAME                   STATUS   ROLES                       AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
kairos-0e34            Ready    control-plane,etcd,master   19h   v1.33.2+k3s1   10.1.0.2      <none>        Ubuntu 22.04.5 LTS   5.15.0-1080-raspi   containerd://2.0.5-k3s1
kairos-690b-d939c31f   Ready    <none>                      19h   v1.33.2+k3s1   10.1.0.4      <none>        Ubuntu 22.04.5 LTS   5.15.0-1080-raspi   containerd://2.0.5-k3s1
kairos-be24            Ready    control-plane,etcd,master   19h   v1.33.2+k3s1   10.1.0.1      <none>        Ubuntu 22.04.5 LTS   5.15.0-1080-raspi   containerd://2.0.5-k3s1
kairos-dd24            Ready    control-plane,etcd,master   19h   v1.33.2+k3s1   10.1.0.3      <none>        Ubuntu 22.04.5 LTS   5.15.0-1080-raspi   containerd://2.0.5-k3s1

Then I shut down one of my nodes, kairos-be24, and monitor the resiliency. Logs and events show the lost node becoming "NotReady" from the k3s point of view, and I still have a connection to my cluster, which is good so far. Note that at this point I had not checked the output of sudo kairos role list again, but I suspect it chose to change one of my worker roles into a master role here.
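
For reference, this is the kind of monitoring I mean; these are standard kubectl commands, run from one of the surviving control-plane nodes:

sudo kubectl get nodes --watch
sudo kubectl get events --sort-by='.lastTimestamp' | tail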

I then plug the node back in, but this time something odd happens. Checking the output, I can confirm Kairos is updating the roles to try to match the HA control plane I requested in my configuration (3 masters):

kairos@kairos-be24:~$ sudo kairos role list
Node                                             Role                            IP             
-----------------------------------------------  ------------------------------  ---------------
690b186899e44cf3b7139659271d9d5f-kairos-690b     master/ha                       10.1.0.4       
0e34593a16cc4eab96b149dfa653ac79-kairos-0e34     master/ha                       10.1.0.2       
be249c08f7f043f0a16243985875afc2-kairos-be24     worker                                         
dd24ec0920764fec8748213c161b9079-kairos-dd24     master/clusterinit              10.1.0.3       

But the missing IP address on the worker is odd. Finally, the output of kubectl get nodes has not changed:

kairos@kairos-be24:~$ sudo kubectl get node -o wide
NAME                   STATUS   ROLES                       AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
kairos-0e34            Ready    control-plane,etcd,master   19h   v1.33.2+k3s1   10.1.0.2      <none>        Ubuntu 22.04.5 LTS   5.15.0-1080-raspi   containerd://2.0.5-k3s1
kairos-690b-d939c31f   Ready    <none>                      19h   v1.33.2+k3s1   10.1.0.4      <none>        Ubuntu 22.04.5 LTS   5.15.0-1080-raspi   containerd://2.0.5-k3s1
kairos-be24            Ready    control-plane,etcd,master   19h   v1.33.2+k3s1   10.1.0.1      <none>        Ubuntu 22.04.5 LTS   5.15.0-1080-raspi   containerd://2.0.5-k3s1
kairos-dd24            Ready    control-plane,etcd,master   19h   v1.33.2+k3s1   10.1.0.3      <none>        Ubuntu 22.04.5 LTS   5.15.0-1080-raspi   containerd://2.0.5-k3s1

which is weird, since there is now a mismatch between the roles Kairos assigned and the roles k3s reports.

Logs from the Kairos provider on kairos-be24 confirm that the node does not update its configuration to reflect the role change:

Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: Active: '12D3KooWB1d4UmjUL97qi4N9W7QpzjGGNnmbu7LWTyM3daz47gK7'
Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: Active: '12D3KooWGTFxGnmimku1gZe8MREqvvC97u8NczZXj39pEMvDYHeX'
Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: Active: '12D3KooWLHQpgrmtvjPTEmV93wY7XyoX1qXszdiwJx6fokWzHfKc'
Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: Active: '12D3KooWNAdx5HXJpyuid4uXcKh6QfjntMvVE6bvnL7xxNu2KMTG'
Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: Applying role 'auto'
Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: Role loaded. Applying auto
Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: Active nodes:[12D3KooWB1d4UmjUL97qi4N9W7QpzjGGNnmbu7LWTyM3daz47gK7 12D3KooWGTFxGnmimku1gZe8MREqvvC97u8NczZXj39pEMvDYHeX 12D3KooWLHQpgrmtvjPTEmV93wY7XyoX1qXszdiwJx6fokWzHfKc 12D3KooWNAdx5HXJpyuid4uXcKh6QfjntMvVE6bvnL7xxNu2KMTG]
Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: Advertizing nodes:[dd24ec0920764fec8748213c161b9079-kairos-dd24 be249c08f7f043f0a16243985875afc2-kairos-be24 0e34593a16cc4eab96b149dfa653ac79-kairos-0e34 690b186899e44cf3b7139659271d9d5f-kairos-690b]
Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: <be249c08f7f043f0a16243985875afc2-kairos-be24> not a leader, leader is '690b186899e44cf3b7139659271d9d5f-kairos-690b', sleeping
Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: Roles assigned
Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: Applying role 'worker'
Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: Role loaded. Applying worker
Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: Starting Worker
Dec 18 13:50:03 kairos-be24 kairos-provider[1657]: Node already configured, backing off

I am guessing that since the node has already been configured once (with the master role), a sentinel file prevents it from reconfiguring (https://github.com/kairos-io/provider-kairos/blob/0f254b788f23c65360bbdd0fd86d522107a5b9df/internal/role/p2p/worker.go#L28).
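
If that is indeed the mechanism, I would expect removing the sentinel to force the node to re-run its role configuration. This is only a sketch, and the sentinel path is my assumption (I believe it is /usr/local/.kairos/deployed, but please check the source linked above):

# Assumption: the sentinel left by the previous (master) configuration
# is what makes the provider back off; removing it should let the newly
# assigned worker role be applied on the next provider pass.
ls -l /usr/local/.kairos/deployed
sudo rm /usr/local/.kairos/deployed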

The logs from the node that went worker -> master show the same thing in reverse:

Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Active: '12D3KooWNAdx5HXJpyuid4uXcKh6QfjntMvVE6bvnL7xxNu2KMTG'
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Active: '12D3KooWB1d4UmjUL97qi4N9W7QpzjGGNnmbu7LWTyM3daz47gK7'
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Active: '12D3KooWGTFxGnmimku1gZe8MREqvvC97u8NczZXj39pEMvDYHeX'
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Active: '12D3KooWLHQpgrmtvjPTEmV93wY7XyoX1qXszdiwJx6fokWzHfKc'
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Applying role 'auto'
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Role loaded. Applying auto
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Active nodes:[12D3KooWB1d4UmjUL97qi4N9W7QpzjGGNnmbu7LWTyM3daz47gK7 12D3KooWGTFxGnmimku1gZe8MREqvvC97u8NczZXj39pEMvDYHeX 12D3KooWLHQpgrmtvjPTEmV93wY7XyoX1qXszdiwJx6fokWzHfKc 12D3KooWNAdx5HXJpyuid4uXcKh6QfjntMvVE6bvnL7xxNu2KMTG]
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Advertizing nodes:[be249c08f7f043f0a16243985875afc2-kairos-be24 0e34593a16cc4eab96b149dfa653ac79-kairos-0e34 690b186899e44cf3b7139659271d9d5f-kairos-690b dd24ec0920764fec8748213c161b9079-kairos-dd24]
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: I'm the leader. My UUID is: 690b186899e44cf3b7139659271d9d5f-kairos-690b.
                                                    Current assigned roles: map[0e34593a16cc4eab96b149dfa653ac79-kairos-0e34:master/ha 690b186899e44cf3b7139659271d9d5f-kairos-690b:master/ha be249c08f7f043f0a16243985875afc2-kairos-be24:worker dd24ec0920764fec8748213c161b9079-kairos-dd24:master/clusterinit]
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Master already present: true
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Unassigned nodes: []
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Done scheduling
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Roles assigned
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Applying role 'master/ha'
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Role loaded. Applying master/ha
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Starting Master(master/ha)
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Checking role assignment
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Determining K8s distro
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Verifying sentinel file
Dec 18 14:18:06 kairos-690b kairos-provider[1607]: Node already configured, propagating master data and backing off

Ultimately my questions are: is this use case supported? Can Kairos withstand node failure? Should I have completely reset my node before plugging it back in? Even if I had, the node Kairos first elected as a worker and then changed to a master did not reconfigure either, and is still a worker from the k3s point of view, which will probably cause issues at some point.
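
If a full reset is the expected recovery path, I assume it would look like this on the returning node (kairos-agent reset exists, but whether it is the intended answer here is exactly what I am asking):

# Assumption: resetting wipes the node's local state (sentinel included),
# so it can rejoin the cluster and take whatever role the leader assigns.
sudo kairos-agent reset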

All in all, I think the documentation could be improved to further explain the resilience behavior promoted in the p2p docs.
