BUG: data race between control plane components and host modification #5614

@A2ureStone

Description

Sealos Version

v5.0.1

How to reproduce the bug?

I was testing how Sealos recovers from a failure of the master0 node and found an issue: on control plane nodes other than master0, the hosts file on the host machine and the hosts file inside the pods are inconsistent. For example, the hosts file in the controller-manager pod still resolves apiserver.cluster.local to master0, so when master0 becomes unavailable, controller-manager fails. The same behavior is observed in other control plane components as well.

Reproduce

The cluster is configured with three control plane nodes (master1, master2, and master3) and one worker node (node1); master1 acts as master0. The cluster was started with the following command:

root@master1:~# sealos gen registry.cn-shanghai.aliyuncs.com/labring/kubernetes:v1.29.9 registry.cn-shanghai.aliyuncs.com/labring/helm:v3.9.4 registry.cn-shanghai.aliyuncs.com/labring/cilium:v1.13.4 \
     --masters 192.168.64.15,192.168.64.16,192.168.64.17 \
     --nodes 192.168.64.18 \
     -u root --pk='/root/.ssh/multipass_key' \
     --output Clusterfile
sealos apply -f Clusterfile

Then shut down master1. Below is the output from controller-manager on master2 after master1 (master0) went offline. As you can see, apiserver.cluster.local resolves to master1 (master0), 192.168.64.15.

root@master2:~# kubectl logs -n kube-system kube-controller-manager-master2 | tail -n10
E0526 13:58:46.446458       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://apiserver.cluster.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 192.168.64.15:6443: connect: no route to host
E0526 13:58:52.590254       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://apiserver.cluster.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 192.168.64.15:6443: connect: no route to host
E0526 13:58:58.736356       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://apiserver.cluster.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 192.168.64.15:6443: connect: no route to host
E0526 13:59:01.811818       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://apiserver.cluster.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 192.168.64.15:6443: connect: no route to host
E0526 13:59:04.879391       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://apiserver.cluster.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 192.168.64.15:6443: connect: no route to host
E0526 13:59:11.023681       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://apiserver.cluster.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 192.168.64.15:6443: connect: no route to host
E0526 13:59:14.104059       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://apiserver.cluster.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 192.168.64.15:6443: connect: no route to host
E0526 13:59:17.166370       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://apiserver.cluster.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 192.168.64.15:6443: connect: no route to host
E0526 13:59:20.240788       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://apiserver.cluster.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 192.168.64.15:6443: connect: no route to host
E0526 13:59:26.395983       1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://apiserver.cluster.local:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": dial tcp 192.168.64.15:6443: connect: no route to host

By directly inspecting the hosts file under /var/lib/kubelet/pods, I confirmed this was indeed the case.

root@master2:~# grep -i "apiserver.cluster.local" -r /var/lib/kubelet/pods/
/var/lib/kubelet/pods/56a1a6061487d03b440de1b2e6d4cba5/etc-hosts:192.168.64.15 apiserver.cluster.local
/var/lib/kubelet/pods/e669d082174b1b8e93a3b80fa1a4a2b9/etc-hosts:192.168.64.15 apiserver.cluster.local
/var/lib/kubelet/pods/14812d81b7c93b918d4faed7ae4a6dcf/etc-hosts:192.168.64.15 apiserver.cluster.local
/var/lib/kubelet/pods/a27a8174-6ac0-4f5d-81fe-17a05a525c43/etc-hosts:192.168.64.15 apiserver.cluster.local
/var/lib/kubelet/pods/a27a8174-6ac0-4f5d-81fe-17a05a525c43/volumes/kubernetes.io~configmap/kube-proxy/..2025_05_26_02_51_19.923862820/kubeconfig.conf:    server: https://apiserver.cluster.local:6443
/var/lib/kubelet/pods/a9d78507a0d74897c9c340c682a3413f/etc-hosts:192.168.64.15 apiserver.cluster.local
/var/lib/kubelet/pods/cfb54c6d-8bd4-4533-b2e1-8b9b74d47e43/etc-hosts:192.168.64.16 apiserver.cluster.local
root@master2:~# ls /var/lib/kubelet/pods/*/containers
/var/lib/kubelet/pods/14812d81b7c93b918d4faed7ae4a6dcf/containers:
kube-scheduler

/var/lib/kubelet/pods/362d9138-58c5-4961-b8b9-39d9aa57f8d7/containers:
coredns

/var/lib/kubelet/pods/56a1a6061487d03b440de1b2e6d4cba5/containers:
kube-controller-manager

/var/lib/kubelet/pods/a27a8174-6ac0-4f5d-81fe-17a05a525c43/containers:
kube-proxy

/var/lib/kubelet/pods/a9d78507a0d74897c9c340c682a3413f/containers:
etcd

/var/lib/kubelet/pods/cfb54c6d-8bd4-4533-b2e1-8b9b74d47e43/containers:
apply-sysctl-overwrites  cilium-agent  clean-cilium-state  config  install-cni-binaries  mount-bpf-fs  mount-cgroup

/var/lib/kubelet/pods/e669d082174b1b8e93a3b80fa1a4a2b9/containers:
kube-apiserver

/var/lib/kubelet/pods/e9afef92-76ae-461a-9fde-f105f5521e07/containers:
coredns

Pod 56a1a6061487d03b440de1b2e6d4cba5 is kube-controller-manager, and its hosts file contains the record
192.168.64.15 apiserver.cluster.local.
However, the host machine’s /etc/hosts file looks like this:

root@master2:~# cat /etc/hosts
# Your system has configured 'manage_etc_hosts' as True.
# As a result, if you wish for changes to this file to persist
# then you will need to either
# a.) make changes to the master file in /etc/cloud/templates/hosts.debian.tmpl
# b.) change or remove the value of 'manage_etc_hosts' in
#     /etc/cloud/cloud.cfg or cloud-config from user-data
#
127.0.1.1 master2 master2
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
192.168.64.15 sealos.hub
192.168.64.16 apiserver.cluster.local

Not every pod on master2 has its apiserver.cluster.local entry pointing to master1 (master0): in the grep output above, the cilium pod (cfb54c6d-8bd4-4533-b2e1-8b9b74d47e43) points to the host machine itself, 192.168.64.16.
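The comparison done by hand above (host /etc/hosts versus each pod's etc-hosts) can be sketched in Go. This is only an illustration: `lookupHost` and `stalePods` are hypothetical helper names, not Sealos or kubelet functions, and the hosts contents are passed as in-memory strings copied from the output above rather than read from /var/lib/kubelet/pods/*/etc-hosts.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// lookupHost returns the first IP mapped to name in hosts-file content.
// Comments (after '#') and blank lines are ignored.
func lookupHost(content, name string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		for _, alias := range fields[1:] {
			if alias == name {
				return fields[0], true
			}
		}
	}
	return "", false
}

// stalePods lists (sorted) the pods whose hosts content maps name to a
// different IP than the node's hosts content does.
func stalePods(nodeHosts string, podHosts map[string]string, name string) []string {
	want, ok := lookupHost(nodeHosts, name)
	if !ok {
		return nil
	}
	var stale []string
	for pod, content := range podHosts {
		if got, found := lookupHost(content, name); found && got != want {
			stale = append(stale, pod)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	const domain = "apiserver.cluster.local"
	// Sample data taken from the master2 output in this report.
	node := "127.0.0.1 localhost\n192.168.64.15 sealos.hub\n192.168.64.16 apiserver.cluster.local\n"
	pods := map[string]string{
		"kube-controller-manager": "192.168.64.15 apiserver.cluster.local\n",
		"cilium":                  "192.168.64.16 apiserver.cluster.local\n",
	}
	for _, pod := range stalePods(node, pods, domain) {
		fmt.Printf("%s: hosts entry differs from node /etc/hosts\n", pod)
	}
}
```

With the sample data, only kube-controller-manager is reported, matching what the grep output shows.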
After reading parts of Sealos’ source code, I suspect this inconsistency is caused by a data race. During the initialization of a control plane node, Sealos modifies the host machine’s /etc/hosts file twice. The first modification happens during the init phase: Sealos checks whether the node is a master and, if so, points apiserver.cluster.local to master0. This makes sense, because a joining master locates master0 through apiserver.cluster.local. The second modification happens after the kubeadm join command completes and points the domain to the node itself. However, kubeadm join only waits for the API server to become ready, not for all control plane components, so a pod may start either before or after the second modification.
The first hosts modification, in the init stage:

defaultInitializers = append(defaultInitializers, &registryHostApplier{}, &registryApplier{}, &defaultCRIInitializer{}, &apiServerHostApplier{}, &lvscareHostApplier{}, &defaultInitializer{})

func (a *apiServerHostApplier) Apply(ctx Context, host string) error {
    // Master nodes: point the API server domain at master0.
    if slices.Contains(ctx.GetCluster().GetMasterIPAndPortList(), host) {
        if err := ctx.GetRemoter().HostsAdd(host, ctx.GetCluster().GetMaster0IP(), constants.DefaultAPIServerDomain); err != nil {
            return fmt.Errorf("failed to add hosts: %v", err)
        }
        return nil
    }
    // All other nodes: point the API server domain at the VIP.
    if err := ctx.GetRemoter().HostsAdd(host, ctx.GetCluster().GetVIP(), constants.DefaultAPIServerDomain); err != nil {
        return fmt.Errorf("failed to add hosts: %v", err)
    }

    return nil
}

The second hosts modification, after kubeadm join:

func (k *KubeadmRuntime) joinMasters(masters []string) error {
    // ...
    err = k.sshCmdAsync(master, joinCmd)
    if err != nil {
        return fmt.Errorf("exec kubeadm join in %s failed %v", master, err)
    }

    // Runs once kubeadm join returns: re-point the API server domain
    // on this master at the master's own IP.
    err = k.execHostsAppend(master, master, k.getAPIServerDomain())
    if err != nil {
        return fmt.Errorf("add master0 apiserver domain hosts in %s failed %v", master, err)
    }
    // ...
}

The possible cases

case 1, apiserver.cluster.local in the pod points to master0: first hosts modification --> pod start --> second hosts modification
case 2, apiserver.cluster.local in the pod points to the host machine: first hosts modification --> second hosts modification --> pod start
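The two orderings can be replayed with a small Go model. It is a sketch under one stated assumption, consistent with the etc-hosts files observed above but not taken from kubelet or Sealos code: a pod's hosts entry is a one-time copy of the node's entry made when the pod starts. The names `event` and `podView` are hypothetical.

```go
package main

import "fmt"

// event is one step in the initialization of a joining master.
type event int

const (
	firstHostsWrite  event = iota // init phase: domain -> master0
	secondHostsWrite              // after kubeadm join: domain -> local node
	podStart                      // assumption: pod copies the node entry once, here
)

// podView replays the events in order and returns the IP the pod ends up
// with for the API server domain.
func podView(order []event, master0IP, localIP string) string {
	nodeEntry, podEntry := "", ""
	for _, e := range order {
		switch e {
		case firstHostsWrite:
			nodeEntry = master0IP
		case secondHostsWrite:
			nodeEntry = localIP
		case podStart:
			podEntry = nodeEntry // snapshot; later node writes are not seen
		}
	}
	return podEntry
}

func main() {
	const master0, local = "192.168.64.15", "192.168.64.16"

	case1 := []event{firstHostsWrite, podStart, secondHostsWrite}
	case2 := []event{firstHostsWrite, secondHostsWrite, podStart}

	fmt.Println("case 1:", podView(case1, master0, local)) // prints "case 1: 192.168.64.15"
	fmt.Println("case 2:", podView(case2, master0, local)) // prints "case 2: 192.168.64.16"
}
```

In case 1 the pod keeps the master0 address even though the node's /etc/hosts is updated afterwards, which is the state seen for kube-controller-manager on master2.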

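Here is what the kubeadm documentation says about the behavior of kubeadm join (the feature gate it refers to is WaitForAllControlPlaneComponents):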

Without the feature gate enabled, kubeadm will only wait for the kube-apiserver on a control plane node to become ready. 
The wait process starts right after the kubelet on the host is started by kubeadm. 
You are advised to enable this feature gate in case you wish to observe a ready state from all control plane components 
during the kubeadm init or kubeadm join command execution.

What is the expected behavior?

No response

What do you see instead?

No response

Operating environment

- Sealos version:
- Docker version:
- Kubernetes version:
- Operating system:
- Runtime environment:
- Cluster size:
- Additional information:

Additional information

No response
