
When adding multiple nodes at a time, not all nodes get added to the loadbalancer #1035

@mikkeldamsgaard

Description

TL;DR

We use Cluster API to manage our Kubernetes clusters on Hetzner. The hcloud-cloud-controller-manager runs as a DaemonSet, and we use a node label selector (via the load-balancer.hetzner.cloud/node-selector annotation) to determine which nodes are added to the load balancer.
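
To make that explicit, here is a minimal sketch of the selector wiring, assuming the same annotation and label as in the full (slightly redacted) manifests further down; the node name is a placeholder:

# Minimal sketch (not our real manifests): the Service's node-selector
# annotation decides which nodes become load balancer targets, and the
# worker nodes created by Cluster API carry the matching label.
apiVersion: v1
kind: Service
metadata:
  name: nginx-controller
  namespace: nginx
  annotations:
    load-balancer.hetzner.cloud/node-selector: node.cluster.x-k8s.io/pool=system
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
  - name: http
    port: 80
    targetPort: http
---
apiVersion: v1
kind: Node
metadata:
  name: worker-example   # placeholder name
  labels:
    node.cluster.x-k8s.io/pool: system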

Adding a single node works fine: that node is added to the load balancer.
However, when we add multiple nodes within a short timeframe, by scaling the MachineDeployment in Cluster API, only a fraction of the new nodes are added to the load balancer.
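
For context, the scale-out that triggers this looks roughly like the following MachineDeployment. This is a hypothetical sketch: the names, Kubernetes version, replica count and referenced templates are placeholders, not copies from our cluster.

# Hypothetical MachineDeployment sketch; bumping spec.replicas (e.g. 2 -> 5)
# creates several nodes within a short timeframe. All names are placeholders.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: system-pool
  namespace: resources
spec:
  clusterName: my-cluster
  replicas: 5                # bumped from 2 in a single step
  selector:
    matchLabels: null
  template:
    metadata:
      labels:
        # assumed to be set here so Cluster API syncs it to the Node,
        # where the load balancer's node-selector annotation matches it
        node.cluster.x-k8s.io/pool: system
    spec:
      clusterName: my-cluster
      version: v1.32.9
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: system-pool
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: HCloudMachineTemplate
        name: system-pool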

If we restart the hcloud-cloud-controller-manager pod, all the nodes get added at that point, so it looks like a missed event is causing this.

DaemonSet (slightly redacted)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "4"
    meta.helm.sh/release-name: hccm
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2025-09-17T06:53:58Z"
  generation: 4
  labels:
    app.kubernetes.io/managed-by: Helm
  name: hcloud-cloud-controller-manager
  namespace: kube-system
  resourceVersion: "2328749"
  uid: f8677337-d242-471b-b933-a866149ab792
spec:
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app.kubernetes.io/instance: hccm
      app.kubernetes.io/name: hcloud-cloud-controller-manager
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: hccm
        app.kubernetes.io/name: hcloud-cloud-controller-manager
    spec:
      containers:
      - command:
        - /bin/hcloud-cloud-controller-manager
        - --allow-untagged-cloud
        - --cloud-provider=hcloud
        - --route-reconciliation-period=30s
        - --webhook-secure-port=0
        env:
        - name: HCLOUD_TOKEN
          valueFrom:
            secretKeyRef:
              key: hcloud
              name: ...
        - name: ROBOT_PASSWORD
          valueFrom:
            secretKeyRef:
              key: robot-password
              name: ...
              optional: true
        - name: ROBOT_USER
          valueFrom:
            secretKeyRef:
              key: robot-user
              name: ...
              optional: true
        image: <privaterepo>/hetznercloud/hcloud-cloud-controller-manager:v1.25.1
        imagePullPolicy: IfNotPresent
        name: hcloud-cloud-controller-manager
        ports:
        - containerPort: 8233
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 100m
            memory: 50Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: Default
      imagePullSecrets:
      - name: kubelet-pull
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: hcloud-cloud-controller-manager
      serviceAccountName: hcloud-cloud-controller-manager
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 1
  desiredNumberScheduled: 1
  numberAvailable: 1
  numberMisscheduled: 0
  numberReady: 1
  observedGeneration: 4
  updatedNumberScheduled: 1

And our service is (slightly redacted)

apiVersion: v1
kind: Service
metadata:
  annotations:
    load-balancer.hetzner.cloud/hostname: ...
    load-balancer.hetzner.cloud/location: hel1
    load-balancer.hetzner.cloud/name: nginx-ingress-gateway-d585053
    load-balancer.hetzner.cloud/node-selector: node.cluster.x-k8s.io/pool=system
    load-balancer.hetzner.cloud/uses-proxyprotocol: "true"
    meta.helm.sh/release-name: nginx
    meta.helm.sh/release-namespace: nginx
  creationTimestamp: "2025-09-18T14:18:34Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: nginx
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
    app.kubernetes.io/version: 1.11.5
    helm.sh/chart: ingress-nginx-4.11.5
  name: nginx-controller
  namespace: nginx
  resourceVersion: "176492"
  uid: 99f53edb-2fc3-4fed-9df8-e78ee9e41b03
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.0.14.15
  clusterIPs:
  - 10.0.14.15
  externalTrafficPolicy: Local
  healthCheckNodePort: 31567
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - appProtocol: http
    name: http
    nodePort: 32756
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    nodePort: 31456
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: nginx
    app.kubernetes.io/name: ingress-nginx
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - hostname: ...

A node that did not get added

apiVersion: v1
kind: Node
metadata:
  annotations:
    cluster.x-k8s.io/annotations-from-machine: ""
    cluster.x-k8s.io/cluster-name: redacted
    cluster.x-k8s.io/cluster-namespace: resources
    cluster.x-k8s.io/labels-from-machine: node-role.kubernetes.io/worker,node.cluster.x-k8s.io/pool
    cluster.x-k8s.io/machine: redacted
    cluster.x-k8s.io/owner-kind: MachineSet
    cluster.x-k8s.io/owner-name: redacted
    csi.volume.kubernetes.io/nodeid: '{"csi.hetzner.cloud":"redacted"}'
    kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2025-09-24T16:23:21Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: cpx31
    beta.kubernetes.io/os: linux
    csi.hetzner.cloud/location: hel1
    failure-domain.beta.kubernetes.io/region: hel1
    failure-domain.beta.kubernetes.io/zone: hel1-dc2
    instance.hetzner.cloud/provided-by: cloud
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: redacted
    kubernetes.io/os: linux
    node-role.kubernetes.io/worker: ""
    node.cluster.x-k8s.io/pool: system
    node.kubernetes.io/instance-type: cpx31
    topology.kubernetes.io/region: hel1
    topology.kubernetes.io/zone: hel1-dc2
  name: redacted
  resourceVersion: "2346916"
  uid: cb639ab5-e783-4ad5-9ec4-a31667c7a04a
spec:
  podCIDR: 10.0.27.0/24
  podCIDRs:
  - 10.0.27.0/24
  providerID: hcloud://109584714
status:
  addresses:
  - address: redacted
    type: Hostname
  - address: redacted
    type: ExternalIP
  allocatable:
    cpu: "4"
    ephemeral-storage: "144873219447"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 7834952Ki
    pods: "220"
  capacity:
    cpu: "4"
    ephemeral-storage: 157197504Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 7937352Ki
    pods: "220"
  conditions:
  - lastHeartbeatTime: "2025-09-24T16:23:46Z"
    lastTransitionTime: "2025-09-24T16:23:46Z"
    message: Cilium is running on this node
    reason: CiliumIsUp
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2025-09-24T17:26:08Z"
    lastTransitionTime: "2025-09-24T16:23:20Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2025-09-24T17:26:08Z"
    lastTransitionTime: "2025-09-24T16:23:20Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2025-09-24T17:26:08Z"
    lastTransitionTime: "2025-09-24T16:23:20Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2025-09-24T17:26:08Z"
    lastTransitionTime: "2025-09-24T16:23:21Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - <redacted>
  nodeInfo:
    architecture: amd64
    bootID: b3c388ca-64f4-4675-b37d-33724743c3d7
    containerRuntimeVersion: containerd://2.1.4
    kernelVersion: 6.8.0-71-generic
    kubeProxyVersion: v1.32.9
    kubeletVersion: v1.32.9
    machineID: af2cf0b1f8de463697d37918daf1d42e
    operatingSystem: linux
    osImage: Ubuntu 24.04.3 LTS
    systemUUID: c916c3e3-bb3b-4cd5-8be6-44d782387a11
  runtimeHandlers:
  - features:
      recursiveReadOnlyMounts: true
      userNamespaces: true
    name: ""
  - features:
      recursiveReadOnlyMounts: true
      userNamespaces: true
    name: runc

Expected behavior

All nodes with the correct label get added to the load balancer.

Observed behavior

Only some of the new nodes get added to the load balancer.

Minimal working example

No response

Log output


Additional information

No response
