Description
TL;DR
We use Cluster API to manage our Kubernetes clusters in Hetzner. We have set up the hcloud-cloud-controller-manager as a DaemonSet and use a label selector to determine which nodes to add to the load balancer.
When we add a single node, it is added to the load balancer as expected.
However, when we add multiple nodes within a short timeframe, by scaling the MachineDeployment in Cluster API, only a fraction of the new nodes are added to the load balancer.
If we restart the hcloud-cloud-controller-manager pod, all nodes are added at that point, so it looks like a missed event is causing this.
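For context, the scale-up that triggers this is just a bump of spec.replicas on the MachineDeployment backing the pool. A minimal sketch, with placeholder names (our real manifests are redacted), looks like this:
```yaml
# Hypothetical sketch only; real cluster and pool names are redacted.
# The trigger is raising spec.replicas so that several Machines (and hence
# Nodes labelled node.cluster.x-k8s.io/pool=system) register within a short window.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: system-pool        # placeholder
  namespace: resources
spec:
  clusterName: redacted
  replicas: 4              # previously 1; selector/template omitted for brevity
```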
DaemonSet (slightly redacted):
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "4"
    meta.helm.sh/release-name: hccm
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2025-09-17T06:53:58Z"
  generation: 4
  labels:
    app.kubernetes.io/managed-by: Helm
  name: hcloud-cloud-controller-manager
  namespace: kube-system
  resourceVersion: "2328749"
  uid: f8677337-d242-471b-b933-a866149ab792
spec:
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app.kubernetes.io/instance: hccm
      app.kubernetes.io/name: hcloud-cloud-controller-manager
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: hccm
        app.kubernetes.io/name: hcloud-cloud-controller-manager
    spec:
      containers:
      - command:
        - /bin/hcloud-cloud-controller-manager
        - --allow-untagged-cloud
        - --cloud-provider=hcloud
        - --route-reconciliation-period=30s
        - --webhook-secure-port=0
        env:
        - name: HCLOUD_TOKEN
          valueFrom:
            secretKeyRef:
              key: hcloud
              name: ...
        - name: ROBOT_PASSWORD
          valueFrom:
            secretKeyRef:
              key: robot-password
              name: ...
              optional: true
        - name: ROBOT_USER
          valueFrom:
            secretKeyRef:
              key: robot-user
              name: ...
              optional: true
        image: <privaterepo>/hetznercloud/hcloud-cloud-controller-manager:v1.25.1
        imagePullPolicy: IfNotPresent
        name: hcloud-cloud-controller-manager
        ports:
        - containerPort: 8233
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 100m
            memory: 50Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: Default
      imagePullSecrets:
      - name: kubelet-pull
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      priorityClassName: system-cluster-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: hcloud-cloud-controller-manager
      serviceAccountName: hcloud-cloud-controller-manager
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 1
  desiredNumberScheduled: 1
  numberAvailable: 1
  numberMisscheduled: 0
  numberReady: 1
  observedGeneration: 4
  updatedNumberScheduled: 1
```
And our Service (slightly redacted):
```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    load-balancer.hetzner.cloud/hostname: ...
    load-balancer.hetzner.cloud/location: hel1
    load-balancer.hetzner.cloud/name: nginx-ingress-gateway-d585053
    load-balancer.hetzner.cloud/node-selector: node.cluster.x-k8s.io/pool=system
    load-balancer.hetzner.cloud/uses-proxyprotocol: "true"
    meta.helm.sh/release-name: nginx
    meta.helm.sh/release-namespace: nginx
  creationTimestamp: "2025-09-18T14:18:34Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: nginx
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
    app.kubernetes.io/version: 1.11.5
    helm.sh/chart: ingress-nginx-4.11.5
  name: nginx-controller
  namespace: nginx
  resourceVersion: "176492"
  uid: 99f53edb-2fc3-4fed-9df8-e78ee9e41b03
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.0.14.15
  clusterIPs:
  - 10.0.14.15
  externalTrafficPolicy: Local
  healthCheckNodePort: 31567
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - appProtocol: http
    name: http
    nodePort: 32756
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    nodePort: 31456
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: nginx
    app.kubernetes.io/name: ingress-nginx
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - hostname: ...
```
A Node that did not get added:
```yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    cluster.x-k8s.io/annotations-from-machine: ""
    cluster.x-k8s.io/cluster-name: redacted
    cluster.x-k8s.io/cluster-namespace: resources
    cluster.x-k8s.io/labels-from-machine: node-role.kubernetes.io/worker,node.cluster.x-k8s.io/pool
    cluster.x-k8s.io/machine: redacted
    cluster.x-k8s.io/owner-kind: MachineSet
    cluster.x-k8s.io/owner-name: redacted
    csi.volume.kubernetes.io/nodeid: '{"csi.hetzner.cloud":"redacted"}'
    kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2025-09-24T16:23:21Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: cpx31
    beta.kubernetes.io/os: linux
    csi.hetzner.cloud/location: hel1
    failure-domain.beta.kubernetes.io/region: hel1
    failure-domain.beta.kubernetes.io/zone: hel1-dc2
    instance.hetzner.cloud/provided-by: cloud
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: redacted
    kubernetes.io/os: linux
    node-role.kubernetes.io/worker: ""
    node.cluster.x-k8s.io/pool: system
    node.kubernetes.io/instance-type: cpx31
    topology.kubernetes.io/region: hel1
    topology.kubernetes.io/zone: hel1-dc2
  name: redacted
  resourceVersion: "2346916"
  uid: cb639ab5-e783-4ad5-9ec4-a31667c7a04a
spec:
  podCIDR: 10.0.27.0/24
  podCIDRs:
  - 10.0.27.0/24
  providerID: hcloud://109584714
status:
  addresses:
  - address: redacted
    type: Hostname
  - address: redacted
    type: ExternalIP
  allocatable:
    cpu: "4"
    ephemeral-storage: "144873219447"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 7834952Ki
    pods: "220"
  capacity:
    cpu: "4"
    ephemeral-storage: 157197504Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 7937352Ki
    pods: "220"
  conditions:
  - lastHeartbeatTime: "2025-09-24T16:23:46Z"
    lastTransitionTime: "2025-09-24T16:23:46Z"
    message: Cilium is running on this node
    reason: CiliumIsUp
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2025-09-24T17:26:08Z"
    lastTransitionTime: "2025-09-24T16:23:20Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2025-09-24T17:26:08Z"
    lastTransitionTime: "2025-09-24T16:23:20Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2025-09-24T17:26:08Z"
    lastTransitionTime: "2025-09-24T16:23:20Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2025-09-24T17:26:08Z"
    lastTransitionTime: "2025-09-24T16:23:21Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - <redacted>
  nodeInfo:
    architecture: amd64
    bootID: b3c388ca-64f4-4675-b37d-33724743c3d7
    containerRuntimeVersion: containerd://2.1.4
    kernelVersion: 6.8.0-71-generic
    kubeProxyVersion: v1.32.9
    kubeletVersion: v1.32.9
    machineID: af2cf0b1f8de463697d37918daf1d42e
    operatingSystem: linux
    osImage: Ubuntu 24.04.3 LTS
    systemUUID: c916c3e3-bb3b-4cd5-8be6-44d782387a11
  runtimeHandlers:
  - features:
      recursiveReadOnlyMounts: true
      userNamespaces: true
    name: ""
  - features:
      recursiveReadOnlyMounts: true
      userNamespaces: true
    name: runc
```
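For reference, the pairing that should drive load balancer membership is the Service's node-selector annotation and the matching node label; both snippets below are copied from the manifests above, so the node should have been picked up:
```yaml
# Service side: hccm should only attach nodes matching this selector
metadata:
  annotations:
    load-balancer.hetzner.cloud/node-selector: node.cluster.x-k8s.io/pool=system
---
# Node side: the node that was not added does carry the matching label
metadata:
  labels:
    node.cluster.x-k8s.io/pool: system
```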
Expected behavior
All nodes with the correct label get added to the load balancer.
Observed behavior
Only some of the new nodes get added to the load balancer.
Minimal working example
No response
Log output
Additional information
No response