Summary
This issue is related to an upstream bug (1) and often blocks k8s cluster deployments.
The symptom is Cilium pods going into CrashLoopBackOff state upon new cluster deployments:
$ kubectl get pod --all-namespaces
NAMESPACE     NAME                                         READY   STATUS              RESTARTS        AGE
default       csi-rbdplugin-provisioner-6f5fb97894-78wtm   0/6     Pending             0               10m
default       csi-rbdplugin-provisioner-6f5fb97894-rq9tk   0/6     Pending             0               10m
default       csi-rbdplugin-provisioner-6f5fb97894-rqjjh   0/6     Pending             0               10m
kube-system   cilium-gh6mk                                 0/1     CrashLoopBackOff    7 (8s ago)      11m
kube-system   cilium-hsj46                                 0/1     CrashLoopBackOff    7 (86s ago)     12m
kube-system   cilium-operator-7467567bb8-5d8kz             1/1     Running             1 (11m ago)     13m
kube-system   cilium-xq546                                 0/1     CrashLoopBackOff    6 (4m50s ago)   11m
kube-system   ck-storage-rawfile-csi-controller-0          0/2     Pending             0               13m
kube-system   ck-storage-rawfile-csi-node-lmt86            0/4     ContainerCreating   0               13m
kube-system   ck-storage-rawfile-csi-node-v98lj            0/4     ContainerCreating   0               11m
kube-system   ck-storage-rawfile-csi-node-xgk87            0/4     ContainerCreating   0               11m
kube-system   coredns-fc9c778db-hnb5w                      0/1     Pending             0               13m
kube-system   metrics-server-8694c96fb7-pdwsj              0/1     Pending             0               13m
Reading Cilium pod logs shows:
time="2025-09-05T19:32:31.412648669Z" level=fatal msg="failed to start: daemon creation failed: unable to determine direct routing device. Use --direct-routing-device to specify it\nfailed to stop: unable to find controller ipcache-inject-labels" subsys=daemon
The current workaround is to use a juju config option to tell the Cilium pods which interface handles the routing:
juju config k8s cluster-annotations="k8sd/v1alpha1/cilium/devices=ens+"
After applying this, the pods are able to bootstrap successfully.
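The trailing + in ens+ is a prefix wildcard, so the pattern must match the NIC naming on the target nodes. One way to find a suitable name is to check which interface carries the default route on each node (a sketch; the awk field assumes the usual ip route output layout):

# On each cluster node: print the interface carrying the default route,
# e.g. "ens3"; then use a matching pattern such as ens+ in the annotation.
ip -o route show default | awk '{print $5}'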
This issue has been observed on both OpenStack-backed and MAAS-backed deployments.
What Should Happen Instead?
The Cilium bootstrapping process should complete automatically, without manual intervention.
Reproduction Steps
Example bundle to reproduce the issue:
default-base: ubuntu@22.04/stable
applications:
  k8s:
    charm: k8s
    channel: 1.32/stable
    num_units: 3
    to:
    - "0"
    - "1"
    - "2"
    expose: true
    options:
      bootstrap-pod-cidr: 172.16.0.0/16
      bootstrap-service-cidr: 10.152.183.0/24
      ingress-enabled: true
    constraints: arch=amd64
machines:
  "0":
    constraints: arch=amd64 instance-type=c1.large zones=az1
  "1":
    constraints: arch=amd64 instance-type=c1.large zones=az1
  "2":
    constraints: arch=amd64 instance-type=c1.large zones=az1

Example bundle including the workaround:
default-base: ubuntu@22.04/stable
applications:
  k8s:
    charm: k8s
    channel: 1.32/stable
    num_units: 3
    to:
    - "0"
    - "1"
    - "2"
    expose: true
    options:
      bootstrap-pod-cidr: 172.16.0.0/16
      bootstrap-service-cidr: 10.152.183.0/24
      cluster-annotations: k8sd/v1alpha1/cilium/devices=ens+
      ingress-enabled: true
    constraints: arch=amd64
machines:
  "0":
    constraints: arch=amd64 instance-type=c1.large zones=az1
  "1":
    constraints: arch=amd64 instance-type=c1.large zones=az1
  "2":
    constraints: arch=amd64 instance-type=c1.large zones=az1

Please change devices=ens+ to match the target NIC names.
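To confirm the workaround is in place and the agent pods have recovered, the annotation can be read back and the pods re-checked (assuming the standard k8s-app=cilium label on the Cilium agent pods):

# Read back the configured annotations, then re-check the Cilium agents.
juju config k8s cluster-annotations
kubectl get pod -n kube-system -l k8s-app=cilium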
System information
This is reproducible with k8s 1.32 LTS on both Ubuntu 22.04 LTS and Ubuntu 24.04 LTS.
As stated above, MAAS-backed and OpenStack-backed deployments show the same behavior.
Can you suggest a fix?
As suggested on the upstream bug (1), it would be nice to detect the physical NICs and configure this automatically so that Cilium can bootstrap without manual intervention.
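For illustration, a minimal sketch of what that could look like, assuming the interface carrying the default route is the direct routing device (an assumption; bonded or multi-homed nodes would need more careful selection):

# Hypothetical sketch: derive the direct routing device from the default
# route on a node and pass it to the charm as the Cilium devices annotation.
# Assumes a single default route; not a definitive implementation.
DEV="$(ip -o route show default | awk '{print $5}' | head -n1)"
juju config k8s cluster-annotations="k8sd/v1alpha1/cilium/devices=${DEV}"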
Are you interested in contributing with a fix?
Yes!