Cluster status is shown as ready even when Cilium pods are crash-looping #1789

@motjuste

Description

Summary

This does not happen on every machine, but it occurs consistently on one specific machine in Testflinger (details below).

Even after sudo k8s status --wait-ready --timeout 10m reports the cluster status as ready, sudo k8s kubectl get po -A shows the Cilium pods in CrashLoopBackOff.

The Cilium pods do not recover unless I restart k8s with sudo snap restart k8s, after which all pods start working.
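Until the status check is fixed, a post-check can catch the discrepancy before relying on the cluster. The helper below is a hypothetical sketch (not part of the k8s snap) that scans a pod listing for CrashLoopBackOff:

```shell
# Hypothetical helper (not provided by the k8s snap): reads
# `kubectl get po -A --no-headers` output on stdin and fails
# if any pod is stuck in CrashLoopBackOff.
check_no_crashloop() {
  if grep -q 'CrashLoopBackOff'; then
    echo "crash-looping pods detected"
    return 1
  fi
  echo "no crash-looping pods"
  return 0
}

# Example usage (assumes a running cluster), falling back to the
# snap-restart workaround described above:
# sudo k8s kubectl get po -A --no-headers | check_no_crashloop \
#   || sudo snap restart k8s
```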

What Should Happen Instead?

sudo k8s status --wait-ready --timeout 10m must not report cluster status as ready if the Cilium pods are crash-looping.
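A stricter readiness check would need to consider per-pod state, not just node readiness. As a rough sketch (hypothetical logic, not how k8s status actually decides), a predicate over kubectl get po -A --no-headers output could be:

```shell
# Hypothetical readiness predicate: succeeds only when every pod
# reports all containers ready (e.g. "1/1") and a status of
# Running or Completed. Reads `kubectl get po -A --no-headers`
# lines on stdin; columns are NAMESPACE NAME READY STATUS ...
all_pods_ready() {
  awk '{
    split($3, r, "/");                               # READY column, e.g. "0/1"
    if (r[1] != r[2]) bad++;                         # not all containers ready
    if ($4 != "Running" && $4 != "Completed") bad++; # e.g. CrashLoopBackOff, Pending
  } END { exit (bad > 0) }'
}

# Example usage (assumes a running cluster):
# sudo k8s kubectl get po -A --no-headers | all_pods_ready && echo ready
```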

Reproduction Steps

$ sudo snap install k8s --classic --channel 1.32-classic/stable
Bootstrapping the cluster. This may take a few seconds, please wait.
Bootstrapped a new Kubernetes cluster with node address "XX.XX.XX.219:6400".
The node will be 'Ready' to host workloads after the CNI is deployed successfully.
$ sudo k8s enable local-storage --timeout 5m
Enabling local-storage on the cluster. This may take a few seconds, please wait.
local-storage enabled.
$ sudo k8s status --wait-ready --timeout 10m
cluster status:           ready
control plane nodes:      XX.XX.XX.219:6400 (voter)
high availability:        no
datastore:                etcd
network:                  enabled
dns:                      enabled at XX.XX.XX.197
ingress:                  disabled
load-balancer:            disabled
local-storage:            enabled at /var/snap/k8s/common/rawfile-storage
gateway:                  enabled
$ sudo k8s kubectl get po -A
NAMESPACE     NAME                                  READY   STATUS              RESTARTS       AGE
kube-system   cilium-285ff                          0/1     CrashLoopBackOff    5 (2m9s ago)   6m46s
kube-system   cilium-operator-6777577c56-4ppcr      0/1     CrashLoopBackOff    6 (67s ago)    8m12s
kube-system   ck-storage-rawfile-csi-controller-0   0/2     Pending             0              8m19s
kube-system   ck-storage-rawfile-csi-node-blsnd     0/4     ContainerCreating   0              8m19s
kube-system   coredns-fc9c778db-twql4               0/1     Pending             0              8m19s
kube-system   metrics-server-8694c96fb7-c8kfw       0/1     Pending             0              8m19s
$ sudo k8s inspect
Collecting service information
Running inspection on a control-plane node
 INFO:  Service k8s.containerd is running
 INFO:  Service k8s.etcd is running
 INFO:  Service k8s.kube-proxy is running
 INFO:  Service k8s.k8s-dqlite is not running
 WARNING:  Service k8s.k8s-dqlite should be running on this node
 INFO:  Service k8s.k8sd is running
 INFO:  Service k8s.kube-apiserver is running
 INFO:  Service k8s.kube-controller-manager is running
 INFO:  Service k8s.kube-scheduler is running
 INFO:  Service k8s.kubelet is running
Collecting registry mirror logs
Collecting service arguments
 INFO:  Copy service args to the final report tarball
Collecting k8s cluster-info
 INFO:  Copy k8s cluster-info dump to the final report tarball
Collecting SBOM
 INFO:  Copy SBOM to the final report tarball
Collecting system information
 INFO:  Copy processes list to the final report tarball
 INFO:  Copy disk usage information to the final report tarball
 INFO:  Copy /proc/mounts to the final report tarball
 INFO:  Copy memory usage information to the final report tarball
 INFO:  Copy swap information to the final report tarball
 INFO:  Copy node uptime to the final report tarball
 INFO:  Copy /etc/os-release to the final report tarball
 INFO:  Copy loaded kernel modules to the final report tarball
 INFO:  Copy dmesg entries
 INFO:  Collecting core dumps from /var/crash. Size: 1.2M	/var/crash
Collecting snap and related information
 INFO:  Copy uname to the final report tarball
 INFO:  Copy snap diagnostics to the final report tarball
 INFO:  Copy k8s diagnostics to the final report tarball
cp: cannot stat '/var/snap/k8s/common/var/lib/k8s-dqlite/cluster.yaml': No such file or directory
cp: cannot stat '/var/snap/k8s/common/var/lib/k8s-dqlite/info.yaml': No such file or directory
Collecting networking information
 INFO:  Copy network diagnostics to the final report tarball
Building the report tarball
 SUCCESS:  Report tarball is at /home/ubuntu/inspection-report-20250826_171132.tar.gz

System information

While testing on multiple machines, I found that this issue happens consistently on one particular machine in Testflinger. Please see this run in Testflinger.

I am attaching the report from sudo k8s inspect.

inspection-report-20250826_171132.tar.gz

Can you suggest a fix?

No response

Are you interested in contributing with a fix?

No response
