Cri-O not restarted anymore while node is uncordoned #13031

@robinvalk

Description

What happened?

While running the upgrade cluster playbook, the CRI-O (crio) systemd service is not restarted. As a result, the CRI-O version is not upgraded during the playbook run, which causes outages later. This happens during a Kubernetes version upgrade from 1.32.11 to 1.33.7 using the latest Kubespray release, v2.30.0.

After the node is upgraded, you can see that the minor release versions of kubelet and crio on the node are out of sync. After a full reboot of the node, or a manual restart of the crio systemd service, the node does report the newer CRI-O version. So CRI-O itself appears to be upgraded on disk; the restart just doesn't happen anymore.

What did you expect to happen?

During previous upgrades (k8s 1.31 to 1.32 with Kubespray v2.29.0), the crio systemd service was successfully restarted while the node was cordoned during the node upgrade phase. Afterwards, the minor release versions of kubelet and crio on the node were in sync.

How can we reproduce it (as minimally and precisely as possible)?

Upgrade a cluster that uses the cri-o container runtime from version 1.32 to version 1.33 using the latest Kubespray release. You can then see that the cri-o version on each node is not in sync with the kubelet version:

kubectl get nodes -o wide

After a reboot, the node comes back online and crio is (re)started with the correct, upgraded version.
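The version skew can also be spotted mechanically by comparing the kubelet VERSION column against the CONTAINER-RUNTIME column. A minimal sketch, assuming the default column layout of `kubectl get nodes -o wide --no-headers` as shown in the table below (the `check_skew` helper name is made up for illustration):

```shell
#!/bin/sh
# check_skew: reads `kubectl get nodes -o wide --no-headers` output on stdin
# and prints every node whose kubelet minor version differs from the CRI-O
# minor version. Column positions ($1 node name, $5 kubelet version,
# $NF container runtime) are assumed from the table in this report.
check_skew() {
  awk '{
    split($5, k, ".")        # "v1.33.7"         -> k[2] = "33"
    split($NF, c, "[:.]")    # "cri-o://1.32.12" -> c[3] = "32"
    if (k[2] != c[3]) print $1, $5, $NF
  }'
}

# Example usage:
# kubectl get nodes -o wide --no-headers | check_skew
```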

OS

Ubuntu 24.04

Version of Ansible

ansible [core 2.17.7]

Version of Python

3.12.0

Version of Kubespray (commit)

v2.30.0

Network plugin used

custom_cni

Full inventory with variables

Too sensitive to share

Command used to invoke ansible

ansible-playbook -i "./inventory/dev-1.yaml" ./playbooks/upgrade_cluster.yaml -e "upgrade_node_confirm=true" -e "upgrade_node_post_upgrade_confirm=true"

Output of ansible run

This is not the ansible run output, but here is an overview of the cluster nodes that visualises the problem:

NAME                                   STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
control-01   Ready    control-plane   21h   v1.33.7   10.71.13.35   <none>        Ubuntu 24.04.4 LTS   6.8.0-100-generic   cri-o://1.33.8
control-02   Ready    control-plane   21h   v1.33.7   10.71.13.36   <none>        Ubuntu 24.04.4 LTS   6.8.0-100-generic   cri-o://1.32.12
control-03   Ready    control-plane   21h   v1.33.7   10.71.13.37   <none>        Ubuntu 24.04.4 LTS   6.8.0-100-generic   cri-o://1.32.12
worker-01    Ready    <none>          21h   v1.33.7   10.71.13.41   <none>        Ubuntu 24.04.4 LTS   6.8.0-100-generic   cri-o://1.32.12
worker-02    Ready    <none>          21h   v1.33.7   10.71.13.42   <none>        Ubuntu 24.04.4 LTS   6.8.0-100-generic   cri-o://1.32.12
worker-03    Ready    <none>          20h   v1.33.7   10.71.13.43   <none>        Ubuntu 24.04.4 LTS   6.8.0-100-generic   cri-o://1.32.12

Node control-01 was restarted manually, which is why it reports the correct upgraded cri-o version. All other nodes do not, because the service wasn't restarted.

Anything else we need to know

No response

Labels: Ubuntu 24, kind/bug