Description
What happened?
While running the upgrade cluster playbook, the CRI-O (crio) systemd service is not restarted. As a result, the running CRI-O version is not upgraded during the playbook run, which causes outages later. This happens during a Kubernetes upgrade from version 1.32.11 to version 1.33.7 using the latest Kubespray release, v2.30.0.
After the node is upgraded, the minor release versions of kubelet and crio on that node are out of sync. After manually restarting either the whole node or the crio systemd service, the node does report the newer CRI-O version. So CRI-O is upgraded on disk; the service restart just no longer happens.
What did you expect to happen?
During previous upgrades (Kubernetes 1.31 to 1.32 with Kubespray v2.29.0), the crio systemd service was successfully restarted while the node was cordoned during the node upgrade phase. After the upgrade, the minor release versions of kubelet and crio on each node were in sync.
How can we reproduce it (as minimally and precisely as possible)?
Upgrade a cluster that uses the cri-o container runtime from version 1.32 to version 1.33 with the latest Kubespray release. The cri-o version reported for each node will lag behind the node's kubelet version:
kubectl get nodes -o wide
After a node restart, the node comes back online and crio is (re)started with the correct upgraded version.
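The version mismatch can be checked across all nodes without eyeballing the wide output. This is a minimal sketch (not part of Kubespray) that compares the kubelet and CRI-O minor versions per node; the `.status.nodeInfo` column paths are standard Kubernetes node fields:

```shell
# Flag nodes whose CRI-O minor version lags behind the kubelet minor version.
# Expects rows of "NAME KUBELET RUNTIME", e.g. "worker-01 v1.33.7 cri-o://1.32.12".
check_skew() {
  while read -r node kubelet runtime; do
    k=${kubelet#v}                          # strip leading "v"
    c=${runtime#cri-o://}                   # strip runtime scheme prefix
    k_minor=$(echo "$k" | cut -d. -f2)
    c_minor=$(echo "$c" | cut -d. -f2)
    [ "$k_minor" = "$c_minor" ] || echo "$node: kubelet $k vs cri-o $c"
  done
}

# On a live cluster:
#   kubectl get nodes --no-headers -o custom-columns=\
#   NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion,RUNTIME:.status.nodeInfo.containerRuntimeVersion \
#     | check_skew

# Demo with two rows from the table in this report:
printf '%s\n' 'control-01 v1.33.7 cri-o://1.33.8' \
              'worker-01 v1.33.7 cri-o://1.32.12' | check_skew
# prints: worker-01: kubelet 1.33.7 vs cri-o 1.32.12
```

Only minor versions are compared, since a patch-level difference between kubelet and CRI-O (as on control-01 above) is expected and harmless.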
OS
Ubuntu 24.04
Version of Ansible
ansible [core 2.17.7]
Version of Python
3.12.0
Version of Kubespray (commit)
v2.30.0
Network plugin used
custom_cni
Full inventory with variables
Too sensitive to share
Command used to invoke ansible
ansible-playbook -i "./inventory/dev-1.yaml" ./playbooks/upgrade_cluster.yaml -e "upgrade_node_confirm=true" -e "upgrade_node_post_upgrade_confirm=true"
Output of ansible run
Not the Ansible run output, but here is an overview of the cluster nodes that visualises the problem:
NAME        STATUS  ROLES          AGE  VERSION  INTERNAL-IP  EXTERNAL-IP  OS-IMAGE            KERNEL-VERSION     CONTAINER-RUNTIME
control-01  Ready   control-plane  21h  v1.33.7  10.71.13.35  <none>       Ubuntu 24.04.4 LTS  6.8.0-100-generic  cri-o://1.33.8
control-02  Ready   control-plane  21h  v1.33.7  10.71.13.36  <none>       Ubuntu 24.04.4 LTS  6.8.0-100-generic  cri-o://1.32.12
control-03  Ready   control-plane  21h  v1.33.7  10.71.13.37  <none>       Ubuntu 24.04.4 LTS  6.8.0-100-generic  cri-o://1.32.12
worker-01   Ready   <none>         21h  v1.33.7  10.71.13.41  <none>       Ubuntu 24.04.4 LTS  6.8.0-100-generic  cri-o://1.32.12
worker-02   Ready   <none>         21h  v1.33.7  10.71.13.42  <none>       Ubuntu 24.04.4 LTS  6.8.0-100-generic  cri-o://1.32.12
worker-03   Ready   <none>         20h  v1.33.7  10.71.13.43  <none>       Ubuntu 24.04.4 LTS  6.8.0-100-generic  cri-o://1.32.12
Node control-01 was restarted manually, which is why it reports the correct upgraded cri-o version. The other nodes do not, because the service was never restarted.
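Until the playbook restart is fixed, the workaround is to restart the crio service on each lagging node by hand. A minimal sketch that emits the commands for review before running them; the node list is copied from the table above, and the plain `ssh <node>` form is an assumption (adjust user, jump host, etc. for your environment):

```shell
# Nodes still reporting the old CRI-O version (from the table above).
affected='control-02 control-03 worker-01 worker-02 worker-03'

# Print one restart command per affected node instead of executing directly,
# so the list can be reviewed (or cordoning added) before applying.
gen_restart_cmds() {
  for node in $affected; do
    echo "ssh $node 'sudo systemctl restart crio && crio --version'"
  done
}

gen_restart_cmds   # review the output, then run each line (or pipe to sh)
```

Restarting crio does briefly interrupt the container runtime on that node, so cordoning/draining first, as Kubespray itself did in v2.29.0, is advisable.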
Anything else we need to know
No response