Rolling upgrades (rolling updates in CAPI terms) lead to forever stuck pods in Terminating state and nodes stuck in NotReady,SchedulingDisabled #183

@ader1990

Hello,

In the context of deploying Canonical Kubernetes with Sylva (https://sylva-projects.gitlab.io/), rolling upgrades with MaxSurge=0 and etcd as the datastore get stuck in the initial phase, while the workloads are being moved from the nodes that are about to be removed to the remaining ones.

Context:

  1. Deploy a 4-node (3 CP and 1 MD) Canonical Kubernetes cluster, using the ca2509f image and metal3 as the infra provider. Everything works as expected.
  2. Trigger a rolling upgrade via CAPI, using MaxSurge=0. 1 CP and 1 MD are set to SchedulingDisabled, and the workloads from those two nodes should move to the remaining 2 CPs. Sometimes, the 1 CP and 1 MD are instead stuck with the status NotReady, without the SchedulingDisabled taint.
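
For reference, a sketch of what MaxSurge=0 means at the CAPI level for the worker (MD) side. It assumes the v1beta1 MachineDeployment field layout (spec.strategy.rollingUpdate) and integer values only. How Sylva or the control plane provider sets the equivalent value is not shown in this report, and the helper name is hypothetical.

```python
import json


def machine_deployment_rollout_patch(max_surge=0, max_unavailable=1):
    """Build a JSON merge patch for a MachineDeployment rolling update strategy.

    Assumption: v1beta1 field names. With maxSurge=0 no extra machine is
    created first, so an old machine is removed before its replacement exists.
    """
    if max_surge == 0 and max_unavailable == 0:
        # CAPI documents that maxUnavailable cannot be 0 when maxSurge is 0.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return {
        "spec": {
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {
                    "maxSurge": max_surge,
                    "maxUnavailable": max_unavailable,
                },
            }
        }
    }


# Print the patch body for inspection; nothing is applied to a cluster here.
print(json.dumps(machine_deployment_rollout_patch(), indent=2))
```

This is why the cluster is expected to shrink temporarily during the upgrade: the replacement nodes are only provisioned after the old ones are gone.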

Expected result: the workloads should move to the remaining 2 CPs and the Kubernetes cluster should end up consisting of only 2 CPs (an intermediate state, as another CP and another MD are provisioned afterwards).

Actual result: the workloads do not move, because the pods stay stuck in the Terminating state indefinitely. The 1 CP and the 1 MD remain stuck in the NotReady,SchedulingDisabled state indefinitely.
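
A minimal sketch for confirming these symptoms from the JSON output of `kubectl get pods -A -o json` and `kubectl get nodes -o json`. It assumes that a pod shown as Terminating carries a metadata.deletionTimestamp, and that a node shown as NotReady,SchedulingDisabled has spec.unschedulable set and a Ready condition whose status is not True. All function names are hypothetical and not part of k8s-snap or CAPI.

```python
import json
import subprocess


def terminating_pods(pod_list):
    """Return namespace/name of every pod that has a deletion timestamp set."""
    return [
        f"{p['metadata']['namespace']}/{p['metadata']['name']}"
        for p in pod_list.get("items", [])
        if p["metadata"].get("deletionTimestamp")
    ]


def node_status(node):
    """Build a string similar to the STATUS column of `kubectl get nodes`."""
    conditions = node.get("status", {}).get("conditions", [])
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )
    parts = ["Ready" if ready else "NotReady"]
    if node.get("spec", {}).get("unschedulable"):
        parts.append("SchedulingDisabled")
    return ",".join(parts)


def stuck_nodes(node_list):
    """Map node name to status for every node that is not plainly Ready."""
    result = {}
    for node in node_list.get("items", []):
        status = node_status(node)
        if status != "Ready":
            result[node["metadata"]["name"]] = status
    return result


def kubectl_json(*args):
    """Run kubectl with JSON output and parse it (requires cluster access)."""
    completed = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)
```

On a machine with cluster access, `terminating_pods(kubectl_json("get", "pods", "-A"))` and `stuck_nodes(kubectl_json("get", "nodes"))` would list the affected pods and nodes. Running them a few minutes apart shows whether the set ever shrinks.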

Reproducibility: always.

OS used: Ubuntu Server 24.04 with latest updates.
Canonical Kubernetes (k8s-snap) versions used: 1.32.8 and 1.32.9, with etcd. The dqlite variant does not show this issue (it fails to perform rolling upgrades in a different way, unrelated to this one).

Investigation done: on the to-be-deleted CP, the underlying k8s-snap etcd process is killed very early, well before the workloads have time to move, and k8s status shows the node as not bootstrapped. It seems that the cleanup is done very eagerly, so the Kubernetes cluster does not have time to move the workloads or to respond to the node removals.
