LWS does not support zero-downtime updates when replicas = 1 #688

Description

@bcfre

What happened:
When I deploy a LeaderWorkerSet (LWS) with only one replica, I want to leverage maxSurge to perform a zero-downtime rollout. However, the current implementation does not appear to support this scenario.

What you expected to happen:
A zero-downtime rolling update is vital for online services. With maxSurge set, the existing Pod should keep serving traffic until the updated replica becomes ready, even when replicas = 1.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy LWS with maxSurge = 2 and replicas = 1:
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: leaderworkerset-rollout
spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 1
      maxSurge: 2
  replicas: 1
  leaderWorkerTemplate:
    size: 1
    workerTemplate:
      spec:
        containers:
        - name: nginx
          image: nginxinc/nginx-unprivileged:1.27
          resources:
            limits:
              cpu: "100m"
            requests:
              cpu: "50m"
          ports:
          - containerPort: 8080
  2. Once the LWS is ready, redeploy it with an invalid image:
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: leaderworkerset-rollout
spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 1
      maxSurge: 2
  replicas: 1
  leaderWorkerTemplate:
    size: 1
    workerTemplate:
      spec:
        containers:
        - name: nginx
          image: nginxinc/nginx-unprivileged:1.27-not-exist
          resources:
            limits:
              cpu: "100m"
            requests:
              cpu: "50m"
          ports:
          - containerPort: 8080
  3. There is now only one Pod, and it is stuck in the ImagePullBackOff state:
kubectl get po -l leaderworkerset.sigs.k8s.io/name=leaderworkerset-rollout
NAME                        READY   STATUS             RESTARTS   AGE
leaderworkerset-rollout-0   0/1     ImagePullBackOff   0          48s

Anything else we need to know?:
I suspect this issue is related to the following code snippet:

  1. In the snippet below, the code prematurely scales down the surge replicas in order to release them gradually. With lws.spec.replicas = 1, finalReplicas drops back to 1 while the updated replica is still unready, so the surge Pod is removed before the new Pod is ready (a worked trace follows the snippet).
// wantReplicas calculates the final replicas if needed.
wantReplicas := func(unreadyReplicas int32) int32 {
    if unreadyReplicas <= int32(maxSurge) {
        // When we have n unready replicas and n bursted replicas, we should
        // start to release the burst replica gradually for the accommodation of
        // the unready ones.
        finalReplicas := lwsReplicas + utils.NonZeroValue(int32(unreadyReplicas)-1)
        r.Record.Eventf(lws, corev1.EventTypeNormal, GroupsProgressing,
            fmt.Sprintf("deleting surge replica %s-%d", lws.Name, finalReplicas))
        return finalReplicas
    }
    return burstReplicas
}
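
For illustration, here is a minimal, self-contained trace of that arithmetic with the values from the reproduction (lwsReplicas = 1, maxSurge = 2). The nonZeroValue helper and the burstReplicas definition below are stand-ins, assumed to behave as the snippet above implies; they are not the actual LWS code:

package main

import "fmt"

// nonZeroValue mirrors what utils.NonZeroValue appears to do in the snippet:
// clamp negative values to zero (assumption, not the actual LWS helper).
func nonZeroValue(v int32) int32 {
	if v < 0 {
		return 0
	}
	return v
}

func main() {
	var (
		lwsReplicas   int32 = 1                      // spec.replicas
		maxSurge      int32 = 2                      // rollingUpdateConfiguration.maxSurge
		burstReplicas       = lwsReplicas + maxSurge // assumed: replicas + maxSurge
	)

	// Same arithmetic as the wantReplicas closure quoted above.
	wantReplicas := func(unreadyReplicas int32) int32 {
		if unreadyReplicas <= maxSurge {
			return lwsReplicas + nonZeroValue(unreadyReplicas - 1)
		}
		return burstReplicas
	}

	// In the reported scenario unreadyReplicas reaches 1 (a single replica
	// is not yet ready).
	fmt.Println(wantReplicas(1)) // prints 1: the surge replica is released while
	// the updated Pod is still stuck in ImagePullBackOff.
}

In other words, as soon as unreadyReplicas drops to maxSurge or below, the surge capacity is released based purely on the unready count, without checking that the updated replica has ever become ready. With replicas = 1 there is then no healthy Pod left to serve traffic, which matches the single ImagePullBackOff Pod observed above.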

Environment:

  • Kubernetes version (use kubectl version): Client Version: v1.34.0
    Kustomize Version: v5.7.1
    Server Version: v1.34.1-aliyun.1
  • LWS version (use git describe --tags --dirty --always): v0.7.0
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Metadata

Labels

kind/bug (Categorizes issue or PR as related to a bug.)
