LWS does not support zero-downtime updates when replicas = 1 #688

Description

@bcfre

What happened:
When I deploy a LeaderWorkerSet (LWS) with only one replica, I want to leverage maxSurge to perform a zero-downtime rollout. However, the current implementation does not appear to support this scenario.

What you expected to happen:
A zero-downtime rolling update is vital for online services. With maxSurge set, the existing Pod should keep serving traffic until the updated replica becomes ready, even when replicas = 1.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy LWS with maxSurge = 2 and replicas = 1:
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: leaderworkerset-rollout
spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 1
      maxSurge: 2
  replicas: 1
  leaderWorkerTemplate:
    size: 1
    workerTemplate:
      spec:
        containers:
        - name: nginx
          image: nginxinc/nginx-unprivileged:1.27
          resources:
            limits:
              cpu: "100m"
            requests:
              cpu: "50m"
          ports:
          - containerPort: 8080
  2. Once the LWS is ready, redeploy it with an invalid image:
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: leaderworkerset-rollout
spec:
  rolloutStrategy:
    type: RollingUpdate
    rollingUpdateConfiguration:
      maxUnavailable: 1
      maxSurge: 2
  replicas: 1
  leaderWorkerTemplate:
    size: 1
    workerTemplate:
      spec:
        containers:
        - name: nginx
          image: nginxinc/nginx-unprivileged:1.27-not-exist
          resources:
            limits:
              cpu: "100m"
            requests:
              cpu: "50m"
          ports:
          - containerPort: 8080
  3. There is now only one Pod, and it is stuck in the ImagePullBackOff state:
kubectl get po -l leaderworkerset.sigs.k8s.io/name=leaderworkerset-rollout
NAME                        READY   STATUS             RESTARTS   AGE
leaderworkerset-rollout-0   0/1     ImagePullBackOff   0          48s

Anything else we need to know?:
I suspect this issue is related to the following code snippet:

  1. In the snippet below, the code prematurely scales down the surge replicas in order to release them gradually. With lws.spec.replicas = 1, finalReplicas drops back to 1 while the updated replica is still unready, so the surge Pod is removed before the new Pod is ready (a worked trace follows the snippet).
// wantReplicas calculates the final replicas if needed.
wantReplicas := func(unreadyReplicas int32) int32 {
    if unreadyReplicas <= int32(maxSurge) {
        // When we have n unready replicas and n bursted replicas, we should
        // start to release the burst replica gradually for the accommodation of
        // the unready ones.
        finalReplicas := lwsReplicas + utils.NonZeroValue(int32(unreadyReplicas)-1)
        r.Record.Eventf(lws, corev1.EventTypeNormal, GroupsProgressing,
            fmt.Sprintf("deleting surge replica %s-%d", lws.Name, finalReplicas))
        return finalReplicas
    }
    return burstReplicas
}
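
For illustration, here is a minimal, self-contained trace of that arithmetic with the values from the reproduction (lwsReplicas = 1, maxSurge = 2). The nonZeroValue helper and the burstReplicas definition below are stand-ins, assumed to behave as the snippet above implies; they are not the actual LWS code:

package main

import "fmt"

// nonZeroValue mirrors what utils.NonZeroValue appears to do in the snippet:
// clamp negative values to zero (assumption, not the actual LWS helper).
func nonZeroValue(v int32) int32 {
	if v < 0 {
		return 0
	}
	return v
}

func main() {
	var (
		lwsReplicas   int32 = 1                      // spec.replicas
		maxSurge      int32 = 2                      // rollingUpdateConfiguration.maxSurge
		burstReplicas       = lwsReplicas + maxSurge // assumed: replicas + maxSurge
	)

	// Same arithmetic as the wantReplicas closure quoted above.
	wantReplicas := func(unreadyReplicas int32) int32 {
		if unreadyReplicas <= maxSurge {
			return lwsReplicas + nonZeroValue(unreadyReplicas - 1)
		}
		return burstReplicas
	}

	// In the reported scenario unreadyReplicas reaches 1 (a single replica
	// is not yet ready).
	fmt.Println(wantReplicas(1)) // prints 1: the surge replica is released while
	// the updated Pod is still stuck in ImagePullBackOff.
}

In other words, as soon as unreadyReplicas drops to maxSurge or below, the surge capacity is released based purely on the unready count, without checking that the updated replica has ever become ready. With replicas = 1 there is then no healthy Pod left to serve traffic, which matches the single ImagePullBackOff Pod observed above.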

Environment:

  • Kubernetes version (use kubectl version): Client Version: v1.34.0
    Kustomize Version: v5.7.1
    Server Version: v1.34.1-aliyun.1
  • LWS version (use git describe --tags --dirty --always): v0.7.0
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Metadata

Labels

kind/bug (Categorizes issue or PR as related to a bug.)
