Description
Checklist:
- I've included steps to reproduce the bug.
- I've included the version of argo rollouts.
Describe the bug
When using the canary strategy with dynamicStableScale enabled, if a newer release (v3) is pushed while an unstable release (v2) is still being deployed, the Argo Rollouts controller shifts all traffic to the stable release (v1) without waiting for it to scale up.
To Reproduce
Deploy a simple rollout with traffic routing and dynamicStableScale enabled, using the manifests below.
Change the image version to 1.25.2 and apply the change. The v1 ReplicaSet will now have 1 replica while v2 has 3.
Change the image to 1.25.3 and apply again. The issue now occurs: the controller immediately shifts all traffic to v1 without waiting for its pods to start.
Because this example uses nginx, the window is very short, but in production this caused an outage.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: test-argo-rollout
  labels:
    app: test-argo-rollout
spec:
  replicas: 4
  strategy:
    canary:
      canaryService: test-argo-rollout-canary
      stableService: test-argo-rollout
      dynamicStableScale: true
      abortScaleDownDelaySeconds: 300
      maxUnavailable: 0
      trafficRouting:
        nginx:
          stableIngress: test-argo-rollout-ingress
      steps:
        - setWeight: 90
        - pause: {}
  selector:
    matchLabels:
      app: test-argo-rollout
  template:
    metadata:
      labels:
        app: test-argo-rollout
    spec:
      containers:
        - name: nginx
          image: nginx:1.25.1
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 80
              name: http
              protocol: TCP
          resources:
            requests:
              cpu: 100m
              memory: 100Mi
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: test-argo-rollout-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  ingressClassName: nginx
  rules:
    - host: <HOST-HERE>
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: test-argo-rollout
                port:
                  number: 80
---
apiVersion: v1
kind: Service
metadata:
  name: test-argo-rollout
  labels:
    app: test-argo-rollout
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: test-argo-rollout
---
apiVersion: v1
kind: Service
metadata:
  name: test-argo-rollout-canary
  labels:
    app: test-argo-rollout
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: http
      protocol: TCP
      name: http
  selector:
    app: test-argo-rollout
Expected behavior
It should behave exactly like an abort of v2: traffic should shift back to v1 gradually, and only then should the rollout move on to v3.
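To make the expectation concrete, here is a minimal Python sketch of the gradual shift I have in mind. It is only an illustration, not the controller's actual logic: the function names `max_safe_stable_weight` and `next_stable_weight` are hypothetical, and it assumes the stable weight should be capped by the fraction of desired replicas that v1 currently has available.

```python
def max_safe_stable_weight(available_stable: int, desired_replicas: int) -> int:
    """Largest traffic percentage the stable ReplicaSet can absorb right now.

    Assumption (not taken from the controller): capacity is proportional to
    the number of available stable pods out of spec.replicas.
    """
    if desired_replicas <= 0:
        raise ValueError("desired_replicas must be positive")
    return min(100, available_stable * 100 // desired_replicas)


def next_stable_weight(current_stable_weight: int,
                       available_stable: int,
                       desired_replicas: int) -> int:
    """Never lower the stable weight; raise it only as far as capacity allows."""
    return max(current_stable_weight,
               max_safe_stable_weight(available_stable, desired_replicas))


if __name__ == "__main__":
    # v2 is paused at setWeight: 90, so v1 currently receives 10% of traffic.
    weight = 10
    # v1 pods become available one by one, out of replicas: 4.
    for available in (1, 2, 3, 4):
        weight = next_stable_weight(weight, available, 4)
        print(f"v1 available={available} -> stable weight {weight}%")
```

With the reproduction above (v1 at 1 of 4 replicas), this sketch would let v1 take at most 25% of the traffic when v3 is pushed, and reach 100% only once all 4 replicas are available, instead of jumping straight to 100%.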
Version
1.6.0
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.