Batch canary with HPA: the number of canary pods does not match the configured percentage #321

@xiaojifan

Description

Problem description:

Our service uses an HPA for automatic scaling. When we run a batched canary release with Rollouts, the first canary step is configured at 1%. But when the HPA scaled the workload up to 28 replicas, the canary count did not match expectations: instead of a single canary pod, 5 canary pods were started.

Deployment manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: ${APP_NAME}
  name: ${APP_NAME}
  namespace: pro-k8s
spec:
  progressDeadlineSeconds: 600
  replicas: ${POD_NUM}
  strategy:
    rollingUpdate:
      maxSurge: 6
      maxUnavailable: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: ${APP_NAME}
  template:
    metadata:
      annotations:
        alibabacloud.com/burst-resource: eci_only
        k8s.aliyun.com/eci-reschedule-enable: "true"
        k8s.aliyun.com/eci-extra-ephemeral-storage: "200Gi"
        k8s.aliyun.com/eci-use-specs: ecs.c6.4xlarge,ecs.ic5.6xlarge,ecs.hfc6.6xlarge
      labels:
        app: ${APP_NAME}
        armsPilotAutoEnable: "$ARMS_SWITCH"
        armsPilotCreateAppName: "$APP_NAME"
    spec:
      containers:
        - env:
            - name: API_ENV
              value: pro
            - name: AppName
              value: ${APP_NAME}
          image: server:$BUILD_TAG
          readinessProbe:
            httpGet:
              path: /health
              port: ${PORT}
            initialDelaySeconds: 180
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 6
          livenessProbe:
            httpGet:
              path: /health
              port: ${PORT}
            initialDelaySeconds: 240
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 18
          imagePullPolicy: Always
          name: ${APP_NAME}
          ports:
            - containerPort: ${PORT}
              protocol: TCP
          resources:
            limits:
              cpu: 16
              memory: 32Gi
            requests:
              cpu: 16
              memory: 32Gi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsConfig:
        options:
        - name: ndots
          value: "3"
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      terminationGracePeriodSeconds: 600
```

Rollout manifest:

```yaml
apiVersion: rollouts.kruise.io/v1beta1
kind: Rollout
metadata:
  name: canary-${APP_NAME}
  namespace: pro-k8s
spec:
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${APP_NAME}
  strategy:
    canary:
      enableExtraWorkloadForCanary: false
      steps:
      - replicas: 1%
      - replicas: 50%
      - replicas: 100%
```

HPA manifest:

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-${APP_NAME}
  namespace: pro-k8s
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${APP_NAME}
  minReplicas: 8
  maxReplicas: 28
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 30
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Pods
        value: 20
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
      - type: Pods
        value: 4
        periodSeconds: 600
```
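For context, this is how the HPA arrives at 28 replicas: the documented scaling formula is `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)`, capped at `maxReplicas`. A minimal sketch in Python; the utilization numbers below are illustrative only, not measured values from our cluster:

```python
import math

def hpa_desired(current_replicas: int, current_util: float, target_util: float) -> int:
    # Standard HPA formula: desired = ceil(current * currentMetric / targetMetric)
    return math.ceil(current_replicas * current_util / target_util)

# Hypothetical example: 8 pods averaging 120% CPU against a 30% target
# request 32 replicas, which maxReplicas then caps at 28.
desired = min(hpa_desired(8, 120, 30), 28)
print(desired)  # 28
```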

Expected result:
With the canary step configured at 1%, only 1 of the 28 running pods should be a canary pod.
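A minimal sketch of the expected arithmetic, assuming Kubernetes-style round-up (ceil) semantics for percentage values (the exact rounding Rollouts applies is not confirmed in this report):

```python
import math

def canary_replicas(total: int, percent: int) -> int:
    # Assumes percentage steps are resolved against the current replica
    # count and rounded up, as Kubernetes does for intstr percentages.
    return math.ceil(total * percent / 100)

print(canary_replicas(28, 1))   # 1 canary pod expected at the 1% step
print(canary_replicas(28, 50))  # 14 at the 50% step
```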

How to reproduce (as concisely and precisely as possible):

  • We have not yet pinned down the exact trigger. In earlier tests where the HPA scaled up to at most 20 pods, the issue did not appear; today it occurred when scaling up to 28.

Environment:
Rollouts version: v0.6.1
Kubernetes cluster version: 1.22.15-aliyun.1
Install details: default Helm install
