
Unexpected higher latencies during rolling upgrade #8189

@helenapoleri

Description


Describe the bug
We have a setup with 3 Kafka nodes. We have a system A producing records to Kafka, which are consumed by a system B. System B does some processing and produces response records to be consumed by system A.

We are experiencing unexpectedly high end-to-end latencies when using Strimzi and performing a rolling upgrade. We are using the default configurations.

End-to-end latency (measured at system A):
[screenshot: kafka-latency-rolling-edited]

We have also tried restarting the Kafka pods manually (sending a SIGTERM to the Kafka process in each pod and waiting for the latencies to stabilize before moving to the next one), and we do not see the same behaviour.

End-to-end latency (measured at system A):

[screenshot: latency-kafka-manual-edited]

While it might be expected that during a rolling upgrade we see a spike in latencies, we were not expecting to see such a big difference between the manual restarts and the rolling upgrade.

To Reproduce
We are reproducing by just triggering a rolling upgrade (with no changes).
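For reference, Strimzi can be asked to roll the Kafka pods without any spec change via its documented `strimzi.io/manual-rolling-update` annotation. A minimal sketch, assuming the cluster is named `kafka` (so the pods are managed by `kafka-kafka`) and that the annotation target is a StatefulSet; with the `UseStrimziPodSets` feature gate enabled, the target would be the StrimziPodSet of the same name instead:

```python
# Sketch: build the kubectl command that asks the Strimzi cluster operator
# to roll the Kafka pods on its next reconciliation. The resource kind and
# name ("statefulset" / "kafka-kafka") are assumptions for this cluster.
import subprocess

cmd = [
    "kubectl", "annotate", "statefulset", "kafka-kafka",
    "strimzi.io/manual-rolling-update=true",
]
print(" ".join(cmd))
# Against a live cluster you would actually execute it, e.g.:
# subprocess.run(cmd, check=True)
```

The operator picks the annotation up during its next reconciliation loop, so the roll may not start immediately.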

Expected behavior
We expect latencies at least similar to those observed when we restart the brokers manually, but we also don't know whether this is expected behaviour with Kafka.

Environment (please complete the following information):

  • Strimzi version: 0.31.1
  • Installation method: Helm chart (via Flux)
  • Kubernetes cluster: 1.23+
  • Infrastructure: Amazon EKS

YAML files and logs

Kafka cluster:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka
spec:
  kafka:
    version: 3.2.3
    replicas: 3
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.2"
    template:
      pod:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          fsGroup: 1001
          fsGroupChangePolicy: "OnRootMismatch"
          seccompProfile:
            type: RuntimeDefault
      kafkaContainer:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
              - all
    resources:
      requests:
        cpu: 5000m
        memory: 14336Mi
      limits:
        cpu: 5000m
        memory: 14336Mi
    storage:
      type: jbod
      volumes:
      - id: 0
        type: persistent-claim
        size: 20Gi
        deleteClaim: false
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
  zookeeper:
    replicas: 3
    template:
      pod:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          fsGroup: 1001
          fsGroupChangePolicy: "OnRootMismatch"
          seccompProfile:
            type: RuntimeDefault
      zookeeperContainer:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
              - all
    resources:
      requests:
        cpu: 500m
        memory: 500Mi
      limits:
        cpu: 500m
        memory: 500Mi
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: zookeeper-metrics-config.yml
  kafkaExporter:
    groupRegex: ".*"
    topicRegex: ".*"
    resources:
      requests:
        cpu: 200m
        memory: 100Mi
      limits:
        cpu: 200m
        memory: 100Mi
    template:
      pod:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          fsGroup: 1001
          fsGroupChangePolicy: "OnRootMismatch"
          seccompProfile:
            type: RuntimeDefault
      container:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
              - all

Additional context

  • The Kafka cluster has 3 nodes, as shown above. Our topics have a replication factor of 3 and min.insync.replicas of 2. All of our topics have only one partition.
  • We are using the default Kafka configurations for our producers. Our use case does not need auto-commit, so we have disabled it for our consumers; otherwise, we use the default configurations for our consumers, too.
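The client settings described above can be summarised in a small sketch. The bootstrap address and group id are hypothetical, and only the auto-commit flag deviates from library defaults (property names follow the standard Kafka client convention):

```python
# Hypothetical client configuration matching the description above;
# everything not listed stays at the client library's defaults.

producer_config = {
    "bootstrap.servers": "kafka-kafka-bootstrap:9092",  # assumed Strimzi bootstrap service
    # all other producer settings left at library defaults
}

consumer_config = {
    "bootstrap.servers": "kafka-kafka-bootstrap:9092",
    "group.id": "system-a",        # hypothetical consumer group
    "enable.auto.commit": False,   # auto-commit disabled; offsets committed manually
    # all other consumer settings left at library defaults
}
```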
