Unexpected higher latencies during rolling upgrade

**Describe the bug**
We have a setup with 3 Kafka nodes. We have a system A producing records to Kafka, which are consumed by a system B. System B does some processing and produces response records to be consumed by system A.

We are experiencing unexpectedly high end-to-end latencies when performing a rolling upgrade with Strimzi. We are using the default configurations.

End-to-end latency during the rolling upgrade (measured at system A):
![kafka-latency-rolling-edited](https://user-images.githubusercontent.com/28529561/222735799-1b11749c-aba9-403c-8b2f-468c2a18b924.png)

We have also tried restarting the Kafka pods manually (sending a SIGTERM to the Kafka process in each pod and waiting for the latencies to stabilize), and we do not see the same behaviour.

End-to-end latency during the manual restarts (measured at system A):

![latency-kafka-manual-edited](https://user-images.githubusercontent.com/28529561/222735805-59dfaa1d-b038-4afd-943b-773bbefd6c7d.png)

While a spike in latencies might be expected during a rolling upgrade, we did not expect to see such a big difference between the manual restarts and the rolling upgrade.
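To compare the two runs in numbers rather than only in graphs, the measurement at system A can be sketched as below. This is a minimal sketch and not the instrumentation behind the graphs above: the `LatencyTracker` class, its method names, and the use of a correlation id to match a response record to its request are all assumptions made for illustration.

```python
import time
from statistics import quantiles


class LatencyTracker:
    """Hypothetical helper: tracks round trips A -> Kafka -> B -> Kafka -> A."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._pending = {}    # correlation id -> send timestamp in seconds
        self.samples_ms = []  # completed round-trip latencies in milliseconds

    def on_send(self, correlation_id):
        # Call right before system A produces the request record.
        self._pending[correlation_id] = self._clock()

    def on_response(self, correlation_id):
        # Call when system A consumes the matching response record.
        sent = self._pending.pop(correlation_id, None)
        if sent is None:
            return None  # unknown or duplicate response, ignored
        latency_ms = (self._clock() - sent) * 1000.0
        self.samples_ms.append(latency_ms)
        return latency_ms

    def summary(self):
        # Median, 99th percentile and maximum over all completed round trips.
        if len(self.samples_ms) < 2:
            raise ValueError("need at least two samples")
        cuts = quantiles(self.samples_ms, n=100, method="inclusive")
        return {"p50": cuts[49], "p99": cuts[98], "max": max(self.samples_ms)}
```

Recording one summary per run (rolling upgrade versus manual restarts) would give p50, p99 and maximum values that can be compared directly.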

**To Reproduce**
We reproduce this by simply triggering a rolling upgrade (with no changes).

**Expected behavior**
We expect latencies at least similar to those we see when performing the restarts manually, but we also don't know whether this is expected behaviour with Kafka.

**Environment (please complete the following information):**
 - Strimzi version: 0.31.1
 - Installation method: Helm chart (via Flux)
 - Kubernetes cluster: 1.23+
 - Infrastructure: Amazon EKS

**YAML files and logs**

Kafka cluster:
```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka
spec:
  kafka:
    version: 3.2.3
    replicas: 3
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.2"
    template:
      pod:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          fsGroup: 1001
          fsGroupChangePolicy: "OnRootMismatch"
          seccompProfile:
            type: RuntimeDefault
      kafkaContainer:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
              - all
    resources:
      requests:
        cpu: 5000m
        memory: 14336Mi
      limits:
        cpu: 5000m
        memory: 14336Mi
    storage:
      type: jbod
      volumes:
      - id: 0
        type: persistent-claim
        size: 20Gi
        deleteClaim: false
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
  zookeeper:
    replicas: 3
    template:
      pod:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          fsGroup: 1001
          fsGroupChangePolicy: "OnRootMismatch"
          seccompProfile:
            type: RuntimeDefault
      zookeeperContainer:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
              - all
    resources:
      requests:
        cpu: 500m
        memory: 500Mi
      limits:
        cpu: 500m
        memory: 500Mi
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: zookeeper-metrics-config.yml
  kafkaExporter:
    groupRegex: ".*"
    topicRegex: ".*"
    resources:
      requests:
        cpu: 200m
        memory: 100Mi
      limits:
        cpu: 200m
        memory: 100Mi
    template:
      pod:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          fsGroup: 1001
          fsGroupChangePolicy: "OnRootMismatch"
          seccompProfile:
            type: RuntimeDefault
      container:
        securityContext:
          runAsNonRoot: true
          runAsUser: 1001
          runAsGroup: 1001
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
              - all
```

**Additional context**
* The Kafka cluster has 3 nodes, as shown above. Our topic replication factor is 3 and minISR is 2. All of our topics have only one partition.
* We are using the Kafka default configurations for our producers. For our use case we do not need auto commits, so we have disabled them for our consumers. Otherwise, we are using the default configurations for our consumers, too.
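
To spell out what these replication settings imply while brokers restart, here is a small worked example. It assumes producers write with `acks=all`, which is an assumption: the default for `acks` depends on the client version (it is `all` for the Java client since Kafka 3.0). The helper name `accepts_acks_all_writes` is made up for illustration.

```python
def accepts_acks_all_writes(in_sync_replicas: int, min_insync_replicas: int) -> bool:
    """A partition accepts acks=all writes only while its ISR size is at least min.insync.replicas."""
    return in_sync_replicas >= min_insync_replicas


REPLICATION_FACTOR = 3  # default.replication.factor from the Kafka resource above
MIN_ISR = 2             # min.insync.replicas from the Kafka resource above

for brokers_down in range(REPLICATION_FACTOR + 1):
    # Simplifying assumption: every replica on a running broker is in sync.
    isr = REPLICATION_FACTOR - brokers_down
    state = "writable" if accepts_acks_all_writes(isr, MIN_ISR) else "rejects acks=all writes"
    print(f"{brokers_down} broker(s) down -> ISR={isr}: {state}")
```

With one broker down at a time the partition stays writable, so a restart should not block writes outright. The sketch does not model leader movement: since every topic has a single partition, restarting the broker that hosts the leader forces a leader election and a client metadata refresh.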

