
Volume still hang on Karpenter Node Consolidation/Termination #1955

Closed
@levanlongktmt

Description


/kind bug

What happened?
As discussed in #1665, @torredil said this was fixed in v1.27 (#1665 (comment)), but we still see the problem with v1.28

  • A pod using volume pv-A is running on node N1
  • Karpenter terminates the pod and then terminates node N1
  • Kubernetes starts a new pod and tries to attach volume pv-A, but it still has to wait 6 minutes for the volume to be released and attached to the new pod

What you expected to happen?

  • After the old pod has been terminated, pv-A should be released and able to attach to the new pod

How to reproduce it (as minimally and precisely as possible)?

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: dev
spec:
  version: 8.12.2
  volumeClaimDeletePolicy: DeleteOnScaledownAndClusterDeletion
  updateStrategy:
    changeBudget:
      maxSurge: 2
      maxUnavailable: 1
  nodeSets:
  - name: default
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data # Do not change this name unless you set up a volume mount for the data path.
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 25Gi
    podTemplate:
      spec:
        nodeSelector:
          kubernetes.io/arch: arm64
          topology.kubernetes.io/zone: eu-central-1a
        containers:
        - name: elasticsearch
          env:
            - name: ES_JAVA_OPTS
              value: -Xms4g -Xmx4g
          resources:
            requests:
              memory: 5Gi
              cpu: 1
            limits:
              memory: 5Gi
              cpu: 2
    config:
      node.store.allow_mmap: false
```
  • Trigger a spot instance interruption, or just delete one EC2 instance
  • The node is removed from Kubernetes very quickly; the old pod is Terminated and Kubernetes starts a new pod
  • The new pod is stuck for 6 minutes with the error Multi-Attach error for volume "pvc-xxxxx-xxxxx-xxx" Volume is already exclusively attached to one node and can't be attached to another
  • After 6 minutes, the new pod can attach the volume
  • Here are the logs of ebs-csi-controller:
```
I0302 06:12:10.305080       1 controller.go:430] "ControllerPublishVolume: attached" volumeID="vol-02b33186429105461" nodeID="i-0715ec90e486bb8a1" devicePath="/dev/xvdaa"
<< at 06:14 the node has been terminated but no logs here >>
I0302 06:20:18.486042       1 controller.go:471] "ControllerUnpublishVolume: detaching" volumeID="vol-02b33186429105461" nodeID="i-0715ec90e486bb8a1"
I0302 06:20:18.584737       1 cloud.go:792] "DetachDisk: called on non-attached volume" volumeID="vol-02b33186429105461"
I0302 06:20:18.807752       1 controller.go:474] "ControllerUnpublishVolume: attachment not found" volumeID="vol-02b33186429105461" nodeID="i-0715ec90e486bb8a1"
I0302 06:20:19.124534       1 controller.go:421] "ControllerPublishVolume: attaching" volumeID="vol-02b33186429105461" nodeID="i-0ee2a470112401ffb"
I0302 06:20:20.635493       1 controller.go:430] "ControllerPublishVolume: attached" volumeID="vol-02b33186429105461" nodeID="i-0ee2a470112401ffb" devicePath="/dev/xvdaa"
```
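The silent gap in the log is consistent with the kube-controller-manager's forced-detach window, which defaults to 6 minutes for volumes still marked attached to a gone node. A quick sanity check of the interval, using the ~06:14 termination time noted above (the exact second is an assumption):

```shell
# Timestamps taken from the controller log: node terminated around 06:14
# (assumed :00 seconds), detach finally issued at 06:20:18.
terminated="06:14:00"
detached="06:20:18"

# Convert HH:MM:SS to seconds since midnight; 10# forces base-10 so
# values like "08" are not parsed as invalid octal.
to_secs() {
  h=${1%%:*}; rest=${1#*:}
  m=${rest%%:*}; s=${rest#*:}
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

gap=$(( $(to_secs "$detached") - $(to_secs "$terminated") ))
echo "${gap}s"   # → 378s, i.e. roughly the 6-minute forced-detach window
```

The delay therefore looks like the attach/detach controller waiting out its timeout rather than the CSI driver itself being slow to detach.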

Anything else we need to know?:
I set up the CSI driver using the EKS add-on.
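For reference, the stale VolumeAttachment that still binds the volume to the terminated node can be inspected directly; deleting it by hand is a common manual workaround to unblock the new pod before the timeout expires. A minimal sketch that filters for the affected PV (the sample output and all names are illustrative placeholders, canned here so the snippet is self-contained; on a live cluster you would pipe the real `kubectl get volumeattachment -o wide` output instead):

```shell
# Canned `kubectl get volumeattachment -o wide`-style output for illustration.
sample_output='NAME        ATTACHER          PV                   NODE                  ATTACHED
csi-abc123  ebs.csi.aws.com   pvc-1111-2222-3333   i-0715ec90e486bb8a1   true
csi-def456  ebs.csi.aws.com   pvc-9999-8888-7777   i-0ee2a470112401ffb   true'

PV_NAME="pvc-1111-2222-3333"   # placeholder for the PV bound to the stuck PVC

# An attachment that still references the terminated node is what triggers
# the Multi-Attach error; print its name so it can be deleted, e.g.
#   kubectl delete volumeattachment <name>
stale=$(printf '%s\n' "$sample_output" | awk -v pv="$PV_NAME" '$3 == pv {print $1}')
echo "$stale"   # → csi-abc123
```

Deleting the VolumeAttachment only removes the stale API object; it does not touch the EBS volume itself, so it is safe once the old node is actually gone.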
Environment

  • Kubernetes version (use kubectl version):
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.0-eks-c417bb3
  • Driver version: v1.28.0-eksbuild.1
