ResourceClaims stuck pending deletion due to finalizer #725

@robertdavidsmith

Description

Hi,

We have an issue with "zombie" ResourceClaims: these claims have deletionTimestamp set but never get removed, because the resource.kubernetes.io/delete-protection finalizer is never removed.

We see these both in workload namespaces and in the nvidia-dra-driver-gpu namespace, for example:

admin@rob1:~$ kc armada-06.ospr-k8s-batch-p.diva-h3  get resourceclaim -A  | grep dele
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f469nr7    deleted,allocated,reserved   11h
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f46rnn2    deleted,allocated,reserved   11h
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4b4zsx    deleted,allocated,reserved   11h
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4hjnll    deleted,allocated,reserved   11h
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4jkrz8    deleted,allocated,reserved   11h
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4jv98p    deleted,allocated,reserved   11h
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4lds8p    deleted,allocated,reserved   11h
nvidia-dra-driver-gpu    armada-16-gang-6adfe845b489c0f441cb31206e-compute-domainwhcrm   deleted,allocated,reserved   11h
nvidia-dra-driver-gpu    armada-8-gang-b05b6e44496ca386828dc86a6f-compute-domainzrlqr    deleted,allocated,reserved   2d1h
plat-jul         armada-01ka387w42rh62z00-armada-8-gang-b05b6e44496ca38685zqtt   deleted,allocated,reserved   2d1h
plat-jul         armada-01ka387w42rh62z00-armada-8-gang-b05b6e44496ca3868bq6z9   deleted,allocated,reserved   2d1h
plat-jul         armada-01ka387w42rh62z00-armada-8-gang-b05b6e44496ca3868hdwzc   deleted,allocated,reserved   2d1h
plat-jul         armada-01ka387w42rh62z00-armada-8-gang-b05b6e44496ca3868qlw2l   deleted,allocated,reserved   2d1h

They stick around forever. Other jobs assigned to the node get stuck Pending because DRA still thinks the resource is taken.

You can get rid of them manually by removing the finalizer as below, after which the stuck-Pending pods start working.

kubectl -n nvidia-dra-driver-gpu patch resourceclaim armada-4-gang-ae6867deeb5c09ae46a0f54fd7-compute-domainqb475 -p '{"metadata": {"finalizers": null}}' --type merge
resourceclaim.resource.k8s.io/armada-4-gang-ae6867deeb5c09ae46a0f54fd7-compute-domainqb475 patched
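For anyone else hitting this, the manual workaround above can be generalized into a small cleanup loop. This is only a sketch of what we do by hand, not an official remedy: it selects claims that have a deletionTimestamp and still carry the delete-protection finalizer, then strips the finalizer from each (the jq filter and loop structure are our own, not part of the driver).

```shell
#!/bin/sh
# Find ResourceClaims stuck in deletion (deletionTimestamp set, but the
# resource.kubernetes.io/delete-protection finalizer was never removed)
# and clear the finalizer so the API server can finish deleting them.
kubectl get resourceclaims -A -o json |
  jq -r '.items[]
    | select(.metadata.deletionTimestamp != null)
    | select(.metadata.finalizers // []
        | index("resource.kubernetes.io/delete-protection"))
    | "\(.metadata.namespace)/\(.metadata.name)"' |
  while read -r claim; do
    ns=${claim%%/*}
    name=${claim#*/}
    # Same patch as the manual command above, applied per stuck claim.
    kubectl -n "$ns" patch resourceclaim "$name" \
      --type merge -p '{"metadata": {"finalizers": null}}'
  done
```

Note this force-releases the claim without the driver's involvement, so it should only be used once the owning pod is confirmed gone.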

In one case, we saw a stuck ResourceClaim appear when an nvidia-imex pod in nvidia-dra-driver-gpu was stuck Terminating due to a broken node. After many hours the stuck pod was manually removed but the ResourceClaim got stuck as described above.

This has been observed with both of these versions of the nvidia-dra-driver-gpu:

  • 25.3.1
  • 25.8.0 (with IMEXDaemonsWithDNSNames: false, true has not been tested)

Full YAML for a stuck ResourceClaim is below.

admin@rob1:~/src/NVIDIA--k8s-dra-driver-gpu$ kc armada-07.ospr-k8s-batch-p.diva-h3 -n gold-fran-userns get -oyaml resourceclaim armada-01k97kv31xceemm2r-armada-6-gang-6265a585030333a5cc4dqm
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  annotations:
    resource.kubernetes.io/pod-claim-name: armada-6-gang-6265a585030333a5ce3864f01d7d65f2
  creationTimestamp: "2025-11-04T14:20:53Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2025-11-04T14:26:02Z"
  finalizers:
  - resource.kubernetes.io/delete-protection
  generateName: armada-01k97kv31xceemm2r-armada-6-gang-6265a585030333a5c
  name: armada-01k97kv31xceemm2r-armada-6-gang-6265a585030333a5cc4dqm
  namespace: gold-fran-userns
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Pod
    name: armada-01k97kv31xceemm2rvyw75gd2j-0
    uid: 3bc03a5a-f278-4ecf-b7ef-fddb2defc70a
  resourceVersion: "94776445"
  uid: 199795a7-7c48-4b04-86b6-25285f5c384a
spec:
  devices:
    config:
    - opaque:
        driver: compute-domain.nvidia.com
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          domainID: 97351e62-4f2f-408c-a9dd-214c7ad7199a
          kind: ComputeDomainChannelConfig
      requests:
      - channel
    requests:
    - allocationMode: ExactCount
      count: 1
      deviceClassName: compute-domain-default-channel.nvidia.com
      name: channel
status:
  allocation:
    devices:
      config:
      - opaque:
          driver: compute-domain.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            domainID: 97351e62-4f2f-408c-a9dd-214c7ad7199a
            kind: ComputeDomainChannelConfig
        requests:
        - channel
        source: FromClaim
      results:
      - adminAccess: null
        device: channel-0
        driver: compute-domain.nvidia.com
        pool: armada-07--gb200-1b9--node2.ospr-k8s-batch-p.diva-h3.c3.zone
        request: channel
    nodeSelector:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - armada-07--gb200-1b9--node2.ospr-k8s-batch-p.diva-h3.c3.zone
  reservedFor:
  - name: armada-01k97kv31xceemm2rvyw75gd2j-0
    resource: pods
    uid: 3bc03a5a-f278-4ecf-b7ef-fddb2defc70a

Thanks for your help,

Rob
