ResourceClaims stuck pending deletion due to finalizer #725

@robertdavidsmith

Description

Hi,

We have an issue with "zombie" ResourceClaims: these claims have deletionTimestamp set but never get removed, because the resource.kubernetes.io/delete-protection finalizer is never removed.

We see these both in workload namespaces and in the nvidia-dra-driver-gpu namespace, for example:

admin@rob1:~$ kc armada-06.ospr-k8s-batch-p.diva-h3  get resourceclaim -A  | grep dele
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f469nr7    deleted,allocated,reserved   11h
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f46rnn2    deleted,allocated,reserved   11h
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4b4zsx    deleted,allocated,reserved   11h
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4hjnll    deleted,allocated,reserved   11h
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4jkrz8    deleted,allocated,reserved   11h
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4jv98p    deleted,allocated,reserved   11h
gold-tom           armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4lds8p    deleted,allocated,reserved   11h
nvidia-dra-driver-gpu    armada-16-gang-6adfe845b489c0f441cb31206e-compute-domainwhcrm   deleted,allocated,reserved   11h
nvidia-dra-driver-gpu    armada-8-gang-b05b6e44496ca386828dc86a6f-compute-domainzrlqr    deleted,allocated,reserved   2d1h
plat-jul         armada-01ka387w42rh62z00-armada-8-gang-b05b6e44496ca38685zqtt   deleted,allocated,reserved   2d1h
plat-jul         armada-01ka387w42rh62z00-armada-8-gang-b05b6e44496ca3868bq6z9   deleted,allocated,reserved   2d1h
plat-jul         armada-01ka387w42rh62z00-armada-8-gang-b05b6e44496ca3868hdwzc   deleted,allocated,reserved   2d1h
plat-jul         armada-01ka387w42rh62z00-armada-8-gang-b05b6e44496ca3868qlw2l   deleted,allocated,reserved   2d1h

They stick around forever. Other jobs assigned to the node get stuck Pending because DRA still thinks the resource is taken.

You can get rid of them manually by removing the finalizer as below, after which the stuck-Pending pods start working.

kubectl -n nvidia-dra-driver-gpu patch resourceclaim armada-4-gang-ae6867deeb5c09ae46a0f54fd7-compute-domainqb475 -p '{"metadata": {"finalizers": null}}' --type merge
resourceclaim.resource.k8s.io/armada-4-gang-ae6867deeb5c09ae46a0f54fd7-compute-domainqb475 patched
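For anyone else hitting this, the manual workaround above can be generalized into a small cleanup loop. This is only a sketch of what we do by hand, not an official remedy: it selects claims that have a deletionTimestamp and still carry the delete-protection finalizer, then strips the finalizer from each (the jq filter and loop structure are our own, not part of the driver).

```shell
#!/bin/sh
# Find ResourceClaims stuck in deletion (deletionTimestamp set, but the
# resource.kubernetes.io/delete-protection finalizer was never removed)
# and clear the finalizer so the API server can finish deleting them.
kubectl get resourceclaims -A -o json |
  jq -r '.items[]
    | select(.metadata.deletionTimestamp != null)
    | select(.metadata.finalizers // []
        | index("resource.kubernetes.io/delete-protection"))
    | "\(.metadata.namespace)/\(.metadata.name)"' |
  while read -r claim; do
    ns=${claim%%/*}
    name=${claim#*/}
    # Same patch as the manual command above, applied per stuck claim.
    kubectl -n "$ns" patch resourceclaim "$name" \
      --type merge -p '{"metadata": {"finalizers": null}}'
  done
```

Note this force-releases the claim without the driver's involvement, so it should only be used once the owning pod is confirmed gone.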

In one case, we saw a stuck ResourceClaim appear when an nvidia-imex pod in nvidia-dra-driver-gpu was stuck Terminating due to a broken node. After many hours the stuck pod was manually removed but the ResourceClaim got stuck as described above.

This has been observed with both of these versions of the nvidia-dra-driver-gpu:

  • 25.3.1
  • 25.8.0 (with IMEXDaemonsWithDNSNames: false, true has not been tested)

Full YAML for a stuck ResourceClaim is below.

admin@rob1:~/src/NVIDIA--k8s-dra-driver-gpu$ kc armada-07.ospr-k8s-batch-p.diva-h3 -n gold-fran-userns get -oyaml resourceclaim armada-01k97kv31xceemm2r-armada-6-gang-6265a585030333a5cc4dqm
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  annotations:
    resource.kubernetes.io/pod-claim-name: armada-6-gang-6265a585030333a5ce3864f01d7d65f2
  creationTimestamp: "2025-11-04T14:20:53Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2025-11-04T14:26:02Z"
  finalizers:
  - resource.kubernetes.io/delete-protection
  generateName: armada-01k97kv31xceemm2r-armada-6-gang-6265a585030333a5c
  name: armada-01k97kv31xceemm2r-armada-6-gang-6265a585030333a5cc4dqm
  namespace: gold-fran-userns
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Pod
    name: armada-01k97kv31xceemm2rvyw75gd2j-0
    uid: 3bc03a5a-f278-4ecf-b7ef-fddb2defc70a
  resourceVersion: "94776445"
  uid: 199795a7-7c48-4b04-86b6-25285f5c384a
spec:
  devices:
    config:
    - opaque:
        driver: compute-domain.nvidia.com
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          domainID: 97351e62-4f2f-408c-a9dd-214c7ad7199a
          kind: ComputeDomainChannelConfig
      requests:
      - channel
    requests:
    - allocationMode: ExactCount
      count: 1
      deviceClassName: compute-domain-default-channel.nvidia.com
      name: channel
status:
  allocation:
    devices:
      config:
      - opaque:
          driver: compute-domain.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            domainID: 97351e62-4f2f-408c-a9dd-214c7ad7199a
            kind: ComputeDomainChannelConfig
        requests:
        - channel
        source: FromClaim
      results:
      - adminAccess: null
        device: channel-0
        driver: compute-domain.nvidia.com
        pool: armada-07--gb200-1b9--node2.ospr-k8s-batch-p.diva-h3.c3.zone
        request: channel
    nodeSelector:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - armada-07--gb200-1b9--node2.ospr-k8s-batch-p.diva-h3.c3.zone
  reservedFor:
  - name: armada-01k97kv31xceemm2rvyw75gd2j-0
    resource: pods
    uid: 3bc03a5a-f278-4ecf-b7ef-fddb2defc70a

Thanks for your help,

Rob
