Hi,
We have an issue with "zombie" resource claims. These claims have a deletionTimestamp set but are never removed, because the resource.kubernetes.io/delete-protection finalizer is never cleared.
They appear both in workload namespaces and in the nvidia-dra-driver-gpu namespace, for example:
admin@rob1:~$ kc armada-06.ospr-k8s-batch-p.diva-h3 get resourceclaim -A | grep dele
gold-tom armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f469nr7 deleted,allocated,reserved 11h
gold-tom armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f46rnn2 deleted,allocated,reserved 11h
gold-tom armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4b4zsx deleted,allocated,reserved 11h
gold-tom armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4hjnll deleted,allocated,reserved 11h
gold-tom armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4jkrz8 deleted,allocated,reserved 11h
gold-tom armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4jv98p deleted,allocated,reserved 11h
gold-tom armada-01ka79yv28mhn580-armada-16-gang-6adfe845b489c0f4lds8p deleted,allocated,reserved 11h
nvidia-dra-driver-gpu armada-16-gang-6adfe845b489c0f441cb31206e-compute-domainwhcrm deleted,allocated,reserved 11h
nvidia-dra-driver-gpu armada-8-gang-b05b6e44496ca386828dc86a6f-compute-domainzrlqr deleted,allocated,reserved 2d1h
plat-jul armada-01ka387w42rh62z00-armada-8-gang-b05b6e44496ca38685zqtt deleted,allocated,reserved 2d1h
plat-jul armada-01ka387w42rh62z00-armada-8-gang-b05b6e44496ca3868bq6z9 deleted,allocated,reserved 2d1h
plat-jul armada-01ka387w42rh62z00-armada-8-gang-b05b6e44496ca3868hdwzc deleted,allocated,reserved 2d1h
plat-jul armada-01ka387w42rh62z00-armada-8-gang-b05b6e44496ca3868qlw2l deleted,allocated,reserved 2d1h
They stick around forever. Other jobs assigned to the node get stuck Pending because DRA still thinks the resource is taken.
They can be removed manually by patching away the finalizer as below, after which the stuck-Pending pods start running.
kubectl -n nvidia-dra-driver-gpu patch resourceclaim armada-4-gang-ae6867deeb5c09ae46a0f54fd7-compute-domainqb475 --type merge -p '{"metadata": {"finalizers": null}}'
resourceclaim.resource.k8s.io/armada-4-gang-ae6867deeb5c09ae46a0f54fd7-compute-domainqb475 patched
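For clusters with many stuck claims, the manual patch above can be wrapped in a loop. This is a hypothetical cleanup sketch, not something the driver provides; it assumes every ResourceClaim that carries a deletionTimestamp is safe to unblock, which should be verified before running it:

```shell
# CAUTION: hypothetical cleanup sketch. Lists all ResourceClaims that have a
# deletionTimestamp (i.e. are pending deletion) and strips their finalizers
# so the API server can remove them. Verify each claim is genuinely stuck
# before running this against a production cluster.
kubectl get resourceclaim -A \
  -o jsonpath='{range .items[?(@.metadata.deletionTimestamp)]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' \
| while read -r ns name; do
    kubectl -n "$ns" patch resourceclaim "$name" \
      --type merge -p '{"metadata": {"finalizers": null}}'
  done
```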
In one case, we saw a stuck ResourceClaim appear when an nvidia-imex pod in nvidia-dra-driver-gpu was stuck Terminating due to a broken node. After many hours the stuck pod was removed manually, but the ResourceClaim remained stuck as described above.
This has been observed with both of these versions of the nvidia-dra-driver-gpu:
25.3.1
25.8.0 (with IMEXDaemonsWithDNSNames: false, true has not been tested)
Full YAML for a stuck ResourceClaim is below.
admin@rob1:~/src/NVIDIA--k8s-dra-driver-gpu$ kc armada-07.ospr-k8s-batch-p.diva-h3 -n gold-fran-userns get -oyaml resourceclaim armada-01k97kv31xceemm2r-armada-6-gang-6265a585030333a5cc4dqm
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  annotations:
    resource.kubernetes.io/pod-claim-name: armada-6-gang-6265a585030333a5ce3864f01d7d65f2
  creationTimestamp: "2025-11-04T14:20:53Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2025-11-04T14:26:02Z"
  finalizers:
  - resource.kubernetes.io/delete-protection
  generateName: armada-01k97kv31xceemm2r-armada-6-gang-6265a585030333a5c
  name: armada-01k97kv31xceemm2r-armada-6-gang-6265a585030333a5cc4dqm
  namespace: gold-fran-userns
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Pod
    name: armada-01k97kv31xceemm2rvyw75gd2j-0
    uid: 3bc03a5a-f278-4ecf-b7ef-fddb2defc70a
  resourceVersion: "94776445"
  uid: 199795a7-7c48-4b04-86b6-25285f5c384a
spec:
  devices:
    config:
    - opaque:
        driver: compute-domain.nvidia.com
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          domainID: 97351e62-4f2f-408c-a9dd-214c7ad7199a
          kind: ComputeDomainChannelConfig
      requests:
      - channel
    requests:
    - allocationMode: ExactCount
      count: 1
      deviceClassName: compute-domain-default-channel.nvidia.com
      name: channel
status:
  allocation:
    devices:
      config:
      - opaque:
          driver: compute-domain.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            domainID: 97351e62-4f2f-408c-a9dd-214c7ad7199a
            kind: ComputeDomainChannelConfig
        requests:
        - channel
        source: FromClaim
      results:
      - adminAccess: null
        device: channel-0
        driver: compute-domain.nvidia.com
        pool: armada-07--gb200-1b9--node2.ospr-k8s-batch-p.diva-h3.c3.zone
        request: channel
    nodeSelector:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - armada-07--gb200-1b9--node2.ospr-k8s-batch-p.diva-h3.c3.zone
  reservedFor:
  - name: armada-01k97kv31xceemm2rvyw75gd2j-0
    resource: pods
    uid: 3bc03a5a-f278-4ecf-b7ef-fddb2defc70a
Thanks for your help,
Rob