
Conversation

@guptaNswati
Contributor

@guptaNswati guptaNswati commented Sep 6, 2025

Addressing #360 to add preliminary health check


@guptaNswati
Contributor Author

Current log:

Sending unhealthy notification for device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 due to event type 8
W0905 23:51:54.358343       1 driver.go:208] Received unhealthy notification for device: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
W0905 23:51:54.358366       1 device_state.go:619] Attempted to mark unknown device as unhealthy: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
I0905 23:51:54.358482       1 driver.go:235] Successfully republished resources after marking device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy

The resourceclaim status update is still broken.

Copilot AI left a comment

Pull Request Overview

Adds preliminary GPU health monitoring functionality to detect and handle unhealthy GPU devices in the NVIDIA DRA driver. The implementation listens for NVML events (XID errors, ECC errors) and removes unhealthy devices from the allocatable pool.

  • Introduces device health status tracking with Healthy/Unhealthy states
  • Implements NVML event-based health monitoring for GPU devices
  • Updates resource claim status to reflect device health conditions

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

File Description
cmd/gpu-kubelet-plugin/nvlib.go Initialize all devices with Healthy status
cmd/gpu-kubelet-plugin/driver.go Add device health monitor initialization and health notification handling
cmd/gpu-kubelet-plugin/device_state.go Add device health status updates and resource claim status reporting
cmd/gpu-kubelet-plugin/device_health.go New file implementing NVML event-based health monitoring
cmd/gpu-kubelet-plugin/allocatable.go Add health status field and methods to AllocatableDevice


if err != nil {
return nil, fmt.Errorf("start deviceHealthMonitor: %w", err)
}
klog.Info("[SWATI DEBUGS] Started device health monitor")
Copilot AI Sep 8, 2025

There's a typo in the log message: 'DEBUGS' should be 'DEBUG' to match the pattern used in other debug messages.

Suggested change
klog.Info("[SWATI DEBUGS] Started device health monitor")
klog.Info("[SWATI DEBUG] Started device health monitor")

var resourceSlice resourceslice.Slice
for _, dev := range d.state.allocatable {
if dev.IsHealthy() {
klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev)
Copilot AI Sep 8, 2025

There's a typo in the log message: 'resoureslice' should be 'resourceslice'.

Suggested change
klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev)
klog.Infof("[SWATI DEBUG] device is healthy, added to resourceslice: %v", dev)

}

// Republish updated resources
klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices")
Copilot AI Sep 8, 2025

There's a typo in the log message: 'rebulishing' should be 'republishing'.

Suggested change
klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices")
klog.Info("[SWATI DEBUG] republishing resourceslice with healthy devices")

Config: configapi.DefaultMigDeviceConfig(),
})

// Swati: Add resourceclaim status update
Copilot AI Sep 8, 2025

The comment should follow proper Go comment conventions and be more descriptive. Consider: '// Add resource claim status update to track device health'.

Suggested change
// Swati: Add resourceclaim status update
// Add resource claim status update to track device health.

Comment on lines 305 to 306
// Swati add health check
klog.Info("[SWATI DEBUG] adding device status")
Copilot AI Sep 8, 2025

The comment should follow proper Go comment conventions. Consider: '// Add health status to device allocation result'.

Suggested change
// Swati add health check
klog.Info("[SWATI DEBUG] adding device status")
// Add health status to device allocation result

Comment on lines 43 to 50
//defer nvdevlib.alwaysShutdown()

//klog.Info("[SWATI DEBUG] getting all devices..")
//allocatable, err := nvdevlib.enumerateAllPossibleDevices(config)
//if err != nil {
// return nil, fmt.Errorf("error enumerating all possible devices: %w", err)
//}

Copilot AI Sep 8, 2025

Commented-out code should be removed. If this code might be needed later, consider documenting why it's commented out or remove it entirely.

Suggested change
//defer nvdevlib.alwaysShutdown()
//klog.Info("[SWATI DEBUG] getting all devices..")
//allocatable, err := nvdevlib.enumerateAllPossibleDevices(config)
//if err != nil {
// return nil, fmt.Errorf("error enumerating all possible devices: %w", err)
//}

}

func newDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*deviceHealthMonitor, error) {
klog.Info("[SWATI DEBUG] initializing NVML..")
Copilot AI Sep 8, 2025

The log message has inconsistent punctuation. Either use 'NVML...' (with proper ellipsis) or 'NVML' (without trailing dots).

Suggested change
klog.Info("[SWATI DEBUG] initializing NVML..")
klog.Info("[SWATI DEBUG] initializing NVML")

@guptaNswati
Contributor Author

More logs after fixing the republish of the resourceslice when an unhealthy GPU is found:

kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-ndv47 -n nvidia-dra-driver-gpu  -c gpus | grep unhealth 
I0908 23:07:58.793308       1 device_health.go:173] Sending unhealthy notification for device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 due to event type 8
W0908 23:07:58.793342       1 driver.go:208] Received unhealthy notification for device: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
I0908 23:07:58.793371       1 device_state.go:636] Marked device:GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy
E0908 23:07:58.793381       1 driver.go:220] device:GPU-a4f34abc-7715-3560-dcea-7238b9611a45 with uuid:&{%!s(*main.GpuInfo=&{GPU-a4f34abc-7715-3560-dcea-7238b9611a45 0 0 false 102625181696 NVIDIA GH200 96GB HBM3 Nvidia Hopper 9.0 570.86.15 12.8 0009:01:00.0 {resource.kubernetes.io/pcieRoot {<nil> <nil> 0x4000328130 <nil>}} [0x40008965a0 0x40008965d0 0x4000896600 0x4000896630 0x4000896660 0x4000896690 0x4000896840 0x40008972f0 0x4000897530 0x4000897560]}) %!s(*main.MigDeviceInfo=<nil>) Unhealthy} is unhealthy
I0908 23:07:58.793531       1 driver.go:235] Successfully republished resources after marking device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy

 "ResourceSlice update" logger="ResourceSlice controller" slice="sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5" diff=<
	@@ -3,8 +3,8 @@
	   "name": "sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5",
	   "generateName": "sc-starwars-mab9-b00-gpu.nvidia.com-",
	   "uid": "b5a8727d-b8cd-4073-8817-d3e31147a8bd",
	-  "resourceVersion": "50777207",
	-  "generation": 1,
	+  "resourceVersion": "50777758",
	+  "generation": 2,
	   "creationTimestamp": "2025-09-08T23:05:30Z",
	   "ownerReferences": [
	    {
	@@ -20,7 +20,7 @@
	     "manager": "gpu-kubelet-plugin",
	     "operation": "Update",
	     "apiVersion": "resource.k8s.io/v1beta1",
	-    "time": "2025-09-08T23:05:30Z",
	+    "time": "2025-09-08T23:07:58Z",
	     "fieldsType": "FieldsV1",
	     "fieldsV1": {
	      "f:metadata": {

$ kubectl get resourceslice  sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5  -o yaml 
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  creationTimestamp: "2025-09-08T23:05:30Z"
  generateName: sc-starwars-mab9-b00-gpu.nvidia.com-
  generation: 2
  name: sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: sc-starwars-mab9-b00
    uid: 80ede971-5b44-4a12-a951-a1bebe79209d
  resourceVersion: "50777758"
  uid: b5a8727d-b8cd-4073-8817-d3e31147a8bd
spec:
  devices:
  - basic:
      attributes:
        architecture:
          string: Hopper
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 9.0.0
        cudaDriverVersion:
          version: 12.8.0
        driverVersion:
          version: 570.86.15
        index:
          int: 1
        minor:
          int: 1
        pcieBusID:
          string: "0019:01:00.0"
        productName:
          string: NVIDIA GH200 96GB HBM3
        resource.kubernetes.io/pcieRoot:
          string: pci0019:00
        type:
          string: gpu
        uuid:
          string: GPU-9e6df7cb-64d4-5e53-2b1d-cee9e58aeb94
      capacity:
        memory:
          value: 97871Mi
    name: gpu-1
  driver: gpu.nvidia.com
  nodeName: sc-starwars-mab9-b00
  pool:
    generation: 1
    name: sc-starwars-mab9-b00
    resourceSliceCount: 1

@guptaNswati
Contributor Author

guptaNswati commented Sep 8, 2025

Need to fix the resourceclaim status update: it is not using the right client API.

Device gpu-0 is healthy, marking as ready
E0908 23:06:44.085161       1 device_state.go:346] failed to update status for claim gpu-test1/pod1-gpu-zc6s4: not implemented in k8s.io/dynamic-resource-allocation/client

failed to update status for claim gpu-test1/pod2-gpu-q45rg: not implemented in k8s.io/dynamic-resource-allocation/client

@klueska
Collaborator

klueska commented Sep 9, 2025

Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed but a device taint should be added for it so the scheduler doesn't schedule it.

@klueska
Collaborator

klueska commented Sep 9, 2025

Unless something's changed recently that I'm not aware of, there is no ResourceSlice status. We only have a spec so far, as we haven't had a need to add a status yet.

@guptaNswati
Contributor Author

Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed but a device taint should be added for it so the scheduler doesn't schedule it.

Yes. This is just to test the e2e flow (report any health events; an example action is for the driver to republish the slice) and to see if I have set up everything correctly.

@guptaNswati
Contributor Author

Unless something's changed recently that I'm not aware of, there is no ResourceSlice status. We only have a spec so far as we haven't had a need to add a status yet.

Not the resourceslice, but updating the resourceclaim status, similar to this: https://github.com/google/dranet/pull/78/files#diff-e8a7e777d80a14b455bdbf7aae3f28ad8082ffa0a06579e11cc1af741b5f98f7R266

@guptaNswati
Contributor Author

guptaNswati commented Sep 9, 2025

Got the resourceclaim status to update:

 Device gpu-1 is healthy, marking as ready
I0909 21:53:04.772855       1 round_trippers.go:632] "Response" logger="dra" requestID=7 method="/k8s.io.kubelet.pkg.apis.dra.v1beta1.DRAPlugin/NodePrepareResources" verb="PATCH" url="https://x.x.x.x:443/apis/resource.k8s.io/v1beta1/namespaces/gpu-test1/resourceclaims/pod1-gpu-rrkx5/status?fieldManager=gpu.nvidia.com&force=true" status="200 OK" milliseconds=4
I0909 21:53:04.772960       1 device_state.go:348] updated device status for claim gpu-test1/pod1-gpu-rrkx5

  devices:
  - conditions:
    - lastTransitionTime: "2025-09-09T21:53:04Z"
      message: Device is healthy and ready
      reason: Healthy
      status: "True"
      type: Ready
    data: null
    device: gpu-1
    driver: gpu.nvidia.com
    pool: sc-starwars-mab9-b00

Signed-off-by: Swati Gupta <[email protected]>
@klueska klueska added the feature issue/PR that proposes a new feature or functionality label Sep 11, 2025
Signed-off-by: Swati Gupta <[email protected]>
@guptaNswati
Contributor Author

guptaNswati commented Sep 19, 2025

Updated the action on a health event: update the device condition to unhealthy in the resourceclaim status.

$ kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-m8xsz -n nvidia-dra-driver-gpu -c gpus

1 device_health.go:167] Processing event {Device:{Handle:0xee82742dfef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
W0919 02:59:06.452857       1 device_health.go:170] Critical XID error detected on device: {Device:{Handle:0xee82742dfef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I0919 02:59:06.452874       1 device_health.go:200] Sending unhealthy notification for device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 due to event type 8
W0919 02:59:06.452905       1 driver.go:212] Received unhealthy notification for device: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
I0919 02:59:06.452918       1 device_state.go:617] Marked device:GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy
I0919 02:59:06.453547       1 driver.go:298] found matching device to claim: gpu-0
I0919 02:59:06.453556       1 driver.go:312] Found it! Return the result object: gpu-0 and the claim UID: 590e5164-7511-418d-8b8b-77ae0e414dc6
I0919 02:59:06.456314       1 round_trippers.go:632] "Response" verb="GET" url="https://10.96.0.1:443/apis/resource.k8s.io/v1beta1/resourceclaims" status="200 OK" milliseconds=2
I0919 02:59:06.456538       1 driver.go:335] found ResourceClaim with UID 590e5164-7511-418d-8b8b-77ae0e414dc6 not found
I0919 02:59:06.456548       1 driver.go:345] Applying 'Ready=False' condition for device 'gpu-0' in ResourceClaim 'gpu-test1/pod2-gpu-l8rrx'
I0919 02:59:06.460688       1 round_trippers.go:632] "Response" verb="PATCH" url="https://x.x.x.x:443/apis/resource.k8s.io/v1beta1/namespaces/gpu-test1/resourceclaims/pod2-gpu-l8rrx/status?fieldManager=gpu.nvidia.com&force=true" status="200 OK" milliseconds=3

$ kubectl get resourceclaim -n gpu-test1  -o yaml | grep -A 8 condition
    - conditions:
      - lastTransitionTime: "2025-09-09T21:53:04Z"
        message: Device is healthy and ready
        reason: Healthy
        status: "True"
        type: Ready
      data: null
      device: gpu-1
      driver: gpu.nvidia.com
--
    - conditions:
      - lastTransitionTime: "2025-09-19T02:59:06Z"
        message: Device gpu-0 has become unhealthy.
        reason: DeviceUnhealthy
        status: "False"
        type: Ready
      data: null
      device: gpu-0
      driver: gpu.nvidia.com

@guptaNswati guptaNswati changed the title Draft: Gpu health check Gpu health check Sep 19, 2025
@guptaNswati
Contributor Author

@ArangoGutierrez @klueska can I get a prelim review on this? There are still some tasks left, but it's in a working state.

@klueska klueska moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Sep 23, 2025
@guptaNswati
Contributor Author

Test of a skipped XID:

$ helm upgrade nvidia-dra-driver-gpu  deployments/helm/nvidia-dra-driver-gpu --set featureGates.DeviceHealthCheck=true --set kubeletPlugin.gpus.additionalXidsToIgnore="43"

kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-qzplg  -n nvidia-dra-driver-gpu -c gpus | grep event
I0924 18:24:31.947121       1 device_health.go:58] creating NVML events for device health monitor
I0924 18:24:31.947143       1 device_health.go:68] registering NVML events for device health monitor
I0924 18:28:04.610817       1 device_health.go:175] Skipping event {Device:{Handle:0xe44bad2ffef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}

Signed-off-by: Swati Gupta <[email protected]>
Comment on lines +583 to +590
func (s *DeviceState) MarkDeviceUnhealthy(device *AllocatableDevice) {
// SWATI: check if a mig device is marked properly
s.Lock()
defer s.Unlock()

device.Health = Unhealthy
klog.Infof("Marked device:%s unhealthy", device.GetUUID())
}
Collaborator

Can this take health as a parameter so it can be reused once we have the ability to bring a device back to healthy?

Comment on lines +252 to +272
SearchDeviceGroups:
for _, group := range preparedClaim.PreparedDevices {
for _, device := range group.Devices {
var currentUUID string
var currentDeviceName string

if device.Gpu != nil {
currentUUID = device.Gpu.Info.UUID
currentDeviceName = device.Gpu.Device.DeviceName
} else if device.Mig != nil {
currentUUID = device.Mig.Info.UUID
currentDeviceName = device.Mig.Device.DeviceName
}

if currentUUID == unhealthyDeviceUUID {
klog.V(6).Infof("found matching device: %v for claim: %s", currentDeviceName, claimUID)
matchingDeviceName = currentDeviceName
break SearchDeviceGroups
}
}
}
Collaborator

You can avoid the labeled break by putting this in a function and returning at the point that you find what you are looking for.

}

func (d *driver) findClaimByUID(ctx context.Context, claimUID types.UID) (*v1beta1.ResourceClaim, error) {
claimList, err := d.state.config.clientsets.Core.ResourceV1beta1().ResourceClaims("").List(ctx, metav1.ListOptions{})
Collaborator

This needs to work not just for the v1beta1 api, but all of v1, v1beta1, and v1beta2. We have helpers to do that in the staging repo.

Contributor Author

Yes. I was not sure how to do that. Link?

}

func (d *driver) findClaimByUID(ctx context.Context, claimUID types.UID) (*v1beta1.ResourceClaim, error) {
claimList, err := d.state.config.clientsets.Core.ResourceV1beta1().ResourceClaims("").List(ctx, metav1.ListOptions{})
Collaborator

@klueska klueska Oct 7, 2025

I'm not sure how I feel about listing all Resource claims and then searching through them for the matching UID. It feels like we should rather be storing the claim name / namespace in the checkpoint so we can pull it directly and then assert that it has the correct UID.

Contributor Author

Yes, this is not efficient. I didn't want to make changes to the existing code, as we are not sure if we want to take this action or not. This is more of a sample action.

Contributor Author

As discussed offline, we may update the checkpoint to add the claim name/namespace for a faster lookup during the claim status update for an unhealthy device, so that there is no need to iterate over all claims to find the needed one.

For another use case, it's already getting updated for computedomain.

Comment on lines +231 to +235
if err := d.state.applyClaimDeviceStatuses(ctx, claim.Namespace, claim.Name, ds); err != nil {
klog.Errorf("Failed to update status for claim %s/%s: %v", claim.Namespace, claim.Name, err)
} else {
klog.V(6).Infof("applied unhealthy device status to claim %s/%s", claim.Namespace, claim.Name)
}
Collaborator

I don't think we should just give up here. It's likely we may hit a conflict when writing, and we need to try again. Instead, we should push a task to the workqueue that will keep retrying until it succeeds.


claim, err := d.findClaimByUID(ctx, types.UID(claimUID))
if err != nil {
klog.Errorf("Failed to find ResourceClaim object for UID %s: %v", claimUID, err)
Collaborator

This isn't necessarily an error.

// Update allocated device health status in a given claim
result, claimUID, err := d.findDeviceResultAndClaimUID(uuid)
if err != nil {
klog.Errorf("Device %s is unhealthy, but no associated claim was found: %v", uuid, err)
Collaborator

This is not necessarily an error.

klog.V(6).Infof("Adding devices health status to claim %s/%s", claim.Namespace, claim.Name)
if err := s.applyClaimDeviceStatuses(ctx, claim.Namespace, claim.Name, deviceStatuses...); err != nil {
klog.Warningf("Failed to update devices status for claim %s/%s: %v", claim.Namespace, claim.Name, err)
}
Collaborator

High level question -- who is going to be reading this status from the ResourceClaim and doing anything with it? I know I suggested looking at DRANet and seeing how they were reporting their health status, but does it even make sense to do this in our driver? Why does DRANet need to do it?

/cc @aojea

Contributor

DRANET uses the standardized fields for network information in the status, as does cni-dra-driver:

kubernetes/enhancements#4817

Using standard data for reporting this information in the status allows us to build tooling and applications on top, which is especially useful for some use cases of multi-networking or monitoring.

@klueska
Collaborator

klueska commented Oct 7, 2025

Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed but a device taint should be added for it so the scheduler doesn't schedule it.

As discussed in the team meeting, we need to bring this back for the initial release of these health checks. The "right" way to mark the GPUs as unhealthy will be with device taints, but those are still an alpha feature, and we need some way to mark them as unhealthy / unschedulable in the interim.

@klueska klueska modified the milestones: v25.8.1, v25.12.0 Oct 8, 2025
d.state.MarkDeviceUnhealthy(device)

// Update allocated device health status in a given claim
result, claimUID, err := d.findDeviceResultAndClaimUID(uuid)
Collaborator

Does d.findDeviceResultAndClaimUID(uuid) only operate on local file system state (checkpoint data)? Or is any networking interaction or IPC involved?

I would find it helpful to clarify that in the code comment right above that call, or to even reflect that in the method name.

Contributor Author

Local checkpoint data.

continue
}

claim, err := d.findClaimByUID(ctx, types.UID(claimUID))
Collaborator

Could we add a code comment here explaining why at this point it is relevant to look up the full claim object?

That code comment would clarify the importance of having that data at all, and having it fresh. It would also clarify:

  • how much of a problem it is if we do not find the claim
  • how much of a problem it is to use an outdated version of the claim object

It is important to have this clarified, and maybe we want to discuss that goal statement (specification) before getting too deep into the weeds of reviewing the error-handling / retrying behavior implementation.

@guptaNswati
Contributor Author

Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed but a device taint should be added for it so the scheduler doesn't schedule it.

As discussed in the team meeting, we need to bring this back for the initial release of these health checks. The "right" way to mark the GPUs as unhealthy will be with device taints, but those are still an alpha feature, and we need some way to mark them as unhealthy / unschedulable in the interim.

For this to work, we also need reconciliation to bring them back once there is remediation and the GPU is healthy again.

@guptaNswati
Contributor Author

Quick MIG test logs:

I1014 23:30:24.276796       1 device_health.go:179] Processing event {Device:{Handle:0xfe835631fef0} EventType:8 EventData:43 GpuInstanceId:7 ComputeInstanceId:0}
I1014 23:30:24.276920       1 device_health.go:192] Event for mig device: &{<nil> 0x40005a3900 Healthy}
I1014 23:30:24.276949       1 device_health.go:202] Sending unhealthy notification for device MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 due to event type: 8 and event data: 43
W1014 23:30:24.276999       1 driver.go:221] Received unhealthy notification for device: MIG-4d806f22-346a-5a1d-ac01-86b505cdf485
I1014 23:30:24.277025       1 device_state.go:590] Marked device:MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 unhealthy
E1014 23:30:24.277488       1 driver.go:229] Device MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 is unhealthy, but no associated claim was found: unable to find device result and claim uid for MIG-4d806f22-346a-5a1d-ac01-86b505cdf485

Signed-off-by: Swati Gupta <[email protected]>
@guptaNswati
Contributor Author

Closing this in favor of #689

@github-project-automation github-project-automation bot moved this from In Progress to Closed in Planning Board: k8s-dra-driver-gpu Oct 17, 2025