Conversation

@guptaNswati (Contributor) commented Oct 17, 2025

Addresses #360 by adding a preliminary health check similar to the one in https://github.com/NVIDIA/k8s-device-plugin.

  • Clean follow-up of GPU health check #545
  • Addresses review comments
  • Republishes the ResourceSlice on a health event
  • Feature gate: --set featureGates.DeviceHealthCheck=true
  • XIDs: --set kubeletPlugin.gpus.additionalXidsToIgnore="n1,n2"

Test logs:

I1017 20:16:19.799738       1 device_health.go:179] Processing event {Device:{Handle:0xe4aea6b2fef0} EventType:8 EventData:43 GpuInstanceId:7 ComputeInstanceId:0}
I1017 20:16:19.799821       1 device_health.go:192] Event for mig device: &{<nil> 0x40006a0070 Healthy}
I1017 20:16:19.799843       1 device_health.go:202] Sending unhealthy notification for device MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 due to event type: 8 and event data: 43
W1017 20:16:19.799870       1 driver.go:219] Received unhealthy notification for device: MIG-4d806f22-346a-5a1d-ac01-86b505cdf485
I1017 20:16:19.799884       1 device_state.go:558] Update device sattus:MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 healthstatus
I1017 20:16:19.799891       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c150 Healthy}
I1017 20:16:19.799955       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0ee0 Healthy}
I1017 20:16:19.799966       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0f50 Healthy}
I1017 20:16:19.799974       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0230 Healthy}
I1017 20:16:19.799983       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a02a0 Healthy}
I1017 20:16:19.799992       1 driver.go:230] Device is healthy, added to resoureslice: &{0x40002ae000 <nil> Healthy}
I1017 20:16:19.800000       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c0e0 Healthy}
W1017 20:16:19.800009       1 driver.go:233] Device:MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 with uuid:&{%!s(*main.GpuInfo=<nil>) %!s(*main.MigDeviceInfo=&{MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 1g.12gb 0x40002503c0 0x400049c130 0x4000610090 0x400045ce40 0x4000610150 0x40004840f0 0009:01:00.0 0x40002fbf50}) Unhealthy} is unhealthy
I1017 20:16:19.800021       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a00e0 Healthy}
I1017 20:16:19.800031       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c1c0 Healthy}
I1017 20:16:19.800043       1 driver.go:230] Device is healthy, added to resoureslice: &{0x40002503c0 <nil> Healthy}
I1017 20:16:19.800050       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c000 Healthy}
I1017 20:16:19.800056       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0310 Healthy}
I1017 20:16:19.800063       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0150 Healthy}
I1017 20:16:19.800069       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c070 Healthy}
I1017 20:16:19.800076       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a01c0 Healthy}
I1017 20:16:19.800084       1 driver.go:237] [Rebulishing resourceslice with healthy devices
I1017 20:16:19.800142       1 driver.go:247] Successfully republished resources without unhealthy device MIG-4d806f22-346a-5a1d-ac01-86b505cdf485:
I1017 20:16:19.800178       1 resourceslicecontroller.go:647] "Existing slices" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" obsolete=[] current=["sc-starwars-mab9-b00-gpu.nvidia.com-kmcts"]
I1017 20:16:19.800209       1 resourceslicecontroller.go:724] "Need to update slice" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" slice="sc-starwars-mab9-b00-gpu.nvidia.com-kmcts" matchIndex=0
I1017 20:16:19.800225       1 resourceslicecontroller.go:727] "Completed comparison" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" numObsolete=0 numMatchedSlices=1 numChangedMatchedSlices=1 numNewSlices=0
I1017 20:16:19.800230       1 resourceslicecontroller.go:743] "Kept generation because at most one update API call is necessary" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" generation=1
I1017 20:16:19.805795       1 round_trippers.go:632] "Response" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" verb="PUT" url="https://10.96.0.1:443/apis/resource.k8s.io/v1beta1/resourceslices/sc-starwars-mab9-b00-gpu.nvidia.com-kmcts" status="200 OK" milliseconds=5
I1017 20:16:19.806290       1 resourceslicecontroller.go:779] "Updated existing resource slice" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" slice="sc-starwars-mab9-b00-gpu.nvidia.com-kmcts"
I1017 20:16:19.807922       1 resourceslicecontroller.go:500] "ResourceSlice update" logger="ResourceSlice controller" slice="sc-starwars-mab9-b00-gpu.nvidia.com-kmcts" diff=<
	@@ -3,8 +3,8 @@
	   "name": "sc-starwars-mab9-b00-gpu.nvidia.com-kmcts",
	   "generateName": "sc-starwars-mab9-b00-gpu.nvidia.com-",
	   "uid": "7184f664-55c1-412d-bd99-4b46e7c23846",
	-  "resourceVersion": "59000652",
	-  "generation": 1,
	+  "resourceVersion": "59001011",
	+  "generation": 2,
	   "creationTimestamp": "2025-10-17T20:14:42Z",
	   "ownerReferences": [
	    {
	@@ -20,7 +20,7 @@
	     "manager": "gpu-kubelet-plugin",
	     "operation": "Update",
	     "apiVersion": "resource.k8s.io/v1beta1",
	-    "time": "2025-10-17T20:14:42Z",
	+    "time": "2025-10-17T20:16:19Z
	.....
	.....
	
		     "name": "gpu-1",
	     "attributes": {
	      "architecture": {
	@@ -161,7 +496,7 @@
	     }
	    },
	    {
	-    "name": "gpu-0-mig-19-0-1",
	+    "name": "gpu-0-mig-19-1-1",
	     "attributes": {
	      "architecture": {
	       "string": "Hopper"
	@@ -197,56 +532,230 @@
	       "string": "mig"
	      },
	      "uuid": {
	-      "string": "MIG-4d806f22-346a-5a1d-ac01-86b505cdf485"
	-     }
	-    },

TL;DR: this device had an event and was not added back to the ResourceSlice: MIG-4d806f22-346a-5a1d-ac01-86b505cdf485

The device is picked back up when the driver is restarted.

Signed-off-by: Swati Gupta <[email protected]>
@copy-pr-bot (bot) commented Oct 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@guptaNswati guptaNswati mentioned this pull request Oct 17, 2025
Signed-off-by: Swati Gupta <[email protected]>
klog.Infof("Processing event %+v", event)
eventUUID, ret := event.Device.GetUUID()
if ret != nvml.SUCCESS {
klog.Infof("Failed to determine uuid for event %v: %v; Marking all devices as unhealthy.", event, ret)
Contributor

This seems a bit aggressive: marking all devices as unhealthy on one invalid event. Should we log this as an error and continue the watch? cc @klueska

@guptaNswati (Contributor Author) commented Oct 21, 2025

@jgehrcke (Collaborator) commented Oct 25, 2025

I'd also say we should log an error and otherwise proceed. Even if what you've shown here is currently done in the device plugin.

By the way, this would have been a perfect opportunity for a better code comment in the legacy code:

[image: screenshot of the legacy code comment]

No blame, no emotions -- but this code comment does not add information in addition to the code. The interesting bit would be if there is a specific, non-obvious reason / relevance for this style of treatment.

For example, I wonder if this code was introduced to fix a bug. I wonder if it is even ever exercised.

The way it's written and with the git blame history, it seems like it was potentially added initially (defensively) and may never have been exercised in production.

Signed-off-by: Swati Gupta <[email protected]>
Signed-off-by: Swati Gupta <[email protected]>
}

if err := d.pluginhelper.PublishResources(ctx, resources); err != nil {
klog.Errorf("Failed to publish resources after device health status update: %v", err)
@jgehrcke (Collaborator) commented Oct 23, 2025

Naturally, I wonder why this error is only handled by logging a message. This might be the correct (or currently best) decision. But please walk the reader of the code through the arguments for ending up with that decision, using a brief code comment.

I'd like to understand thoughts here in the lines of "do not retry, because" or "this is implicitly retried later, because" or "we could crash the plugin here, but" or "the old resource slice state remains published, which is good enough", and so on. I am sure you've thought through all this.

None of this is obvious to the reader of the code, and I'd really love to have some help here to convince myself that this is the right way to handle this error.

(as always, it will pay off to document the current argumentation for our future selves, even if it's incomplete or so)

Contributor Author (@guptaNswati)

Retrying makes sense. And if the retries also fail, it should be a fatal error, since it means the existing ResourceSlice is outdated.

klog.Warningf("Received unhealthy notification for device: %s", uuid)

if !device.IsHealthy() {
klog.V(6).Infof("Device: %s is aleady marked unhealthy. Skip republishing resourceslice", uuid)
Collaborator

In practice, how often could we see a log message like this?

What I see here right now: we can get the d.deviceHealthMonitor.Unhealthy() event multiple times, even if we had already processed that device before. I wonder how often we should expect that to happen.

Contributor Author (@guptaNswati)

There can be event bursts. @lalitadithya showed me logs of device-plugin XID errors in a cluster which clearly showed the same event logged multiple times.

@lalitadithya, is it possible to share the log here?

@elezar (Member) left a comment

Thanks for the patience in waiting on a review @guptaNswati.

release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
if err != nil {
klog.Errorf("error acquiring prep/unprep lock for health status update: %v", err)
continue
Member

So this means that we don't mark the device as unhealthy in this case. Is that the intended behaviour?

Contributor Author (@guptaNswati)

We should not abort on the event. We should probably just log the error and update the device status anyway.

klog.V(6).Info("Successfully republished resources without unhealthy device")
}

release()
Member

If we factor this logic into a function, then we could use a deferred call to release() after taking the lock. This may be less error prone if we do ever add code paths that return from this logic.

Comment on lines +151 to +157
&cli.StringFlag{
Name: "additional-xids-to-ignore",
Usage: "A comma-separated list of additional XIDs to ignore.",
Value: "",
Destination: &flags.additionalXidsToIgnore,
EnvVars: []string{"ADDITIONAL_XIDs_TO_IGNORE"},
},
Member

In NVIDIA/k8s-device-plugin#1443 we added a list of EXPLICIT XIDs to consider fatal. This allows a user to:

  1. Specify ignored XIDs (including all)
  2. Specify SPECIFIC XIDs that are considered fatal (including all).

The important thing here is that it allows users to override the list of hard-coded XIDs that we currently track.

Contributor Author (@guptaNswati)

I am aware of this and was planning to do it as a follow-up, since it only recently got merged.

}
m.eventSet = eventSet

m.uuidToDeviceMap = getUUIDToDeviceMap(allocatable)
Member

Question: Does allocatable change at all, or is it constant for the lifetime of the plugin?

Contributor Author (@guptaNswati)

We do update the health status of an allocatable device when we get an unhealthy notification.

Member

Sorry, that wasn't clear. Does the content of allocatable change in any way that would invalidate the map we construct here, meaning that it would need to be reconstructed?

Contributor Author (@guptaNswati)

No, the only changing content of an allocatable device is its health status, and that won't impact the map. This map ([uuid] = device) is constructed at the very beginning, when the plugin is started.

We iterate over it to register events on each device (currently, we don't check any status here; we may do that in the future for remediation) and to send unhealthy notifications.

continue
}

release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
Collaborator

Oh! Acquiring this lock here is a big decision.

Here, I really expect a concise / precise code comment explaining convincingly

  • why this lock must be acquired
  • how we guarantee that release() is always called

Maybe start by explaining what you think will go wrong when we do not acquire this lock here at this point.

Contributor Author (@guptaNswati)

This lock will prevent an unhealthy device from being allocated in a simultaneous NodePrepare() call.

Contributor Author (@guptaNswati)

I need to double-check that the lock is released on any failure.

@jgehrcke (Collaborator) commented Oct 24, 2025

We should be mindful of acquiring this lock. Let's do it only for a strong reason. When we introduced that lock we named it prepare/unprepare lock because it's meant for that purpose.

Maybe we should use it here, too -- but let's pretty please thoroughly identify that strong reason, and put it into a few English sentences that are convincing.

I am not yet satisfied here by our arguments. We need to discuss the alternatives considered; I need more help, please, to understand why this is the correct approach (I really mean that -- it's not that I know what we should do -- but I sense that we don't, as a collective, yet understand what we really want to do here).

@jgehrcke (Collaborator) commented Oct 24, 2025

I am thinking about the reason that you vaguely describe:

This lock will prevent an unhealthy device from being allocated in a simultaneous NodePrepare() call.

And I wonder: will it?

Can you describe a sequence of events where acquiring the lock would actually make that incoming nodePrepareResources() call not allocate an unhealthy device?

Here, we only update the ResourceSlice to un-announce any unhealthy device, right?

The moment we're done with that, we release the lock and the unchanged nodePrepareResources() call (that was waiting for us, hanging in lock acquisition) proceeds, trying to get what it wanted to get anyway.

When the kubelet already wants us to allocate an unhealthy device, then updating the ResourceSlice won't undo that (that gun is "fired" so to say), at least not reliably. Is that correct?

Let's say kubelet sends a PrepareResourceClaims() call our way and that may potentially result in allocating an unhealthy device (because the unhealthy notification arrived very recently).

Then I believe if we want a safe method to prevent that from happening we need to have logic within nodePrepareResource(). Does that make sense? (I quite literally don't know yet, you all have thought more about this than I did).

Specifically, I am wondering: do we need to call this new IsHealthy() somewhere in

func (s *DeviceState) prepareDevices(ctx context.Context, claim *resourceapi.ResourceClaim) (PreparedDevices, error) {
?

We can always actively fail a prepareDevices() if the only matching device turns out to be unhealthy right before we would have allocated it.

Maybe this is already similar?

		device, exists := s.allocatable[result.Device]
		if !exists {
			return nil, fmt.Errorf("requested device is not allocatable: %v", result.Device)
		}

I might be completely off in what I say here. The point is that I need to convince myself that really we know what we're doing here and that we only acquire the PU lock if we absolutely have to -- because it has potentially devastating downsides to do it unnecessarily often.

This discussion is really important to align on, and I'd love for you all to help me understand why what we're doing here is the right thing.

Contributor Author (@guptaNswati)

Good point. There has to be a check in prepareDevices() to actually fail the allocation for an unhealthy device. But we still need the lock, since two things happen on an unhealthy event:

// device health status is updated
- d.state.UpdateDeviceHealthStatus(device, Unhealthy) => this update is imp for above check to be reliable
// ResourceSlice is republished
- if err := d.pluginhelper.PublishResources(ctx, resources); err != nil => for future allocations

Acquiring the nodePrepare lock will make sure we are not simultaneously updating the health status and allocating the same device.

I have updated device_state.go to check the device health status before proceeding with the allocation:

if featuregates.Enabled(featuregates.DeviceHealthCheck) {
	if device.Health == Unhealthy {
		return nil, fmt.Errorf("requested device is not healthy: %v", result.Device)
	}
}

Collaborator

Acquiring the nodePrepare lock will make sure we are not simultaneously updating the health status and allocating the same device.

But, does it? I might just not see it -- which sequence of events did you imagine, maybe?

Let me try to re-frame the problem space that I believe you are thinking about when you say "simultaneously not updating the health status and also allocating the same device".

You want to make sure that we don't allocate a device that's knowingly unhealthy. Does that sound about right?

I would agree -- let's try to do that :)

What do you think about the following mental model?

  1. Let's agree that this is generally a best-effort problem space -- between the device becoming unhealthy and us knowing there is unpredictable amount of time passing.
  2. Let's agree that this is an event propagation problem space.

I think we can also say:

  • Deep within func (s *DeviceState) Prepare() we must know, as early as possible, when a device is unhealthy. That's our final event consumer.
  • Event producer and event propagation pipeline are orthogonal to that.

The best we can do here is that we perform event propagation at minimal latency, towards the consumer.

Detour on latency

Because I find it interesting.

Between a GPU actually becoming unhealthy and us calling UpdateDeviceHealthStatus() as little time as possible should pass.

That is something that we can do and should do in this PR: minimize the fraction of event propagation latency that we control here.

Zooming out, this is always going to be a best effort strategy. Right now we seem to subscribe to GPU events more or less directly (but already now the event propagates through layers: there's the physical device, there's NVML, and then there's our process, and other layers that I am not even aware of). In the future, with NVSentinel, we're talking about event propagation across even more components.

Generally, there's a timeline attached that is unpredictable and we want to make sure we minimize latency at all steps. Here is one way to maybe think about that timeline:

T_1) a GPU actually becoming unhealthy (the 'physical' event)
T_2) us detecting it in component A
T_3) emitting an event in component A towards component B
T_4) potential-black-box-event-propagation -> after all emitted towards our GPU kubelet plugin
T_5) responding to that incoming event in our GPU kubelet plugin

Let's agree on the following: there's always a chance that we call func (s *DeviceState) Prepare() for a device after T_1 and before T_5.

We may just want to make sure we respond to the unhealthy event in the moment we receive it. We want to make sure we propagate to all its consumers ASAP.

That propagation itself does not need to be lock-protected; it just must happen fast.

Misc

Tangential, but potentially a helpful perspective: the "protected" data structure here (for now) is just device.Health and it's only mutated from within UpdateDeviceHealthStatus().

Then, also tangential, I notice that there already is a bit of a synchronization in your current patch proposal: UpdateDeviceHealthStatus() acquires the mutex on the DeviceState singleton:

func (s *DeviceState) UpdateDeviceHealthStatus(device *AllocatableDevice, hs HealthStatus) {
	s.Lock()
	defer s.Unlock()
...

We acquire the same mutex in func (s *DeviceState) Prepare().

Contributor Author (@guptaNswati)

This is very descriptive. Thank you for the effort, JP. Yes, this is the flow of events I imagined when I was convinced that we needed to hold the lock when updating the device health status and republishing the ResourceSlice.

unhealthy event -> lock to prevent any other operation on the device (mark device unhealthy + republish RS) -> unlock -> device unhealthy = my logic

But as you pointed out, I already acquire the lock when updating the device status, so the above lock is not really needed for the republish, and this is all best effort anyway in avoiding a potential race between T_1 and T_5.

release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
if err != nil {
klog.Errorf("error acquiring prep/unprep lock for health status update: %v", err)
continue
@jgehrcke (Collaborator) commented Oct 23, 2025

As far as I understand, with continue we abort processing this event.

As a code reader, I want to get help here from a code comment -- to understand why it is okay to abort processing that. Will we re-process it later? How much later?

defer s.Unlock()

device.Health = healthstatus
klog.Infof("Update device sattus:%s healthstatus", device.UUID())
@jgehrcke (Collaborator) commented Oct 23, 2025

typos, spaces etc.


Can you please explain / give an impression of how often this message would be logged? What actions/events will trigger this message to be logged?

When we understand that, let's have a brief think about a suitable verbosity level.

One of the questions that I have here: is this only logged for a health flip? Or could this also be logged for healthy->healthy? Should we have a different message/level depending on the state transition?

Contributor Author (@guptaNswati)

Right now, it's only logged when a healthy device becomes unhealthy. But in the future, when we have remediation, it will also log the unhealthy->healthy transition.

{
Default: false,
PreRelease: featuregate.Alpha,
Version: version.MajorMinor(25, 8),
Collaborator

I really don't know -- why is 25, 8 what we want here?

Contributor Author (@guptaNswati)

This should be (25, 12), based on the December release?

case <-ctx.Done():
klog.V(6).Info("Stop processing device health notifications")
return
case device, ok := <-d.deviceHealthMonitor.Unhealthy():
Collaborator

What do you think about wrapping the logic in this case in a function and then call that function here. I think that would greatly help readability. It may also help with calling release() more reliably (if needed -- see discussion here).

@guptaNswati (Contributor Author)

Thanks for the patience in waiting on a review @guptaNswati.

Thanks to you for the review @elezar.

Thanks to @jgehrcke also.

The PR looks more lively :-D

@guptaNswati (Contributor Author)

@elezar @jgehrcke I incorporated some of the suggestions, and skipped some that I didn't think were critical. Please review again and let me know if something is blocking.

I will come back to it: #689 (comment)

  • Add a retry
  • Make sure release() is called on all failure and success paths
  • Make errors fatal

@jgehrcke (Collaborator)

Thanks!

@elezar @jgehrcke I incorporated some of the suggestions, and skipped some that I didn't think were critical. Please review again and let me know if something is blocking.

Let's for now assume that some things are blocking. Let us please carefully and mindfully drive each discussion thread to completion. I would really appreciate if you can drive that and make it easy for all of us by going through the questions with point-by-point replies. 📜 This is quite a bit of communication effort, but it will absolutely pay off when bringing the new code into production environments.

klog.V(6).Info("Stopping event-driven GPU health monitor...")
return
default:
event, ret := m.eventSet.Wait(5000)
Collaborator

I needed to look this up to understand what it's doing.

Ref docs are here: https://docs.nvidia.com/deploy/nvml-api/group__nvmlEvents.html#group__nvmlEvents_1g9714b0ca9a34c7a7780f87fee16b205c

The argument is a timeout in milliseconds.

Errors to be handled:

[image: screenshot of the documented return values]

continue
}
if ret != nvml.SUCCESS {
klog.Infof("Error waiting for event: %v; Marking all devices as unhealthy", ret)
Collaborator

This does not seem right.

Let us specifically handle the NVML_ERROR_GPU_IS_LOST case, and perform this 'nuke option' only then. Then we can also have a more precise log message (emitted on error level).

Contributor Author (@guptaNswati)

Why only NVML_ERROR_GPU_IS_LOST? If we are not just checking ret != NVML_SUCCESS, then TIMEOUT and UNKNOWN also seem important.

If you look at the return values of all the event methods, these all seem to be common error types. I can write a helper for the error type.

Contributor Author (@guptaNswati)

Something like https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/return.go#L42

But then do we mark the device unhealthy in each case?

Contributor Author (@guptaNswati)

While this would allow for proper error handling, it would also deviate from how it's done in the device plugin. I can take it up as a follow-up to fix in both.

@jgehrcke thoughts?

Contributor Author (@guptaNswati)

Pinging all the reviewers to help resolve this. Most other comments seem to be non-blocking.

@jgehrcke @elezar @shivamerla @ArangoGutierrez

Member

this will also deviate from how its done in device-plugin.

Let's get this PR into the state we want it for the DRA driver without trying to match the device plugin implementation exactly. That is to say, consider this the next iteration of the device plugin implementation. The learnings that we take from this should be applied to the device plugin and iterated on from there.

One motivation is that this is a new feature and we don't have users currently expecting a certain behaviour. This gives us a lot more flexibility to change behaviour than if we had an existing implementation in use.

Contributor Author (@guptaNswati)

This is an implementation detail of how we want to handle NVML return errors, so it won't impact the end user whether we do it the way the device plugin does or improve it here. For the user, all of these errors mean an unhealthy device that won't be published as part of the ResourceSlice. And IMO, ret != nvml.SUCCESS is a valid check that covers a wide range of errors, since all subsequent calls depend on it.

For me, fine-grained error handling makes more sense once we have proper remediation in place, where the right action can be recommended based on the given error.

continue
}

klog.Infof("Processing event %+v", event)
Collaborator

Do you have an example of this log message, to show what it would look like in practice?

Contributor Author (@guptaNswati)

I1027 18:03:39.337167       1 device_health.go:179] Processing event {Device:{Handle:0xe151b6b2fef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}

@ArangoGutierrez ArangoGutierrez self-requested a review October 27, 2025 08:06
@klueska klueska added this to the v25.12.0 milestone Nov 5, 2025
Comment on lines +101 to +112
nvmlDeviceHealthMonitor, err := newNvmlDeviceHealthMonitor(ctx, config, state.allocatable, state.nvdevlib)
if err != nil {
	return nil, fmt.Errorf("start nvmlDeviceHealthMonitor: %w", err)
}

driver.nvmlDeviceHealthMonitor = nvmlDeviceHealthMonitor

driver.wg.Add(1)
go func() {
	defer driver.wg.Done()
	driver.deviceHealthEvents(ctx, config.flags.nodeName)
}()
Member

I think splitting creation from starting the health monitor makes sense.

This would lend itself to the following interface for a generic DeviceHealthMonitor:

type DeviceHealthMonitor interface {
    Start(context.Context) error
    Stop() error
    Unhealthy() <-chan *AllocatableDevice
}

And we can update the implementation here to:

Suggested change
nvmlDeviceHealthMonitor, err := newNvmlDeviceHealthMonitor(ctx, config, state.allocatable, state.nvdevlib)
if err != nil {
	return nil, fmt.Errorf("start nvmlDeviceHealthMonitor: %w", err)
}
driver.nvmlDeviceHealthMonitor = nvmlDeviceHealthMonitor
driver.wg.Add(1)
go func() {
	defer driver.wg.Done()
	driver.deviceHealthEvents(ctx, config.flags.nodeName)
}()

deviceHealthMonitor, err := newNvmlDeviceHealthMonitor(config, state.allocatable, state.nvdevlib)
if err != nil {
	return nil, fmt.Errorf("failed to create device health monitor: %w", err)
}
if err := deviceHealthMonitor.Start(ctx); err != nil {
	return nil, fmt.Errorf("failed to start device health monitor: %w", err)
}
driver.deviceHealthMonitor = deviceHealthMonitor
driver.wg.Add(1)
go func() {
	defer driver.wg.Done()
	driver.deviceHealthEvents(ctx, config.flags.nodeName)
}()

Note that I have dropped the nvml prefix from everything except the function creating the monitor. This minimizes the changes when we add additional monitors.

We could even go so far as to ALSO rename newNvmlDeviceHealthMonitor to NewDeviceHealthMonitor (possibly with functional options) so that this code is already in the desired state (modulo additional constructor arguments).

state *DeviceState
pulock *flock.Flock
healthcheck *healthcheck
nvmlDeviceHealthMonitor *nvmlDeviceHealthMonitor
Member

Since we know that we will add new health monitors, let's rename the member to be more generic.

Suggested change
nvmlDeviceHealthMonitor *nvmlDeviceHealthMonitor
deviceHealthMonitor *nvmlDeviceHealthMonitor

Comment on lines +44 to +84
func newNvmlDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*nvmlDeviceHealthMonitor, error) {
	if nvdevlib.nvmllib == nil {
		return nil, fmt.Errorf("nvml library is nil")
	}

	ctx, cancel := context.WithCancel(ctx)

	m := &nvmlDeviceHealthMonitor{
		nvmllib:       nvdevlib.nvmllib,
		unhealthy:     make(chan *AllocatableDevice, len(allocatable)),
		cancelContext: cancel,
	}

	if ret := m.nvmllib.Init(); ret != nvml.SUCCESS {
		cancel()
		return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
	}

	klog.V(6).Info("creating NVML events for device health monitor")
	eventSet, ret := m.nvmllib.EventSetCreate()
	if ret != nvml.SUCCESS {
		_ = m.nvmllib.Shutdown()
		cancel()
		return nil, fmt.Errorf("failed to create event set: %w", ret)
	}
	m.eventSet = eventSet

	m.uuidToDeviceMap = getUUIDToDeviceMap(allocatable)

	m.getDeviceByParentGiCiMap = getDeviceByParentGiCiMap(allocatable)

	klog.V(6).Info("registering NVML events for device health monitor")
	m.registerEventsForDevices()

	skippedXids := m.xidsToSkip(config.flags.additionalXidsToIgnore)
	klog.V(6).Info("started device health monitoring")
	m.wg.Add(1)
	go m.run(ctx, skippedXids)

	return m, nil
}
Member

As mentioned at the call site, I think it simplifies the implementation if we split the construction of a monitor from actually starting it. What about updating this to:

Suggested change
func newNvmlDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*nvmlDeviceHealthMonitor, error) {
	if nvdevlib.nvmllib == nil {
		return nil, fmt.Errorf("nvml library is nil")
	}
	ctx, cancel := context.WithCancel(ctx)
	m := &nvmlDeviceHealthMonitor{
		nvmllib:       nvdevlib.nvmllib,
		unhealthy:     make(chan *AllocatableDevice, len(allocatable)),
		cancelContext: cancel,
	}
	if ret := m.nvmllib.Init(); ret != nvml.SUCCESS {
		cancel()
		return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	klog.V(6).Info("creating NVML events for device health monitor")
	eventSet, ret := m.nvmllib.EventSetCreate()
	if ret != nvml.SUCCESS {
		_ = m.nvmllib.Shutdown()
		cancel()
		return nil, fmt.Errorf("failed to create event set: %w", ret)
	}
	m.eventSet = eventSet
	m.uuidToDeviceMap = getUUIDToDeviceMap(allocatable)
	m.getDeviceByParentGiCiMap = getDeviceByParentGiCiMap(allocatable)
	klog.V(6).Info("registering NVML events for device health monitor")
	m.registerEventsForDevices()
	skippedXids := m.xidsToSkip(config.flags.additionalXidsToIgnore)
	klog.V(6).Info("started device health monitoring")
	m.wg.Add(1)
	go m.run(ctx, skippedXids)
	return m, nil
}

func newNvmlDeviceHealthMonitor(config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*nvmlDeviceHealthMonitor, error) {
	if nvdevlib.nvmllib == nil {
		return nil, fmt.Errorf("nvml library is nil")
	}
	if ret := nvdevlib.nvmllib.Init(); ret != nvml.SUCCESS {
		return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	defer func() {
		_ = nvdevlib.nvmllib.Shutdown()
	}()
	m := &nvmlDeviceHealthMonitor{
		nvmllib:                  nvdevlib.nvmllib,
		unhealthy:                make(chan *AllocatableDevice, len(allocatable)),
		uuidToDeviceMap:          getUUIDToDeviceMap(allocatable),
		getDeviceByParentGiCiMap: getDeviceByParentGiCiMap(allocatable),
		skippedXids:              xidsToSkip(config.flags.additionalXidsToIgnore),
	}
	return m, nil
}

func (m *nvmlDeviceHealthMonitor) Start(ctx context.Context) (rerr error) {
	if ret := m.nvmllib.Init(); ret != nvml.SUCCESS {
		return fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	// We shutdown nvml if this function returns with an error.
	defer func() {
		if rerr != nil {
			_ = m.nvmllib.Shutdown()
		}
	}()
	klog.V(6).Info("creating NVML events for device health monitor")
	eventSet, ret := m.nvmllib.EventSetCreate()
	if ret != nvml.SUCCESS {
		return fmt.Errorf("failed to create event set: %w", ret)
	}
	ctx, cancel := context.WithCancel(ctx)
	m.cancelContext = cancel
	m.eventSet = eventSet
	klog.V(6).Info("registering NVML events for device health monitor")
	m.registerEventsForDevices()
	klog.V(6).Info("started device health monitoring")
	m.wg.Add(1)
	go m.run(ctx, m.skippedXids)
	return nil
}

Note that we now cleanly separate nvml errors that occur during setup and those that we have to handle while waiting for events.

unhealthy chan *AllocatableDevice
cancelContext context.CancelFunc
uuidToDeviceMap map[string]*AllocatableDevice
getDeviceByParentGiCiMap map[string]map[uint32]map[uint32]*AllocatableDevice
Member

Let's rename this member since it's not a function. I would even go so far as to add a type:

type placementToAllocatableDeviceMap map[string]map[uint32]map[uint32]*AllocatableDevice

that we can attach functions to (get(string, uint32, uint32), update(string, uint32, uint32, *AllocatableDevice)) to simplify implementations below.

This would mean that we update the member definition to something like:

Suggested change
getDeviceByParentGiCiMap map[string]map[uint32]map[uint32]*AllocatableDevice
deviceByParentGiCiMap placementToAllocatableDeviceMap
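A self-contained sketch of the proposed type with both accessors (AllocatableDevice is stubbed here with a plain UUID field so the snippet compiles on its own; the real type comes from the driver):

```go
package main

import "fmt"

// AllocatableDevice is a stub standing in for the driver's real type.
type AllocatableDevice struct{ UUID string }

type placementToAllocatableDeviceMap map[string]map[uint32]map[uint32]*AllocatableDevice

// get returns the device at (uuid, gi, ci), or nil if any level is missing.
// Indexing a nil inner map is legal in Go, so no existence checks are needed.
func (p placementToAllocatableDeviceMap) get(uuid string, gi, ci uint32) *AllocatableDevice {
	return p[uuid][gi][ci]
}

// put creates the intermediate maps on demand before storing the device.
func (p placementToAllocatableDeviceMap) put(uuid string, gi, ci uint32, d *AllocatableDevice) {
	if _, ok := p[uuid]; !ok {
		p[uuid] = make(map[uint32]map[uint32]*AllocatableDevice)
	}
	if _, ok := p[uuid][gi]; !ok {
		p[uuid][gi] = make(map[uint32]*AllocatableDevice)
	}
	p[uuid][gi][ci] = d
}

func main() {
	m := placementToAllocatableDeviceMap{}
	m.put("GPU-1", 0, 0, &AllocatableDevice{UUID: "MIG-a"})
	fmt.Println(m.get("GPU-1", 0, 0).UUID) // MIG-a
	fmt.Println(m.get("GPU-2", 0, 0))      // <nil>
}
```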

wg sync.WaitGroup
}

func newNvmlDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*nvmlDeviceHealthMonitor, error) {
Member

Related to my other comment(s), what about introducing a top-level factory method where we can add additional construction logic. Something like:

Suggested change
func newNvmlDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*nvmlDeviceHealthMonitor, error) {

func NewDeviceHealthMonitor(config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (DeviceHealthMonitor, error) {
	return newNvmlDeviceHealthMonitor(config, allocatable, nvdevlib)
}

This may not look too important, but we could even add logic to instantiate a mock monitor based on an envvar, or move the feature-flag logic from driver.go here and return a NULL (no-op) monitor in the case where the feature is not enabled. This has the advantage of simplifying the call site.

eventSet nvml.EventSet
unhealthy chan *AllocatableDevice
cancelContext context.CancelFunc
uuidToDeviceMap map[string]*AllocatableDevice
Member

Do we need two maps? Are the entries in the more complete map below not a subset of this?


m.wg.Wait()

_ = m.eventSet.Free()
Member

I know this is probably like this in the device plugin too, but should we at least log an error here?

continue
}

if event.EventType != nvml.EventTypeXidCriticalError {
Member

(also for the device plugin) Can we track the follow-up action of checking why we don't handle other supported event types? Do we have any indication of whether we ever see the log message below?

Comment on lines +195 to +202
var affectedDevice *AllocatableDevice
pMap, ok1 := m.getDeviceByParentGiCiMap[eventUUID]
if ok1 {
	giMap, ok2 := pMap[event.GpuInstanceId]
	if ok2 {
		affectedDevice = giMap[event.ComputeInstanceId]
	}
}
Member

Assuming we define a type for this map, we could simplify this as:

Suggested change
var affectedDevice *AllocatableDevice
pMap, ok1 := m.getDeviceByParentGiCiMap[eventUUID]
if ok1 {
	giMap, ok2 := pMap[event.GpuInstanceId]
	if ok2 {
		affectedDevice = giMap[event.ComputeInstanceId]
	}
}

affectedDevice := m.getDeviceByParentGiCiMap.get(
	eventUUID,
	event.GpuInstanceId,
	event.ComputeInstanceId,
)

Alternatively, getting an element from an initialized map is "safe", so the following could also work:

            affectedDevice := m.getDeviceByParentGiCiMap[eventUUID][event.GpuInstanceId][event.ComputeInstanceId]
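The chained lookup above relies on Go returning zero values for missing map keys; a minimal self-contained illustration:

```go
package main

import "fmt"

func main() {
	// Looking up a missing key in a map returns the value type's zero
	// value, and indexing a nil map is legal (it just yields another
	// zero value), so chained lookups never panic.
	m := map[string]map[uint32]map[uint32]*struct{ Name string }{}

	d := m["missing-uuid"][0][0] // no existence checks needed
	fmt.Println(d == nil)        // true
}
```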

Comment on lines +257 to +263
if _, ok := deviceByParentGiCiMap[parentUUID]; !ok {
	deviceByParentGiCiMap[parentUUID] = make(map[uint32]map[uint32]*AllocatableDevice)
}
if _, ok := deviceByParentGiCiMap[parentUUID][giID]; !ok {
	deviceByParentGiCiMap[parentUUID][giID] = make(map[uint32]*AllocatableDevice)
}
deviceByParentGiCiMap[parentUUID][giID][ciID] = d
Member

Assuming we define a type for this map we could factor this into:

func (p placementToAllocatableDeviceMap) put(uuid string, gi uint32, ci uint32, d *AllocatableDevice) {
	if _, ok := p[uuid]; !ok {
		p[uuid] = make(map[uint32]map[uint32]*AllocatableDevice)
	}
	if _, ok := p[uuid][gi]; !ok {
		p[uuid][gi] = make(map[uint32]*AllocatableDevice)
	}
	p[uuid][gi][ci] = d
}

and then replace the implementation here with:

Suggested change
if _, ok := deviceByParentGiCiMap[parentUUID]; !ok {
	deviceByParentGiCiMap[parentUUID] = make(map[uint32]map[uint32]*AllocatableDevice)
}
if _, ok := deviceByParentGiCiMap[parentUUID][giID]; !ok {
	deviceByParentGiCiMap[parentUUID][giID] = make(map[uint32]*AllocatableDevice)
}
deviceByParentGiCiMap[parentUUID][giID][ciID] = d

deviceByParentGiCiMap.put(parentUUID, giID, ciID, d)

Comment on lines +233 to +251
for _, d := range allocatable {
	var parentUUID string
	var giID, ciID uint32

	switch d.Type() {
	case GpuDeviceType:
		parentUUID = d.UUID()
		if parentUUID == "" {
			continue
		}
		giID = FullGPUInstanceID
		ciID = FullGPUInstanceID
	case MigDeviceType:
		parentUUID = d.Mig.parent.UUID
		if parentUUID == "" {
			continue
		}
		giID = d.Mig.giInfo.Id
		ciID = d.Mig.ciInfo.Id
Member

We could also move the put function below to the individual case statements:

Suggested change
for _, d := range allocatable {
	var parentUUID string
	var giID, ciID uint32

	switch d.Type() {
	case GpuDeviceType:
		parentUUID = d.UUID()
		if parentUUID == "" {
			continue
		}
		giID = FullGPUInstanceID
		ciID = FullGPUInstanceID
	case MigDeviceType:
		parentUUID = d.Mig.parent.UUID
		if parentUUID == "" {
			continue
		}
		giID = d.Mig.giInfo.Id
		ciID = d.Mig.ciInfo.Id

for _, d := range allocatable {
	switch d.Type() {
	case GpuDeviceType:
		uuid := d.UUID()
		if uuid == "" {
			continue
		}
		deviceByParentGiCiMap.put(uuid, FullGPUInstanceID, FullGPUInstanceID, d)
	case MigDeviceType:
		uuid := d.Mig.parent.UUID
		if uuid == "" {
			continue
		}
		deviceByParentGiCiMap.put(uuid, d.Mig.giInfo.Id, d.Mig.ciInfo.Id, d)

(we could even rename the put to something more meaningful and add the uuid == "" check there)

Comment on lines +295 to +305
// Add the list of hardcoded disabled (ignored) XIDs:
// http://docs.nvidia.com/deploy/xid-errors/index.html#topic_4
// Application errors: the GPU should still be healthy.
ignoredXids := []uint64{
	13,  // Graphics Engine Exception
	31,  // GPU memory page fault
	43,  // GPU stopped processing
	45,  // Preemptive cleanup, due to previous errors
	68,  // Video processor exception
	109, // Context Switch Timeout Error
}
Member

I know that this list is taken from the device plugin, but handling them explicitly at such a low level is quite difficult to customize. Does it make sense to not port this logic over, and instead define these as the default value for the envvar where we expose this to the user?

(I'm happy to do this as a follow-up though).

Just as a note for completeness: the GKE device plugin doesn't use this strategy for XIDs. The only element in their list is XID 48 (see https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/0509b1f9f4b9a357b44ba65e7b508ded8bd5ecf0/pkg/gpu/nvidia/health_check/health_checker.go#L59).
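For reference, merging the defaults with the --set kubeletPlugin.gpus.additionalXidsToIgnore="n1,n2" flag described in this PR implies parsing roughly like the sketch below. The helper name and exact parsing rules (trimming, silently dropping malformed entries) are assumptions for illustration, not the PR's actual code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// defaultIgnoredXids mirrors the hardcoded application-error XIDs above.
var defaultIgnoredXids = []uint64{13, 31, 43, 45, 68, 109}

// xidsToSkip merges the defaults with a user-supplied comma-separated
// list such as "89,140", silently dropping anything unparsable.
func xidsToSkip(additional string) map[uint64]bool {
	skip := make(map[uint64]bool)
	for _, xid := range defaultIgnoredXids {
		skip[xid] = true
	}
	for _, field := range strings.Split(additional, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		xid, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			continue // ignore malformed entries
		}
		skip[xid] = true
	}
	return skip
}

func main() {
	skip := xidsToSkip("89, 140, bogus")
	fmt.Println(skip[43], skip[89], skip[140], skip[48]) // true true true false
}
```

Moving the default list into the envvar's default value, as suggested above, would mean this merge step collapses into parsing a single user-visible string.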
