Conversation

@guptaNswati (Contributor) commented Oct 17, 2025

Addresses #360 by adding a preliminary health check similar to the one in https://github.com/NVIDIA/k8s-device-plugin.

  • Clean follow-up of GPU health check #545
  • Addresses review comments
  • Republishes the ResourceSlice on a health event
  • Feature gate: --set featureGates.DeviceHealthCheck=true
  • XIDs: --set kubeletPlugin.gpus.additionalXidsToIgnore="n1,n2"

Test logs:

I1017 20:16:19.799738       1 device_health.go:179] Processing event {Device:{Handle:0xe4aea6b2fef0} EventType:8 EventData:43 GpuInstanceId:7 ComputeInstanceId:0}
I1017 20:16:19.799821       1 device_health.go:192] Event for mig device: &{<nil> 0x40006a0070 Healthy}
I1017 20:16:19.799843       1 device_health.go:202] Sending unhealthy notification for device MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 due to event type: 8 and event data: 43
W1017 20:16:19.799870       1 driver.go:219] Received unhealthy notification for device: MIG-4d806f22-346a-5a1d-ac01-86b505cdf485
I1017 20:16:19.799884       1 device_state.go:558] Update device sattus:MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 healthstatus
I1017 20:16:19.799891       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c150 Healthy}
I1017 20:16:19.799955       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0ee0 Healthy}
I1017 20:16:19.799966       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0f50 Healthy}
I1017 20:16:19.799974       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0230 Healthy}
I1017 20:16:19.799983       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a02a0 Healthy}
I1017 20:16:19.799992       1 driver.go:230] Device is healthy, added to resoureslice: &{0x40002ae000 <nil> Healthy}
I1017 20:16:19.800000       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c0e0 Healthy}
W1017 20:16:19.800009       1 driver.go:233] Device:MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 with uuid:&{%!s(*main.GpuInfo=<nil>) %!s(*main.MigDeviceInfo=&{MIG-4d806f22-346a-5a1d-ac01-86b505cdf485 1g.12gb 0x40002503c0 0x400049c130 0x4000610090 0x400045ce40 0x4000610150 0x40004840f0 0009:01:00.0 0x40002fbf50}) Unhealthy} is unhealthy
I1017 20:16:19.800021       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a00e0 Healthy}
I1017 20:16:19.800031       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c1c0 Healthy}
I1017 20:16:19.800043       1 driver.go:230] Device is healthy, added to resoureslice: &{0x40002503c0 <nil> Healthy}
I1017 20:16:19.800050       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c000 Healthy}
I1017 20:16:19.800056       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0310 Healthy}
I1017 20:16:19.800063       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a0150 Healthy}
I1017 20:16:19.800069       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x400040c070 Healthy}
I1017 20:16:19.800076       1 driver.go:230] Device is healthy, added to resoureslice: &{<nil> 0x40006a01c0 Healthy}
I1017 20:16:19.800084       1 driver.go:237] [Rebulishing resourceslice with healthy devices
I1017 20:16:19.800142       1 driver.go:247] Successfully republished resources without unhealthy device MIG-4d806f22-346a-5a1d-ac01-86b505cdf485:
I1017 20:16:19.800178       1 resourceslicecontroller.go:647] "Existing slices" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" obsolete=[] current=["sc-starwars-mab9-b00-gpu.nvidia.com-kmcts"]
I1017 20:16:19.800209       1 resourceslicecontroller.go:724] "Need to update slice" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" slice="sc-starwars-mab9-b00-gpu.nvidia.com-kmcts" matchIndex=0
I1017 20:16:19.800225       1 resourceslicecontroller.go:727] "Completed comparison" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" numObsolete=0 numMatchedSlices=1 numChangedMatchedSlices=1 numNewSlices=0
I1017 20:16:19.800230       1 resourceslicecontroller.go:743] "Kept generation because at most one update API call is necessary" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" generation=1
I1017 20:16:19.805795       1 round_trippers.go:632] "Response" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" verb="PUT" url="https://10.96.0.1:443/apis/resource.k8s.io/v1beta1/resourceslices/sc-starwars-mab9-b00-gpu.nvidia.com-kmcts" status="200 OK" milliseconds=5
I1017 20:16:19.806290       1 resourceslicecontroller.go:779] "Updated existing resource slice" logger="ResourceSlice controller" poolName="sc-starwars-mab9-b00" slice="sc-starwars-mab9-b00-gpu.nvidia.com-kmcts"
I1017 20:16:19.807922       1 resourceslicecontroller.go:500] "ResourceSlice update" logger="ResourceSlice controller" slice="sc-starwars-mab9-b00-gpu.nvidia.com-kmcts" diff=<
	@@ -3,8 +3,8 @@
	   "name": "sc-starwars-mab9-b00-gpu.nvidia.com-kmcts",
	   "generateName": "sc-starwars-mab9-b00-gpu.nvidia.com-",
	   "uid": "7184f664-55c1-412d-bd99-4b46e7c23846",
	-  "resourceVersion": "59000652",
	-  "generation": 1,
	+  "resourceVersion": "59001011",
	+  "generation": 2,
	   "creationTimestamp": "2025-10-17T20:14:42Z",
	   "ownerReferences": [
	    {
	@@ -20,7 +20,7 @@
	     "manager": "gpu-kubelet-plugin",
	     "operation": "Update",
	     "apiVersion": "resource.k8s.io/v1beta1",
	-    "time": "2025-10-17T20:14:42Z",
	+    "time": "2025-10-17T20:16:19Z
	.....
	.....
	
		     "name": "gpu-1",
	     "attributes": {
	      "architecture": {
	@@ -161,7 +496,7 @@
	     }
	    },
	    {
	-    "name": "gpu-0-mig-19-0-1",
	+    "name": "gpu-0-mig-19-1-1",
	     "attributes": {
	      "architecture": {
	       "string": "Hopper"
	@@ -197,56 +532,230 @@
	       "string": "mig"
	      },
	      "uuid": {
	-      "string": "MIG-4d806f22-346a-5a1d-ac01-86b505cdf485"
	-     }
	-    },

TL;DR: this device had an event and was not added back to the ResourceSlice: MIG-4d806f22-346a-5a1d-ac01-86b505cdf485

The device is picked back up when the driver is restarted.

Signed-off-by: Swati Gupta <[email protected]>
@copy-pr-bot (bot) commented Oct 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@guptaNswati guptaNswati mentioned this pull request Oct 17, 2025
Signed-off-by: Swati Gupta <[email protected]>
klog.Infof("Processing event %+v", event)
eventUUID, ret := event.Device.GetUUID()
if ret != nvml.SUCCESS {
klog.Infof("Failed to determine uuid for event %v: %v; Marking all devices as unhealthy.", event, ret)
Contributor

This seems a bit aggressive: marking all devices as unhealthy on one invalid event. Should we log this as an error and continue the watch? cc @klueska

@guptaNswati (Contributor Author) commented Oct 21, 2025

@jgehrcke (Collaborator) commented Oct 25, 2025

I'd also say we should log an error and otherwise proceed. Even if what you've shown here is currently done in the device plugin.

By the way, this would have been a perfect opportunity for a better code comment in the legacy code:

[image: screenshot of the legacy code comment]

No blame, no emotions -- but this code comment does not add information in addition to the code. The interesting bit would be if there is a specific, non-obvious reason / relevance for this style of treatment.

For example, I wonder if this code was introduced to fix a bug. I wonder if it is even ever exercised.

The way it's written and with the git blame history, it seems like it was potentially added initially (defensively) and may never have been exercised in production.

Signed-off-by: Swati Gupta <[email protected]>
Signed-off-by: Swati Gupta <[email protected]>
}

if err := d.pluginhelper.PublishResources(ctx, resources); err != nil {
klog.Errorf("Failed to publish resources after device health status update: %v", err)
@jgehrcke (Collaborator) commented Oct 23, 2025

Naturally, I wonder why this error is only handled by logging a message. This might be the correct (or currently best) decision. But please walk the reader of the code through the arguments for ending up with that decision, using a brief code comment.

I'd like to understand thoughts here in the lines of "do not retry, because" or "this is implicitly retried later, because" or "we could crash the plugin here, but" or "the old resource slice state remains published, which is good enough", and so on. I am sure you've thought through all this.

None of this is obvious to the reader of the code, and I'd really love to have some help here to convince myself that this is the right way to handle this error.

(as always, it will pay off to document the current argumentation for our future selves, even if it's incomplete or so)

Contributor Author (@guptaNswati)

Retrying makes sense. And if the retries also fail, it should be a fatal error, since it means the existing ResourceSlice is outdated.

klog.Warningf("Received unhealthy notification for device: %s", uuid)

if !device.IsHealthy() {
klog.V(6).Infof("Device: %s is aleady marked unhealthy. Skip republishing resourceslice", uuid)
Collaborator

In practice, how often could we see a log message like this?

What I see here right now: we can get the d.deviceHealthMonitor.Unhealthy() event multiple times, even if we had already processed that device before. I wonder how often we should expect that to happen.

Contributor Author (@guptaNswati)

There can be event bursts. @lalitadithya showed me logs of device-plugin XID errors in a cluster which clearly showed the same event logged multiple times.

@lalitadithya, is it possible to share the log here?

@elezar (Member) left a comment

Thanks for the patience in waiting on a review @guptaNswati.

release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
if err != nil {
klog.Errorf("error acquiring prep/unprep lock for health status update: %v", err)
continue
Member

So this means that we don't mark the device as unhealthy in this case. Is that the intended behaviour?

Contributor Author (@guptaNswati)

We should not abort on the event. We should probably just log the error and update the device status anyway.

klog.V(6).Info("Successfully republished resources without unhealthy device")
}

release()
Member

If we factor this logic into a function, then we could use a deferred call to release() after taking the lock. This may be less error prone if we do ever add code paths that return from this logic.

Comment on lines +151 to +157
&cli.StringFlag{
Name: "additional-xids-to-ignore",
Usage: "A comma-separated list of additional XIDs to ignore.",
Value: "",
Destination: &flags.additionalXidsToIgnore,
EnvVars: []string{"ADDITIONAL_XIDs_TO_IGNORE"},
},
Member

In NVIDIA/k8s-device-plugin#1443 we added a list of EXPLICIT XIDs to consider fatal. This allows a user to:

  1. Specify ignored XIDs (including all)
  2. Specify SPECIFIC XIDs that are considered fatal (including all).

The important thing here is that it allows users to override the list of hard-coded XIDs that we currently track.

Contributor Author (@guptaNswati)

I am aware of this and was planning to do it as a follow-up, since it only recently got merged.

}
m.eventSet = eventSet

m.uuidToDeviceMap = getUUIDToDeviceMap(allocatable)
Member

Question: Does allocatable change at all, or is it constant for the lifetime of the plugin?

Contributor Author (@guptaNswati)

We do update the health status of an allocatable device when we get an unhealthy notification.

Member

Sorry, that wasn't clear. Does the content of allocatable change in any way that would invalidate the map we construct here, meaning that it would need to be reconstructed?

Contributor Author (@guptaNswati)

No, the only changing content of an allocatable device is its health status, and that won't impact the map. This map ([uuid] = device) is constructed at the very beginning, when the plugin is started.

We iterate over it to register events on each device (currently, we don't check any status here; we may do that in the future for remediation) and to send unhealthy notifications.

continue
}

release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
Collaborator

Oh! Acquiring this lock here is a big decision.

Here, I really expect a concise / precise code comment explaining convincingly

  • why this lock must be acquired
  • how we guarantee that release() is always called

Maybe start by explaining what you think will go wrong when we do not acquire this lock here at this point.

Contributor Author (@guptaNswati)

This lock will prevent an unhealthy device from being allocated in a simultaneous NodePrepare() call.

Contributor Author (@guptaNswati)

I need to double-check that the lock is released on any failure.

@jgehrcke (Collaborator) commented Oct 24, 2025

We should be mindful of acquiring this lock. Let's do it only for a strong reason. When we introduced that lock we named it prepare/unprepare lock because it's meant for that purpose.

Maybe we should use it here, too -- but let's pretty please thoroughly identify that strong reason, and put it into a few English sentences that are convincing.

I am not yet satisfied here by our arguments. We need to discuss the alternatives considered; I need more help, please, to understand why this is the correct approach (I really mean that -- it's not that I know what we should do -- but I sense that we don't, as a collective, yet understand what we really want to do here).

@jgehrcke (Collaborator) commented Oct 24, 2025

I am thinking about the reason that you vaguely describe:

This lock will prevent an unhealthy device from being allocated in a simultaneous NodePrepare() call.

And I wonder: will it?

Can you describe a sequence of events where acquiring the lock would actually make that incoming nodePrepareResources() call not allocate an unhealthy device?

Here, we only update the ResourceSlice to un-announce any unhealthy device, right?

The moment we're done with that, we release the lock and the unchanged nodePrepareResources() call (that was waiting for us, hanging in lock acquisition) proceeds, trying to get what it wanted to get anyway.

When the kubelet already wants us to allocate an unhealthy device, then updating the ResourceSlice won't undo that (that gun is "fired" so to say), at least not reliably. Is that correct?

Let's say kubelet sends a PrepareResourceClaims() call our way and that may potentially result in allocating an unhealthy device (because the unhealthy notification arrived very recently).

Then I believe if we want a safe method to prevent that from happening we need to have logic within nodePrepareResource(). Does that make sense? (I quite literally don't know yet, you all have thought more about this than I did).

Specifically, I am wondering: do we need to call this new IsHealthy() somewhere in

func (s *DeviceState) prepareDevices(ctx context.Context, claim *resourceapi.ResourceClaim) (PreparedDevices, error) {
?

We can always actively fail a prepareDevices() if the only matching device turns out to be unhealthy right before we would have allocated it.

Maybe this is already similar?

		device, exists := s.allocatable[result.Device]
		if !exists {
			return nil, fmt.Errorf("requested device is not allocatable: %v", result.Device)
		}

I might be completely off in what I say here. The point is that I need to convince myself that really we know what we're doing here and that we only acquire the PU lock if we absolutely have to -- because it has potentially devastating downsides to do it unnecessarily often.

This discussion is really important to align on, and I'd love for you all to help me understand why what we're doing here is the right thing.

Contributor Author (@guptaNswati)

Good point. There has to be a check in prepareDevices() to actually fail the allocation for an unhealthy device. But we still need the lock, since two things happen on an unhealthy event:

// device health status is updated
- d.state.UpdateDeviceHealthStatus(device, Unhealthy) => this update is imp for above check to be reliable
// ResourceSlice is republished
- if err := d.pluginhelper.PublishResources(ctx, resources); err != nil => for future allocations

Acquiring the nodePrepare lock will make sure we are not simultaneously updating the health status and allocating the same device.

I have updated device_state.go to check the device health status before proceeding with the allocation:

if featuregates.Enabled(featuregates.DeviceHealthCheck) {
	if device.Health == Unhealthy {
		return nil, fmt.Errorf("requested device is not healthy: %v", result.Device)
	}
}

Collaborator

Acquiring the nodePrepare lock will make sure we are not simultaneously updating the health status and allocating the same device.

But, does it? I might just not see it -- which sequence of events did you imagine, maybe?

Let me try to re-frame the problem space that I believe you are thinking about when you say "simultaneously not updating the health status and also allocating the same device".

You want to make sure that we don't allocate a device that's knowingly unhealthy. Does that sound about right?

I would agree -- let's try to do that :)

What do you think about the following mental model?

  1. Let's agree that this is generally a best-effort problem space -- between the device becoming unhealthy and us knowing there is unpredictable amount of time passing.
  2. Let's agree that this is an event propagation problem space.

I think we can also say:

  • Deep within func (s *DeviceState) Prepare() we must know, as early as possible, when a device is unhealthy. That's our final event consumer.
  • Event producer and event propagation pipeline are orthogonal to that.

The best we can do here is that we perform event propagation at minimal latency, towards the consumer.

Detour on latency

Because I find it interesting.

Between a GPU actually becoming unhealthy and us calling UpdateDeviceHealthStatus() as little time as possible should pass.

That is something that we can do and should do in this PR: minimize the fraction of event propagation latency that we control here.

Zooming out, this is always going to be a best effort strategy. Right now we seem to subscribe to GPU events more or less directly (but already now the event propagates through layers: there's the physical device, there's NVML, and then there's our process, and other layers that I am not even aware of). In the future, with NVSentinel, we're talking about event propagation across even more components.

Generally, there's a timeline attached that is unpredictable and we want to make sure we minimize latency at all steps. Here is one way to maybe think about that timeline:

T_1) a GPU actually becoming unhealthy (the 'physical' event)
T_2) us detecting it in component A
T_3) emitting an event in component A towards component B
T_4) potential-black-box-event-propagation -> after all emitted towards our GPU kubelet plugin
T_5) responding to that incoming event in our GPU kubelet plugin

Let's agree on the following: there's always a chance that we call func (s *DeviceState) Prepare() for a device after T_1 and before T_5.

We may just want to make sure we respond to the unhealthy event in the moment we receive it. We want to make sure we propagate to all its consumers ASAP.

That propagation itself does not need to be lock-protected; it just must happen fast.

Misc

Tangential, but potentially a helpful perspective: the "protected" data structure here (for now) is just device.Health and it's only mutated from within UpdateDeviceHealthStatus().

Then, also tangential, I notice that there already is a bit of a synchronization in your current patch proposal: UpdateDeviceHealthStatus() acquires the mutex on the DeviceState singleton:

func (s *DeviceState) UpdateDeviceHealthStatus(device *AllocatableDevice, hs HealthStatus) {
	s.Lock()
	defer s.Unlock()
...

We acquire the same mutex in func (s *DeviceState) Prepare().

Contributor Author (@guptaNswati)

This is very descriptive. Thank you for the effort, JP. Yes, this is the flow of events I imagined when I was convinced that we needed to hold the lock when updating the device health status and republishing the ResourceSlice.

unhealthy event -> lock to prevent any other operation on the device (mark device unhealthy + republish RS) -> unlock -> device unhealthy = my logic

But as you pointed out, I already acquire the lock when updating the device status, so the above lock is not really needed for the republish, and this is all best effort anyway in avoiding a potential race between T_1 and T_5.

release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
if err != nil {
klog.Errorf("error acquiring prep/unprep lock for health status update: %v", err)
continue
@jgehrcke (Collaborator) commented Oct 23, 2025

As far as I understand, with continue we abort processing this event.

As a code reader, I want to get help here from a code comment -- to understand why it is okay to abort processing that. Will we re-process it later? How much later?

defer s.Unlock()

device.Health = healthstatus
klog.Infof("Update device sattus:%s healthstatus", device.UUID())
@jgehrcke (Collaborator) commented Oct 23, 2025

typos, spaces etc.


Can you please explain / give an impression of how often this message would be logged? What actions/events will trigger this message to be logged?

When we understand that, let's have a brief think about a suitable verbosity level.

One of the questions that I have here: is this only logged for a health flip? Or could this also be logged for healthy->healthy? Should we have a different message/level depending on the state transition?

Contributor Author (@guptaNswati)

Right now, it's only logged when a healthy device becomes unhealthy. But in the future, when we have remediation, it will also log the unhealthy->healthy transition.

{
Default: false,
PreRelease: featuregate.Alpha,
Version: version.MajorMinor(25, 8),
Collaborator

I really don't know -- why is 25, 8 what we want here?

Contributor Author (@guptaNswati)

This should be (25, 12), based on the December release?

case <-ctx.Done():
klog.V(6).Info("Stop processing device health notifications")
return
case device, ok := <-d.deviceHealthMonitor.Unhealthy():
Collaborator

What do you think about wrapping the logic in this case in a function and then call that function here. I think that would greatly help readability. It may also help with calling release() more reliably (if needed -- see discussion here).

@guptaNswati (Contributor Author)

Thanks for the patience in waiting on a review @guptaNswati.

Thanks to you for the review @elezar.

Thanks to @jgehrcke also.

The PR looks more lively :-D

@guptaNswati (Contributor Author)

@elezar @jgehrcke I incorporated some of the suggestions, and skipped some that I didn't think were critical. Please review again and let me know if something is blocking.

I will come back to it: #689 (comment)

  • Add a retry
  • Make sure release() is called on all failure and success paths
  • Make errors fatal

@jgehrcke (Collaborator)

Thanks!

@elezar @jgehrcke I incorporated some of the suggestions, and skipped some that I didn't think were critical. Please review again and let me know if something is blocking.

Let's for now assume that some things are blocking. Let us please carefully and mindfully drive each discussion thread to completion. I would really appreciate if you can drive that and make it easy for all of us by going through the questions with point-by-point replies. 📜 This is quite a bit of communication effort, but it will absolutely pay off when bringing the new code into production environments.

klog.V(6).Info("Stopping event-driven GPU health monitor...")
return
default:
event, ret := m.eventSet.Wait(5000)
Collaborator

I needed to look this up to understand what it's doing.

Ref docs are here: https://docs.nvidia.com/deploy/nvml-api/group__nvmlEvents.html#group__nvmlEvents_1g9714b0ca9a34c7a7780f87fee16b205c

The argument is a timeout in milliseconds.

Errors to be handled:

[image: screenshot of the documented return values]

continue
}
if ret != nvml.SUCCESS {
klog.Infof("Error waiting for event: %v; Marking all devices as unhealthy", ret)
Collaborator

This does not seem right.

Let us specifically handle the NVML_ERROR_GPU_IS_LOST case, and perform this 'nuke option' only then. Then we can also have a more precise log message (emitted on error level).

Contributor Author (@guptaNswati)

Why only NVML_ERROR_GPU_IS_LOST? If we are not just checking ret != NVML_SUCCESS, then TIMEOUT and UNKNOWN also seem important.

If you look at the return values of all the event methods, these all seem to be common error types. I can write a helper for the error type.

Contributor Author (@guptaNswati)

Something like https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/return.go#L42

But then do we mark the device unhealthy in each case?

Contributor Author (@guptaNswati)

While this would allow for proper error handling, it would also deviate from how it's done in the device plugin. I can take it up as a follow-up to fix in both.

@jgehrcke thoughts?

Contributor Author (@guptaNswati)

Pinging all the reviewers to help resolve this. Most other comments seem to be non-blocking.

@jgehrcke @elezar @shivamerla @ArangoGutierrez

Member

this will also deviate from how its done in device-plugin.

Let's get this PR into the state we want it for the DRA driver without trying to match the device plugin implementation exactly. That is to say, consider this the next iteration of the device plugin implementation. The learnings that we take from this should be applied to the device plugin and iterated on from there.

One motivation is that this is a new feature and we don't have users currently expecting a certain behaviour. This gives us a lot more flexibility to change behaviour than if we had an existing implementation in use.

Contributor Author (@guptaNswati)

This is an implementation detail of how we want to handle NVML return errors, so it won't impact the end user whether we do it the way the device plugin does or improve it here. For the user, all of these errors mean an unhealthy device that won't be published as part of the ResourceSlice. And IMO, ret != nvml.SUCCESS is a valid check that covers a wide range of errors, since all subsequent calls depend on it.

For me, fine-grained error handling makes more sense once we have proper remediation in place, where the right action can be recommended based on the given error.

continue
}

klog.Infof("Processing event %+v", event)
Collaborator

Do you have an example of this log message, to show what it would look like in practice?

Contributor Author (@guptaNswati)

I1027 18:03:39.337167       1 device_health.go:179] Processing event {Device:{Handle:0xe151b6b2fef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}

@ArangoGutierrez ArangoGutierrez self-requested a review October 27, 2025 08:06
@klueska klueska added this to the v25.12.0 milestone Nov 5, 2025
Comment on lines +101 to +112
nvmlDeviceHealthMonitor, err := newNvmlDeviceHealthMonitor(ctx, config, state.allocatable, state.nvdevlib)
if err != nil {
	return nil, fmt.Errorf("start nvmlDeviceHealthMonitor: %w", err)
}

driver.nvmlDeviceHealthMonitor = nvmlDeviceHealthMonitor

driver.wg.Add(1)
go func() {
	defer driver.wg.Done()
	driver.deviceHealthEvents(ctx, config.flags.nodeName)
}()
Member

I think splitting creation from starting the health monitor makes sense.

This would lend itself to the following interface for a generic DeviceHealthMonitor:

type DeviceHealthMonitor interface {
    Start(context.Context) error
    Stop() error
    Unhealthy() <-chan *AllocatableDevice
}

And we can update the implementation here to:

Suggested change
nvmlDeviceHealthMonitor, err := newNvmlDeviceHealthMonitor(ctx, config, state.allocatable, state.nvdevlib)
if err != nil {
	return nil, fmt.Errorf("start nvmlDeviceHealthMonitor: %w", err)
}
driver.nvmlDeviceHealthMonitor = nvmlDeviceHealthMonitor
driver.wg.Add(1)
go func() {
	defer driver.wg.Done()
	driver.deviceHealthEvents(ctx, config.flags.nodeName)
}()

deviceHealthMonitor, err := newNvmlDeviceHealthMonitor(config, state.allocatable, state.nvdevlib)
if err != nil {
	return nil, fmt.Errorf("failed to create device health monitor: %w", err)
}
if err := deviceHealthMonitor.Start(ctx); err != nil {
	return nil, fmt.Errorf("failed to start device health monitor: %w", err)
}
driver.deviceHealthMonitor = deviceHealthMonitor
driver.wg.Add(1)
go func() {
	defer driver.wg.Done()
	driver.deviceHealthEvents(ctx, config.flags.nodeName)
}()

Note that I have dropped the nvml prefix from everything except the function creating the monitor. This minimizes the changes when we add additional monitors.

We could even go so far as to ALSO rename newNvmlDeviceHealthMonitor to NewDeviceHealthMonitor (possibly with functional options) so that this code is already in the desired state (modulo additional constructor arguments).

state *DeviceState
pulock *flock.Flock
healthcheck *healthcheck
nvmlDeviceHealthMonitor *nvmlDeviceHealthMonitor
Member

Since we know that we will add new health monitors, let's rename the member to be more generic.

Suggested change
nvmlDeviceHealthMonitor *nvmlDeviceHealthMonitor
deviceHealthMonitor *nvmlDeviceHealthMonitor

Comment on lines +44 to +84
func newNvmlDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*nvmlDeviceHealthMonitor, error) {
	if nvdevlib.nvmllib == nil {
		return nil, fmt.Errorf("nvml library is nil")
	}

	ctx, cancel := context.WithCancel(ctx)

	m := &nvmlDeviceHealthMonitor{
		nvmllib:       nvdevlib.nvmllib,
		unhealthy:     make(chan *AllocatableDevice, len(allocatable)),
		cancelContext: cancel,
	}

	if ret := m.nvmllib.Init(); ret != nvml.SUCCESS {
		cancel()
		return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
	}

	klog.V(6).Info("creating NVML events for device health monitor")
	eventSet, ret := m.nvmllib.EventSetCreate()
	if ret != nvml.SUCCESS {
		_ = m.nvmllib.Shutdown()
		cancel()
		return nil, fmt.Errorf("failed to create event set: %w", ret)
	}
	m.eventSet = eventSet

	m.uuidToDeviceMap = getUUIDToDeviceMap(allocatable)

	m.getDeviceByParentGiCiMap = getDeviceByParentGiCiMap(allocatable)

	klog.V(6).Info("registering NVML events for device health monitor")
	m.registerEventsForDevices()

	skippedXids := m.xidsToSkip(config.flags.additionalXidsToIgnore)
	klog.V(6).Info("started device health monitoring")
	m.wg.Add(1)
	go m.run(ctx, skippedXids)

	return m, nil
}
Member

As mentioned at the call site, I think it simplifies the implementation if we split the construction of a monitor from actually starting it. What about updating this to:

Suggested change
func newNvmlDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*nvmlDeviceHealthMonitor, error) {
	if nvdevlib.nvmllib == nil {
		return nil, fmt.Errorf("nvml library is nil")
	}
	ctx, cancel := context.WithCancel(ctx)
	m := &nvmlDeviceHealthMonitor{
		nvmllib:       nvdevlib.nvmllib,
		unhealthy:     make(chan *AllocatableDevice, len(allocatable)),
		cancelContext: cancel,
	}
	if ret := m.nvmllib.Init(); ret != nvml.SUCCESS {
		cancel()
		return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	klog.V(6).Info("creating NVML events for device health monitor")
	eventSet, ret := m.nvmllib.EventSetCreate()
	if ret != nvml.SUCCESS {
		_ = m.nvmllib.Shutdown()
		cancel()
		return nil, fmt.Errorf("failed to create event set: %w", ret)
	}
	m.eventSet = eventSet
	m.uuidToDeviceMap = getUUIDToDeviceMap(allocatable)
	m.getDeviceByParentGiCiMap = getDeviceByParentGiCiMap(allocatable)
	klog.V(6).Info("registering NVML events for device health monitor")
	m.registerEventsForDevices()
	skippedXids := m.xidsToSkip(config.flags.additionalXidsToIgnore)
	klog.V(6).Info("started device health monitoring")
	m.wg.Add(1)
	go m.run(ctx, skippedXids)
	return m, nil
}

func newNvmlDeviceHealthMonitor(config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*nvmlDeviceHealthMonitor, error) {
	if nvdevlib.nvmllib == nil {
		return nil, fmt.Errorf("nvml library is nil")
	}
	if ret := nvdevlib.nvmllib.Init(); ret != nvml.SUCCESS {
		return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	defer func() {
		_ = nvdevlib.nvmllib.Shutdown()
	}()
	m := &nvmlDeviceHealthMonitor{
		nvmllib:                  nvdevlib.nvmllib,
		unhealthy:                make(chan *AllocatableDevice, len(allocatable)),
		uuidToDeviceMap:          getUUIDToDeviceMap(allocatable),
		getDeviceByParentGiCiMap: getDeviceByParentGiCiMap(allocatable),
		skippedXids:              xidsToSkip(config.flags.additionalXidsToIgnore),
	}
	return m, nil
}

func (m *nvmlDeviceHealthMonitor) Start(ctx context.Context) (rerr error) {
	if ret := m.nvmllib.Init(); ret != nvml.SUCCESS {
		return fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	// We shutdown nvml if this function returns with an error.
	defer func() {
		if rerr != nil {
			_ = m.nvmllib.Shutdown()
		}
	}()
	klog.V(6).Info("creating NVML events for device health monitor")
	eventSet, ret := m.nvmllib.EventSetCreate()
	if ret != nvml.SUCCESS {
		return fmt.Errorf("failed to create event set: %w", ret)
	}
	ctx, cancel := context.WithCancel(ctx)
	m.cancelContext = cancel
	m.eventSet = eventSet
	klog.V(6).Info("registering NVML events for device health monitor")
	m.registerEventsForDevices()
	klog.V(6).Info("started device health monitoring")
	m.wg.Add(1)
	go m.run(ctx, m.skippedXids)
	return nil
}

Note that we now cleanly separate nvml errors that occur during setup and those that we have to handle while waiting for events.

unhealthy chan *AllocatableDevice
cancelContext context.CancelFunc
uuidToDeviceMap map[string]*AllocatableDevice
getDeviceByParentGiCiMap map[string]map[uint32]map[uint32]*AllocatableDevice
Member

Let's rename this member since it's not a function. I would even go so far as to add a type:

type placementToAllocatableDeviceMap map[string]map[uint32]map[uint32]*AllocatableDevice

that we can attach functions to (get(string, uint32, uint32), update(string, uint32, uint32, *AllocatableDevice)) to simplify implementations below.

This would mean that we update the member definition to something like:

Suggested change
getDeviceByParentGiCiMap map[string]map[uint32]map[uint32]*AllocatableDevice
deviceByParentGiCiMap placementToAllocatableDeviceMap
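A self-contained sketch of the proposed type with both accessors (AllocatableDevice is stubbed here with a plain UUID field so the snippet compiles on its own; the real type comes from the driver):

```go
package main

import "fmt"

// AllocatableDevice is a stub standing in for the driver's real type.
type AllocatableDevice struct{ UUID string }

type placementToAllocatableDeviceMap map[string]map[uint32]map[uint32]*AllocatableDevice

// get returns the device at (uuid, gi, ci), or nil if any level is missing.
// Indexing a nil inner map is legal in Go, so no existence checks are needed.
func (p placementToAllocatableDeviceMap) get(uuid string, gi, ci uint32) *AllocatableDevice {
	return p[uuid][gi][ci]
}

// put creates the intermediate maps on demand before storing the device.
func (p placementToAllocatableDeviceMap) put(uuid string, gi, ci uint32, d *AllocatableDevice) {
	if _, ok := p[uuid]; !ok {
		p[uuid] = make(map[uint32]map[uint32]*AllocatableDevice)
	}
	if _, ok := p[uuid][gi]; !ok {
		p[uuid][gi] = make(map[uint32]*AllocatableDevice)
	}
	p[uuid][gi][ci] = d
}

func main() {
	m := placementToAllocatableDeviceMap{}
	m.put("GPU-1", 0, 0, &AllocatableDevice{UUID: "MIG-a"})
	fmt.Println(m.get("GPU-1", 0, 0).UUID) // MIG-a
	fmt.Println(m.get("GPU-2", 0, 0))      // <nil>
}
```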

wg sync.WaitGroup
}

func newNvmlDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*nvmlDeviceHealthMonitor, error) {
Member

Related to my other comment(s), what about introducing a top-level factory method where we can add additional construction logic. Something like:

Suggested change
func newNvmlDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*nvmlDeviceHealthMonitor, error) {

func NewDeviceHealthMonitor(config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (DeviceHealthMonitor, error) {
	return newNvmlDeviceHealthMonitor(config, allocatable, nvdevlib)
}

This may not look too important, but we could even add logic to instantiate a mock monitor based on an envvar, or move the feature-flag logic from driver.go here and return a NULL (no-op) monitor in the case where the feature is not enabled. This has the advantage of simplifying the call site.

eventSet nvml.EventSet
unhealthy chan *AllocatableDevice
cancelContext context.CancelFunc
uuidToDeviceMap map[string]*AllocatableDevice
Member

Do we need two maps? Are the entries in the more complete map below not a subset of this?


m.wg.Wait()

_ = m.eventSet.Free()
Member

I know this is probably like this in the device plugin too, but should we at least log an error here?

continue
}

if event.EventType != nvml.EventTypeXidCriticalError {
Member

(also for the device plugin) Can we track the follow-up action of checking why we don't handle other supported event types? Do we have any indication of whether we ever see the log message below?

Comment on lines +195 to +202
var affectedDevice *AllocatableDevice
pMap, ok1 := m.getDeviceByParentGiCiMap[eventUUID]
if ok1 {
	giMap, ok2 := pMap[event.GpuInstanceId]
	if ok2 {
		affectedDevice = giMap[event.ComputeInstanceId]
	}
}
Member

Assuming we define a type for this map, we could simplify this as:

Suggested change
var affectedDevice *AllocatableDevice
pMap, ok1 := m.getDeviceByParentGiCiMap[eventUUID]
if ok1 {
	giMap, ok2 := pMap[event.GpuInstanceId]
	if ok2 {
		affectedDevice = giMap[event.ComputeInstanceId]
	}
}

affectedDevice := m.getDeviceByParentGiCiMap.get(
	eventUUID,
	event.GpuInstanceId,
	event.ComputeInstanceId,
)

Alternatively, getting an element from an initialized map is "safe", so the following could also work:

            affectedDevice := m.getDeviceByParentGiCiMap[eventUUID][event.GpuInstanceId][event.ComputeInstanceId]
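The chained lookup above relies on Go returning zero values for missing map keys; a minimal self-contained illustration:

```go
package main

import "fmt"

func main() {
	// Looking up a missing key in a map returns the value type's zero
	// value, and indexing a nil map is legal (it just yields another
	// zero value), so chained lookups never panic.
	m := map[string]map[uint32]map[uint32]*struct{ Name string }{}

	d := m["missing-uuid"][0][0] // no existence checks needed
	fmt.Println(d == nil)        // true
}
```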

Comment on lines +257 to +263
if _, ok := deviceByParentGiCiMap[parentUUID]; !ok {
	deviceByParentGiCiMap[parentUUID] = make(map[uint32]map[uint32]*AllocatableDevice)
}
if _, ok := deviceByParentGiCiMap[parentUUID][giID]; !ok {
	deviceByParentGiCiMap[parentUUID][giID] = make(map[uint32]*AllocatableDevice)
}
deviceByParentGiCiMap[parentUUID][giID][ciID] = d
Member

Assuming we define a type for this map we could factor this into:

func (p placementToAllocatableDeviceMap) put(uuid string, gi uint32, ci uint32, d *AllocatableDevice) {
	if _, ok := p[uuid]; !ok {
		p[uuid] = make(map[uint32]map[uint32]*AllocatableDevice)
	}
	if _, ok := p[uuid][gi]; !ok {
		p[uuid][gi] = make(map[uint32]*AllocatableDevice)
	}
	p[uuid][gi][ci] = d
}

and then replace the implementation here with:

Suggested change
if _, ok := deviceByParentGiCiMap[parentUUID]; !ok {
	deviceByParentGiCiMap[parentUUID] = make(map[uint32]map[uint32]*AllocatableDevice)
}
if _, ok := deviceByParentGiCiMap[parentUUID][giID]; !ok {
	deviceByParentGiCiMap[parentUUID][giID] = make(map[uint32]*AllocatableDevice)
}
deviceByParentGiCiMap[parentUUID][giID][ciID] = d

deviceByParentGiCiMap.put(parentUUID, giID, ciID, d)

Comment on lines +233 to +251
for _, d := range allocatable {
	var parentUUID string
	var giID, ciID uint32

	switch d.Type() {
	case GpuDeviceType:
		parentUUID = d.UUID()
		if parentUUID == "" {
			continue
		}
		giID = FullGPUInstanceID
		ciID = FullGPUInstanceID
	case MigDeviceType:
		parentUUID = d.Mig.parent.UUID
		if parentUUID == "" {
			continue
		}
		giID = d.Mig.giInfo.Id
		ciID = d.Mig.ciInfo.Id
Member

We could also move the put function below to the individual case statements:

Suggested change
for _, d := range allocatable {
	var parentUUID string
	var giID, ciID uint32

	switch d.Type() {
	case GpuDeviceType:
		parentUUID = d.UUID()
		if parentUUID == "" {
			continue
		}
		giID = FullGPUInstanceID
		ciID = FullGPUInstanceID
	case MigDeviceType:
		parentUUID = d.Mig.parent.UUID
		if parentUUID == "" {
			continue
		}
		giID = d.Mig.giInfo.Id
		ciID = d.Mig.ciInfo.Id

for _, d := range allocatable {
	switch d.Type() {
	case GpuDeviceType:
		uuid := d.UUID()
		if uuid == "" {
			continue
		}
		deviceByParentGiCiMap.put(uuid, FullGPUInstanceID, FullGPUInstanceID, d)
	case MigDeviceType:
		uuid := d.Mig.parent.UUID
		if uuid == "" {
			continue
		}
		deviceByParentGiCiMap.put(uuid, d.Mig.giInfo.Id, d.Mig.ciInfo.Id, d)

(we could even rename the put to something more meaningful and add the uuid == "" check there)

Comment on lines +295 to +305
// Add the list of hardcoded disabled (ignored) XIDs:
// http://docs.nvidia.com/deploy/xid-errors/index.html#topic_4
// Application errors: the GPU should still be healthy.
ignoredXids := []uint64{
	13,  // Graphics Engine Exception
	31,  // GPU memory page fault
	43,  // GPU stopped processing
	45,  // Preemptive cleanup, due to previous errors
	68,  // Video processor exception
	109, // Context Switch Timeout Error
}
Member

I know that this list is taken from the device plugin, but handling them explicitly at such a low level is quite difficult to customize. Does it make sense to not port this logic over, and instead define these as the default value for the envvar where we expose this to the user?

(I'm happy to do this as a follow-up though).

Just as a note for completeness: the GKE device plugin doesn't use this strategy for XIDs. The only element in their list is XID 48 (see https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/0509b1f9f4b9a357b44ba65e7b508ded8bd5ecf0/pkg/gpu/nvidia/health_check/health_checker.go#L59).
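For reference, merging the defaults with the --set kubeletPlugin.gpus.additionalXidsToIgnore="n1,n2" flag described in this PR implies parsing roughly like the sketch below. The helper name and exact parsing rules (trimming, silently dropping malformed entries) are assumptions for illustration, not the PR's actual code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// defaultIgnoredXids mirrors the hardcoded application-error XIDs above.
var defaultIgnoredXids = []uint64{13, 31, 43, 45, 68, 109}

// xidsToSkip merges the defaults with a user-supplied comma-separated
// list such as "89,140", silently dropping anything unparsable.
func xidsToSkip(additional string) map[uint64]bool {
	skip := make(map[uint64]bool)
	for _, xid := range defaultIgnoredXids {
		skip[xid] = true
	}
	for _, field := range strings.Split(additional, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		xid, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			continue // ignore malformed entries
		}
		skip[xid] = true
	}
	return skip
}

func main() {
	skip := xidsToSkip("89, 140, bogus")
	fmt.Println(skip[43], skip[89], skip[140], skip[48]) // true true true false
}
```

Moving the default list into the envvar's default value, as suggested above, would mean this merge step collapses into parsing a single user-visible string.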
