Skip to content
Open
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions cmd/gpu-kubelet-plugin/allocatable.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,11 +22,20 @@ import (
resourceapi "k8s.io/api/resource/v1"
)

const (
// Healthy means that the device is healthy.
Healthy = "Healthy"
// Unhealthy means that the device is unhealthy.
Unhealthy = "Unhealthy"
)

type AllocatableDevices map[string]*AllocatableDevice

type AllocatableDevice struct {
Gpu *GpuInfo
Mig *MigDeviceInfo
// Defined similarly as https://pkg.go.dev/k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1#Healthy
Health string
}

func (d AllocatableDevice) Type() string {
Expand Down Expand Up @@ -96,3 +105,7 @@ func (d AllocatableDevices) UUIDs() []string {
slices.Sort(uuids)
return uuids
}

func (d *AllocatableDevice) IsHealthy() bool {
return d.Health == Healthy
}
286 changes: 286 additions & 0 deletions cmd/gpu-kubelet-plugin/device_health.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,286 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package main

import (
"context"
"fmt"
"strconv"
"strings"
"sync"

"github.com/NVIDIA/go-nvml/pkg/nvml"
"k8s.io/klog/v2"
)

const (
FullGPUInstanceID uint32 = 0xFFFFFFFF
)

type deviceHealthMonitor struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
type deviceHealthMonitor struct {
type nvmlDeviceHealthMonitor struct {

nvmllib nvml.Interface
eventSet nvml.EventSet
unhealthy chan *AllocatableDevice
cancelContext context.CancelFunc
uuidToDeviceMap map[string]*AllocatableDevice
wg sync.WaitGroup
}

func newDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*deviceHealthMonitor, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't just create the health monitor, it also starts it. Why not separate construction from actually starting the monitor? This will allow us to more easily swap out monitors (e.g. with NVSentinel) once we have those ready.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QUestion: What functionality is required for a DeviceHealthMonitor from the perspective of the driver. Would returning the following interface here make sense:

type DeviceHealthMonitor interface {
    Start() error
    Stop()
    Unhealthy() <-chan *AllocatableDevice
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes an interface will allow more flexibility in case we want to integrate with other tools and for testing also but right now we only have NVML based implementation. A struct is more simpler and direct. Once we have the generic health API (which will be soon), all this will go away anyway as driver will not be responsible for starting or stopping the monitor, rather it will either talk to a grpc or http endpoint.

same logic to constructing and starting. Though for readability it may be better to separate the constructor and start.

if nvdevlib.nvmllib == nil {
return nil, fmt.Errorf("nvml library is nil")
}

ctx, cancel := context.WithCancel(ctx)

m := &deviceHealthMonitor{
nvmllib: nvdevlib.nvmllib,
unhealthy: make(chan *AllocatableDevice, len(allocatable)),
cancelContext: cancel,
}

if ret := m.nvmllib.Init(); ret != nvml.SUCCESS {
cancel()
return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
}
Comment on lines +57 to +60
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a deferred nvmllib.Shutdown() here instead of handling each return separately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defer wont work here, as nvmllib.Shutdown() should not be called untill healthMonitor is closed.


klog.V(6).Info("creating NVML events for device health monitor")
eventSet, ret := m.nvmllib.EventSetCreate()
if ret != nvml.SUCCESS {
_ = m.nvmllib.Shutdown()
cancel()
return nil, fmt.Errorf("failed to create event set: %w", ret)
}
m.eventSet = eventSet

m.uuidToDeviceMap = getUUIDToDeviceMap(allocatable)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Does allocatable change at all, or is it constant for the lifetime of the plugin?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do update the health status of an allocatable when we get a unhealthy notification.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that wasn't clear. Does the content of allocatable change in any way that would invalidate the map that we construct here meaning that it would need to be reconstructed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the only changing content of an allocatable device is its health status and that wont impact it. This map ([uuid] = device) is constructed in the very beginning when the plugin is started.

and we iterate on it to register event on each device (currently, we dont check any status here. we may do it in future on remediation) and send unhealthy notification.


klog.V(6).Info("registering NVML events for device health monitor")
m.registerDevicesForEvents()

skippedXids := m.xidsToSkip(config.flags.additionalXidsToIgnore)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the device plugin implementation setting skippedIDs to all disables health checking. Why is this logic different here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

health checking is feature-gated here, so it can be skipped without this flag. I should add a comment to make it explicit

klog.V(6).Info("started device health monitoring")
m.wg.Add(1)
go m.run(ctx, skippedXids)

return m, nil
}

func (m *deviceHealthMonitor) registerDevicesForEvents() {
eventMask := uint64(nvml.EventTypeXidCriticalError | nvml.EventTypeDoubleBitEccError | nvml.EventTypeSingleBitEccError)

processedUUIDs := make(map[string]bool)

for uuid, dev := range m.uuidToDeviceMap {
var u string
if dev.Type() == MigDeviceType {
u = dev.Mig.parent.UUID
} else {
u = uuid
}

if processedUUIDs[u] {
continue
}
gpu, ret := m.nvmllib.DeviceGetHandleByUUID(u)
if ret != nvml.SUCCESS {
klog.Infof("Unable to get device handle from UUID[%s]: %v; marking it as unhealthy", u, ret)
m.unhealthy <- dev
continue
}

supportedEvents, ret := gpu.GetSupportedEventTypes()
if ret != nvml.SUCCESS {
klog.Infof("unable to determine the supported events for %s: %v; marking it as unhealthy", u, ret)
m.unhealthy <- dev
continue
}

ret = gpu.RegisterEvents(eventMask&supportedEvents, m.eventSet)
if ret == nvml.ERROR_NOT_SUPPORTED {
klog.Warningf("Device %v is too old to support healthchecking.", u)
}
if ret != nvml.SUCCESS {
klog.Infof("unable to register events for %s: %v; marking it as unhealthy", u, ret)
m.unhealthy <- dev
}
processedUUIDs[u] = true
}
}

func (m *deviceHealthMonitor) Stop() {
if m == nil {
return
}
klog.V(6).Info("stopping health monitor")

if m.cancelContext != nil {
m.cancelContext()
}

m.wg.Wait()

_ = m.eventSet.Free()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is probably like this in the device plugin too, but should we at least log an error here?


if ret := m.nvmllib.Shutdown(); ret != nvml.SUCCESS {
klog.Warningf("failed to shutdown NVML: %v", ret)
}
close(m.unhealthy)
}

func getUUIDToDeviceMap(allocatable AllocatableDevices) map[string]*AllocatableDevice {
uuidToDeviceMap := make(map[string]*AllocatableDevice)

for _, d := range allocatable {
if u := d.UUID(); u != "" {
uuidToDeviceMap[u] = d
}
}
return uuidToDeviceMap
}

func (m *deviceHealthMonitor) run(ctx context.Context, skippedXids map[uint64]bool) {
defer m.wg.Done()
for {
select {
case <-ctx.Done():
klog.V(6).Info("Stopping event-driven GPU health monitor...")
return
default:
event, ret := m.eventSet.Wait(5000)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed to look this up to understand what it's doing.

Ref docs are here: https://docs.nvidia.com/deploy/nvml-api/group__nvmlEvents.html#group__nvmlEvents_1g9714b0ca9a34c7a7780f87fee16b205c

argument is timeout in ms.

errors to be handled:

Image

if ret == nvml.ERROR_TIMEOUT {
continue
}
if ret != nvml.SUCCESS {
klog.Infof("Error waiting for event: %v; Marking all devices as unhealthy", ret)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not seem right.

Let us specifically handle the NVML_ERROR_GPU_IS_LOST case, and perform this 'nuke option' only then. Then we can also have a more precise log message (emitted on error level).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why only NVML_ERROR_GPU_IS_LOST If we are not just checking !NVML_SUCCESS, TIMEOUT and UNKNOWN also seems imp.

If you see the returns of all event methods, these all seem common error types. I can do a helper of errortype.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/return.go#L42

but then we mark the device unhealthy in each case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while this will allow for proper error handling, this will also deviate from how its done in device-plugin. i can take it up as a follow-up to fix in both.

@jgehrcke thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pinging all the reviewers to help resolve this. Most other seems to be non-blocking.

@jgehrcke @elezar @shivamerla @ArangoGutierrez

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will also deviate from how its done in device-plugin.

Let's get this PR into the state we want it for the DRA driver without trying to match the device plugin implementation exactly. That is to say, consider this the next iteration of the device plugin implementation. The learnings that we take from this should be applied to the device plugin and iterated on from there.

One motivation is that this is a new feature and we don't have users currently expecting a certain behaviour. This gives us a lot more flexibility to change behaviour than if we had an existing implementation in use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is implementation detail of how we want to handle NVML return errors so it wont impact the end user anyway whether we do it similar to device-plugin or improve it here. For user, all these error means unhealthy device and wont be published as part of ResourceSlice. And IMO, ret != nvml.SUCCESS is a valid check and covers a wide range of errors as all subsequent calls are dependent on this check.

For me, these make sense when we have proper remediation in place where based on the given error, the right action can be recommended.

for _, dev := range m.uuidToDeviceMap {
m.unhealthy <- dev
}
continue
}

if event.EventType != nvml.EventTypeXidCriticalError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also for the device plugin) Can we track the follow-up action of checking why we don't check other supported types? Do whe have any indication of whether we ever see the log message below?

klog.Infof("Skipping non-nvmlEventTypeXidCriticalError event: %+v", event)
continue
}

if skippedXids[event.EventData] {
klog.Infof("Skipping event %+v", event)
continue
}

klog.Infof("Processing event %+v", event)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example for this log message, how it would look like in practice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I1027 18:03:39.337167       1 device_health.go:179] Processing event {Device:{Handle:0xe151b6b2fef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}

eventUUID, ret := event.Device.GetUUID()
if ret != nvml.SUCCESS {
klog.Infof("Failed to determine uuid for event %v: %v; Marking all devices as unhealthy.", event, ret)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems bit aggressive to mark all devices as unhealthy on one invalid event. Should we log this as error and continue watch? cc @klueska

Copy link
Contributor Author

@guptaNswati guptaNswati Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@jgehrcke jgehrcke Oct 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also say we should log an error and otherwise proceed. Even if what you've shown here is currently done in the device plugin.

By the way, this would have been a perfect opportunity for a better code comment in the legacy code:

Image

No blame, no emotions -- but this code comment does not add information in addition to the code. The interesting bit would be if there is a specific, non-obvious reason / relevance for this style of treatment.

For example, I wonder if this code was introduced to fix a bug. I wonder if it is even ever exercised.

The way it's written and with the git blame history, it seems like it was potentially added initially (defensively) and may never have been exercised in production.

for _, dev := range m.uuidToDeviceMap {
m.unhealthy <- dev
}
continue
}

var affectedDevice *AllocatableDevice
if event.GpuInstanceId != FullGPUInstanceID && event.ComputeInstanceId != FullGPUInstanceID {
affectedDevice = m.findMigDevice(eventUUID, event.GpuInstanceId, event.ComputeInstanceId)
} else {
affectedDevice = m.findGpuDevice(eventUUID)
}

if affectedDevice == nil {
klog.Infof("Ignoring event for unexpected device (UUID: %s, GI: %d, CI: %d)", eventUUID, event.GpuInstanceId, event.ComputeInstanceId)
continue
}

klog.Infof("Sending unhealthy notification for device %s due to event type: %v and event data: %d", affectedDevice.UUID(), event.EventType, event.EventData)
m.unhealthy <- affectedDevice
}
}
}

func (m *deviceHealthMonitor) Unhealthy() <-chan *AllocatableDevice {
return m.unhealthy
}

func (m *deviceHealthMonitor) findMigDevice(parentUUID string, giID uint32, ciID uint32) *AllocatableDevice {
for _, device := range m.uuidToDeviceMap {
if device.Type() != MigDeviceType {
continue
}

if device.Mig.parent.UUID == parentUUID &&
device.Mig.giInfo.Id == giID &&
device.Mig.ciInfo.Id == ciID {
return device
}
}
return nil
}

func (m *deviceHealthMonitor) findGpuDevice(uuid string) *AllocatableDevice {
device, exists := m.uuidToDeviceMap[uuid]
if exists && device.Type() == GpuDeviceType {
return device
}
return nil
}

// getAdditionalXids returns a list of additional Xids to skip from the specified string.
// The input is treaded as a comma-separated string and all valid uint64 values are considered as Xid values.
// Invalid values nare ignored.
func getAdditionalXids(input string) []uint64 {
if input == "" {
return nil
}

var additionalXids []uint64
klog.V(6).Infof("Creating a list of additional xids to ignore: [%s]", input)
for _, additionalXid := range strings.Split(input, ",") {
trimmed := strings.TrimSpace(additionalXid)
if trimmed == "" {
continue
}
xid, err := strconv.ParseUint(trimmed, 10, 64)
if err != nil {
klog.Infof("Ignoring malformed Xid value %v: %v", trimmed, err)
continue
}
additionalXids = append(additionalXids, xid)
}

return additionalXids
}

func (m *deviceHealthMonitor) xidsToSkip(additionalXids string) map[uint64]bool {
ignoredXids := []uint64{
13, // Graphics Engine Exception
31, // GPU memory page fault
43, // GPU stopped processing
45, // Preemptive cleanup, due to previous errors
68, // Video processor exception
109, // Context Switch Timeout Error
}

skippedXids := make(map[uint64]bool)
for _, id := range ignoredXids {
skippedXids[id] = true
}

for _, additionalXid := range getAdditionalXids(additionalXids) {
skippedXids[additionalXid] = true
}
return skippedXids
}
8 changes: 8 additions & 0 deletions cmd/gpu-kubelet-plugin/device_state.go
Original file line number Diff line number Diff line change
Expand Up @@ -550,6 +550,14 @@ func GetOpaqueDeviceConfigs(
return resultConfigs, nil
}

func (s *DeviceState) UpdateDeviceHealthStatus(device *AllocatableDevice, healthstatus string) {
s.Lock()
defer s.Unlock()

device.Health = healthstatus
klog.Infof("Update device sattus:%s healthstatus", device.UUID())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.Infof("Update device sattus:%s healthstatus", device.UUID())
klog.Infof("Update device status:%s healthstatus", device.UUID())

Also, did you mean to output the actual status?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix typo. no just the device.uuid

Copy link
Collaborator

@jgehrcke jgehrcke Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typos, spaces etc.


Can you please explain / give an impression of how often this message would be logged? What actions/events will trigger this message to be logged?

When we understand that, let's have a brief think about a suitable verbosity level.

One of the questions that I have here: is this only logged for a health flip? Or could this also be logged for healthy->healthy? Should we have a different message/level depending on the state transition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now, its only logged when healthy becomes unhealthy. But in future when we have remediation, it will also change to unhealthy->healthy.

}

// TODO: Dynamic MIG is not yet supported with structured parameters.
// Refactor this to allow for the allocation of statically partitioned MIG
// devices.
Expand Down
Loading