Add GPU health check #689
Conversation
Signed-off-by: Swati Gupta <[email protected]>
Signed-off-by: Swati Gupta <[email protected]>
klog.Infof("Processing event %+v", event)
eventUUID, ret := event.Device.GetUUID()
if ret != nvml.SUCCESS {
	klog.Infof("Failed to determine uuid for event %v: %v; Marking all devices as unhealthy.", event, ret)
This seems a bit aggressive: marking all devices as unhealthy on one invalid event. Should we log this as an error and continue watching? cc @klueska
It's how it's done in the device plugin: https://github.com/NVIDIA/k8s-device-plugin/blob/main/internal/rm/health.go#L147
I'd also say we should log an error and otherwise proceed. Even if what you've shown here is currently done in the device plugin.
By the way, this would have been a perfect opportunity for a better code comment in the legacy code:
No blame, no emotions -- but this code comment does not add information beyond the code itself. The interesting bit would be whether there is a specific, non-obvious reason / relevance for this style of treatment.
For example, I wonder if this code was introduced to fix a bug. I wonder if it is even ever exercised.
The way it's written and with the git blame history, it seems like it was potentially added initially (defensively) and may never have been exercised in production.
Signed-off-by: Swati Gupta <[email protected]>
Signed-off-by: Swati Gupta <[email protected]>
}

if err := d.pluginhelper.PublishResources(ctx, resources); err != nil {
	klog.Errorf("Failed to publish resources after device health status update: %v", err)
Naturally, I wonder why this error is only handled by logging a message. This might be the correct (or currently best) decision. But please walk the reader of the code through the arguments for ending up with that decision, using a brief code comment.
I'd like to understand thoughts here in the lines of "do not retry, because" or "this is implicitly retried later, because" or "we could crash the plugin here, but" or "the old resource slice state remains published, which is good enough", and so on. I am sure you've thought through all this.
None of this is obvious to the reader of the code, and I'd really love to have some help here to convince myself that this is the right way to handle this error.
(as always, it will pay off to document the current argumentation for our future selves, even if it's incomplete or so)
Retrying makes sense, and if the retries also fail, it should be a fatal error, since it means the existing ResourceSlice is outdated.
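For illustration, a rough sketch of that retry-then-fatal idea at the call site quoted above (the retry count and backoff are made up, not the driver's actual behavior):

// Sketch: retry a few times before giving up; on exhaustion treat it as fatal,
// because the currently published ResourceSlice is known to be stale.
var publishErr error
for attempt := 1; attempt <= 5; attempt++ {
	if publishErr = d.pluginhelper.PublishResources(ctx, resources); publishErr == nil {
		break
	}
	klog.Errorf("PublishResources attempt %d failed: %v", attempt, publishErr)
	time.Sleep(time.Duration(attempt) * time.Second) // naive linear backoff
}
if publishErr != nil {
	// Crashing lets the plugin restart and republish from a clean state.
	klog.Fatalf("failed to publish resources after retries: %v", publishErr)
}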
cmd/gpu-kubelet-plugin/driver.go (Outdated)

klog.Warningf("Received unhealthy notification for device: %s", uuid)

if !device.IsHealthy() {
	klog.V(6).Infof("Device: %s is aleady marked unhealthy. Skip republishing resourceslice", uuid)
In practice, how often could we see a log message like this?
What I see here right now: we can get the d.deviceHealthMonitor.Unhealthy() event multiple times, even if we had already processed that device before. I wonder how often we should expect that to happen.
There can be event bursts. @lalitadithya showed me logs of device plugin XID errors in a cluster which clearly showed the same event logged multiple times.
@lalitadithya, is it possible to share the log here?
Thanks for the patience in waiting on a review @guptaNswati.
cmd/gpu-kubelet-plugin/driver.go (Outdated)

release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
if err != nil {
	klog.Errorf("error acquiring prep/unprep lock for health status update: %v", err)
	continue
So this means that we don't mark the device as unhealthy in this case. Is that the intended behaviour?
We should not abort processing the event. We should probably just log the error and update the device status anyway.
cmd/gpu-kubelet-plugin/driver.go (Outdated)

klog.V(6).Info("Successfully republished resources without unhealthy device")
}

release()
If we factor this logic into a function, then we could use a deferred call to release() after taking the lock. This may be less error prone if we do ever add code paths that return from this logic.
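A minimal sketch of that refactor, reusing the identifiers from the diff (the handleUnhealthyDevice name is hypothetical):

// Hypothetical helper: once the per-event logic lives in its own function,
// release() can be deferred and runs on every return path, including error
// paths added later.
func (d *driver) handleUnhealthyDevice(ctx context.Context, device *AllocatableDevice) error {
	release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
	if err != nil {
		return fmt.Errorf("acquiring prep/unprep lock: %w", err)
	}
	defer release()

	// ... update the device health status and republish the ResourceSlice ...
	return nil
}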
&cli.StringFlag{
	Name: "additional-xids-to-ignore",
	Usage: "A comma-separated list of additional XIDs to ignore.",
	Value: "",
	Destination: &flags.additionalXidsToIgnore,
	EnvVars: []string{"ADDITIONAL_XIDs_TO_IGNORE"},
},
In NVIDIA/k8s-device-plugin#1443 we added a list of EXPLICIT XIDs to consider fatal. This allows a user to:
- Specify ignored XIDs (including all)
- Specify SPECIFIC XIDs that are considered fatal (including all)
The important thing here is that it allows users to override the list of hard-coded XIDs that we currently track.
I am aware of this and was planning to do this as a follow-up as it recently got merged.
}
m.eventSet = eventSet

m.uuidToDeviceMap = getUUIDToDeviceMap(allocatable)
Question: Does allocatable change at all, or is it constant for the lifetime of the plugin?
We do update the health status of an allocatable device when we get an unhealthy notification.
Sorry, that wasn't clear. Does the content of allocatable change in any way that would invalidate the map that we construct here, meaning that it would need to be reconstructed?
No, the only changing content of an allocatable device is its health status, and that won't impact it. This map ([uuid] = device) is constructed at the very beginning when the plugin is started, and we iterate on it to register events on each device (currently, we don't check any status here; we may do it in the future for remediation) and to send unhealthy notifications.
cmd/gpu-kubelet-plugin/driver.go (Outdated)

continue
}

release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
Oh! Acquiring this lock here is a big decision.
Here, I really expect a concise / precise code comment explaining convincingly:
- why this lock must be acquired
- how we guarantee that release() is always called
Maybe start by explaining what you think will go wrong when we do not acquire this lock here at this point.
This lock will prevent an unhealthy device from being allocated in a simultaneous NodePrepare() call.
I need to double-check that the lock is released on any failure.
We should be mindful of acquiring this lock. Let's do it only for a strong reason. When we introduced that lock we named it prepare/unprepare lock because it's meant for that purpose.
Maybe we should use it here, too -- but let's pretty please thoroughly identify that strong reason, and put it into a few English sentences that are convincing.
I am not yet satisfied here by our arguments. We need to discuss the alternatives considered; I need more help please to understand why this is the correct approach (I really mean that -- it's not that I know what we should do -- but I sense that we don't, as a collective, understand yet what we really want to do here).
I am thinking about the reason that you vaguely describe:
This lock will prevent unhealthy device to be allocated in a simultaneous NodePrepare call().
And I wonder: will it?
Can you describe a sequence of events where acquiring the lock would actually make that incoming nodePrepareResources() call not allocate an unhealthy device?
Here, we only update the ResourceSlice to un-announce any unhealthy device, right?
The moment we're done with that, we release the lock and the unchanged nodePrepareResources() call (that was waiting for us, hanging in lock acquisition) proceeds, trying to get what it wanted to get anyway.
When the kubelet already wants us to allocate an unhealthy device, then updating the ResourceSlice won't undo that (that gun is "fired" so to say), at least not reliably. Is that correct?
Let's say kubelet sends a PrepareResourceClaims() call our way and that may potentially result in allocating an unhealthy device (because the unhealthy notification arrived very recently).
Then I believe if we want a safe method to prevent that from happening we need to have logic within nodePrepareResource(). Does that make sense? (I quite literally don't know yet, you all have thought more about this than I did).
Specifically, I am wondering: do we need to call this new IsHealthy() somewhere in
func (s *DeviceState) prepareDevices(ctx context.Context, claim *resourceapi.ResourceClaim) (PreparedDevices, error) {
We can always actively fail a prepareDevices() if the only matching device turns out to be unhealthy right before we would have allocated it.
Maybe this is already similar?
device, exists := s.allocatable[result.Device]
if !exists {
	return nil, fmt.Errorf("requested device is not allocatable: %v", result.Device)
}
device, exists := s.allocatable[result.Device]
I might be completely off in what I say here. The point is that I need to convince myself that really we know what we're doing here and that we only acquire the PU lock if we absolutely have to -- because it has potentially devastating downsides to do it unnecessarily often.
This discussion is really important to align on, and I'd love for you all to help me understand why what we're doing here is the right thing.
Good point. There has to be a check in prepareDevices() to actually fail the allocation for an unhealthy device. But we still need the lock, as there are two things happening on an unhealthy event:
- The device health status is updated: d.state.UpdateDeviceHealthStatus(device, Unhealthy) -- this update is important for the above check to be reliable.
- The ResourceSlice is republished: if err := d.pluginhelper.PublishResources(ctx, resources); err != nil -- for future allocations.
Acquiring the NodePrepare lock will make sure we are not simultaneously updating the health status and allocating the same device.
I have updated device_state.go to check the device health status before proceeding with the allocation.
if featuregates.Enabled(featuregates.DeviceHealthCheck) {
	if device.Health == Unhealthy {
		return nil, fmt.Errorf("requested device is not healthy: %v", result.Device)
	}
}
Acquiring the NodePrepare lock will make sure we are not simultaneously updating the health status and allocating the same device.
But, does it? I might just not see it -- which sequence of events did you imagine, maybe?
Let me try to re-frame the problem space that I believe you are thinking about when you say "simultaneously not updating the health status and also allocating the same device".
You want to make sure that we don't allocate a device that's knowingly unhealthy. Does that sound about right?
I would agree -- let's try to do that :)
What do you think about the following mental model?
- Let's agree that this is generally a best-effort problem space -- between the device becoming unhealthy and us knowing there is unpredictable amount of time passing.
- Let's agree that this is an event propagation problem space.
I think we can also say:
- Deep within func (s *DeviceState) Prepare() we must know, as early as possible, when a device is unhealthy. That's our final event consumer.
- Event producer and event propagation pipeline are orthogonal to that.
The best we can do here is that we perform event propagation at minimal latency, towards the consumer.
Detour on latency
Because I find it interesting.
Between a GPU actually becoming unhealthy and us calling UpdateDeviceHealthStatus() as little time as possible should pass.
That is something that we can do and should do in this PR: minimize the fraction of event propagation latency that we control here.
Zooming out, this is always going to be a best effort strategy. Right now we seem to subscribe to GPU events more or less directly (but already now the event propagates through layers: there's the physical device, there's NVML, and then there's our process, and other layers that I am not even aware of). In the future, with NVSentinel, we're talking about event propagation across even more components.
Generally, there's a timeline attached that is unpredictable and we want to make sure we minimize latency at all steps. Here is one way to maybe think about that timeline:
T_1) a GPU actually becoming unhealthy (the 'physical' event)
T_2) us detecting it in component A
T_3) emitting an event in component A towards component B
T_4) potential-black-box-event-propagation -> after all emitted towards our GPU kubelet plugin
T_5) responding to that incoming event in our GPU kubelet plugin
Let's agree on the following: there's always a chance that we call func (s *DeviceState) Prepare() for a device after T_1 and before T_5.
We may just want to make sure we respond to the unhealthy event in the moment we receive it. We want to make sure we propagate to all its consumers ASAP.
That propagation itself does not need to be lock-protected; it just must happen fast.
Misc
Tangential, but potentially a helpful perspective: the "protected" data structure here (for now) is just device.Health and it's only mutated from within UpdateDeviceHealthStatus().
Then, also tangential, I notice that there already is a bit of synchronization in your current patch proposal: UpdateDeviceHealthStatus() acquires the mutex on the DeviceState singleton:
func (s *DeviceState) UpdateDeviceHealthStatus(device *AllocatableDevice, hs HealthStatus) {
	s.Lock()
	defer s.Unlock()
	...
We acquire the same mutex in func (s *DeviceState) Prepare().
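Putting those two observations together, a simplified sketch (abbreviated; not the exact code in this PR) of how the shared mutex already serializes the writer against the reader:

// Writer: the unhealthy-event path updates the status under the DeviceState mutex.
func (s *DeviceState) UpdateDeviceHealthStatus(device *AllocatableDevice, hs HealthStatus) {
	s.Lock()
	defer s.Unlock()
	device.Health = hs
}

// Reader: Prepare() takes the same mutex, so it either observes the update or
// runs entirely before it -- the prepare/unprepare flock is not needed for this.
func (s *DeviceState) Prepare(ctx context.Context, claim *resourceapi.ResourceClaim) (PreparedDevices, error) {
	s.Lock()
	defer s.Unlock()
	// ... resolve the allocated device, then fail if device.Health == Unhealthy ...
	return nil, nil
}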
This is very descriptive. Thank you for the effort, JP. Yes, this is the flow of events I imagined when I was convinced that we needed to have the lock when updating the device health status and republishing the ResourceSlice:
unhealthy event -> lock to prevent any other operation on the device (mark device unhealthy + republish ResourceSlice) -> unlock -> device unhealthy
But as you pointed out, I already acquire the lock when updating the device status, so the above lock is not really needed for the republish, and this is all best effort anyway in avoiding a potential race from T_1 to T_5.
cmd/gpu-kubelet-plugin/driver.go (Outdated)

release, err := d.pulock.Acquire(ctx, flock.WithTimeout(10*time.Second))
if err != nil {
	klog.Errorf("error acquiring prep/unprep lock for health status update: %v", err)
	continue
As far as I understand, with continue we abort processing this event.
As a code reader, I want to get help here from a code comment -- to understand why it is okay to abort processing that. Will we re-process it later? How much later?
defer s.Unlock()

device.Health = healthstatus
klog.Infof("Update device sattus:%s healthstatus", device.UUID())
typos, spaces etc.
Can you please explain / give an impression of how often this message would be logged? What actions/events will trigger this message to be logged?
When we understand that, let's have a brief think about a suitable verbosity level.
One of the questions that I have here: is this only logged for a health flip? Or could this also be logged for healthy->healthy? Should we have a different message/level depending on the state transition?
Right now, it's only logged when healthy becomes unhealthy. But in the future, when we have remediation, it will also change to unhealthy->healthy.
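If we only want to log actual transitions, and pick a log level per direction, a hypothetical variant of UpdateDeviceHealthStatus could look like this:

// Hypothetical transition-aware variant; only state changes are logged.
func (s *DeviceState) UpdateDeviceHealthStatus(device *AllocatableDevice, hs HealthStatus) {
	s.Lock()
	defer s.Unlock()
	if device.Health == hs {
		return // healthy->healthy or unhealthy->unhealthy: nothing to log
	}
	device.Health = hs
	if hs == Unhealthy {
		klog.Warningf("Device %s transitioned to unhealthy", device.UUID())
	} else {
		klog.Infof("Device %s transitioned to healthy", device.UUID())
	}
}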
{
	Default: false,
	PreRelease: featuregate.Alpha,
	Version: version.MajorMinor(25, 8),
I really don't know -- why is 25, 8 what we want here?
This should be (25, 12), based on the December release?
cmd/gpu-kubelet-plugin/driver.go (Outdated)

case <-ctx.Done():
	klog.V(6).Info("Stop processing device health notifications")
	return
case device, ok := <-d.deviceHealthMonitor.Unhealthy():
What do you think about wrapping the logic in this case in a function and then calling that function here? I think that would greatly help readability. It may also help with calling release() more reliably (if needed -- see the discussion here).
Thanks to you for the review, @elezar. Thanks to @jgehrcke also. The PR looks more lively :-D
Signed-off-by: Swati Gupta <[email protected]>
Force-pushed 43fcbc8 to ae7211e
@elezar @jgehrcke I incorporated some of the suggestions, and left out some that I didn't think were critical. Please review again and let me know if something is blocking; I will come back to it. #689 (comment)
Thanks!
Let's for now assume that some things are blocking. Let us please carefully and mindfully drive each discussion thread to completion. I would really appreciate if you can drive that and make it easy for all of us by going through the questions with point-by-point replies. 📜 This is quite a bit of communication effort, but it will absolutely pay off when bringing the new code into production environments.
Signed-off-by: Swati Gupta <[email protected]>
| klog.V(6).Info("Stopping event-driven GPU health monitor...") | ||
| return | ||
| default: | ||
| event, ret := m.eventSet.Wait(5000) |
I needed to look this up to understand what it's doing.
Ref docs are here: https://docs.nvidia.com/deploy/nvml-api/group__nvmlEvents.html#group__nvmlEvents_1g9714b0ca9a34c7a7780f87fee16b205c
The argument is the timeout in ms.
Errors to be handled:
	continue
}
if ret != nvml.SUCCESS {
	klog.Infof("Error waiting for event: %v; Marking all devices as unhealthy", ret)
This does not seem right.
Let us specifically handle the NVML_ERROR_GPU_IS_LOST case, and perform this 'nuke option' only then. Then we can also have a more precise log message (emitted on error level).
Why only NVML_ERROR_GPU_IS_LOST? If we are not just checking ret != NVML_SUCCESS, then TIMEOUT and UNKNOWN also seem important.
If you look at the returns of all the event methods, these all seem to be common error types. I can write a helper for the error type.
Something like https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/return.go#L42 -- but then do we mark the device unhealthy in each case?
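A hypothetical helper along those lines could classify the Wait() return codes explicitly; which codes belong in which bucket is exactly the open question:

// isFatalEventWaitError is a hypothetical classifier for nvml.EventSet.Wait
// return codes. Timeout just means "no event in this interval"; the others are
// candidates for marking devices unhealthy.
func isFatalEventWaitError(ret nvml.Return) bool {
	switch ret {
	case nvml.SUCCESS, nvml.ERROR_TIMEOUT:
		return false
	case nvml.ERROR_GPU_IS_LOST, nvml.ERROR_UNKNOWN:
		return true
	default:
		// Conservative default, equivalent to today's ret != nvml.SUCCESS check.
		return true
	}
}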
While this will allow for proper error handling, it will also deviate from how it's done in the device plugin. I can take it up as a follow-up to fix in both.
@jgehrcke, thoughts?
Pinging all the reviewers to help resolve this. Most others seem to be non-blocking.
this will also deviate from how its done in device-plugin.
Let's get this PR into the state we want it for the DRA driver without trying to match the device plugin implementation exactly. That is to say, consider this the next iteration of the device plugin implementation. The learnings that we take from this should be applied to the device plugin and iterated on from there.
One motivation is that this is a new feature and we don't have users currently expecting a certain behaviour. This gives us a lot more flexibility to change behaviour than if we had an existing implementation in use.
This is an implementation detail of how we want to handle NVML return errors, so it won't impact the end user whether we do it the way the device plugin does or improve it here. For the user, all these errors mean an unhealthy device that won't be published as part of the ResourceSlice. And IMO, ret != nvml.SUCCESS is a valid check and covers a wide range of errors, as all subsequent calls are dependent on this check.
For me, distinguishing these errors makes sense once we have proper remediation in place, where the right action can be recommended based on the given error.
	continue
}

klog.Infof("Processing event %+v", event)
Do you have an example for this log message, how it would look like in practice?
I1027 18:03:39.337167 1 device_health.go:179] Processing event {Device:{Handle:0xe151b6b2fef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Signed-off-by: Swati Gupta <[email protected]>
nvmlDeviceHealthMonitor, err := newNvmlDeviceHealthMonitor(ctx, config, state.allocatable, state.nvdevlib)
if err != nil {
	return nil, fmt.Errorf("start nvmlDeviceHealthMonitor: %w", err)
}

driver.nvmlDeviceHealthMonitor = nvmlDeviceHealthMonitor

driver.wg.Add(1)
go func() {
	defer driver.wg.Done()
	driver.deviceHealthEvents(ctx, config.flags.nodeName)
}()
I think splitting creation from starting the health monitor makes sense.
This would lend itself to the following interface for a generic DeviceHealthMonitor:
type DeviceHealthMonitor interface {
	Start(context.Context) error
	Stop() error
	Unhealthy() <-chan *AllocatableDevice
}
And we can update the implementation here to:
deviceHealthMonitor, err := newNvmlDeviceHealthMonitor(config, state.allocatable, state.nvdevlib)
if err != nil {
	return nil, fmt.Errorf("failed to create device health monitor: %w", err)
}
if err := deviceHealthMonitor.Start(ctx); err != nil {
	return nil, fmt.Errorf("failed to start device health monitor: %w", err)
}
driver.deviceHealthMonitor = deviceHealthMonitor
driver.wg.Add(1)
go func() {
	defer driver.wg.Done()
	driver.deviceHealthEvents(ctx, config.flags.nodeName)
}()
Note that I have dropped the nvml prefix from everything except the function creating the monitor. This minimizes the changes when we add additional monitors.
We could even go so far as to ALSO rename newNvmlDeviceHealthMonitor to NewDeviceHealthMonitor (possibly with functional options) so that this code is already in the desired state (modulo additional constructor arguments).
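For illustration only, a rough sketch of the functional-options shape (the Option and WithAdditionalXidsToIgnore names, and the skippedXids field, are assumptions, not existing code):

// Hypothetical functional options for a generic constructor.
type Option func(*nvmlDeviceHealthMonitor)

// WithAdditionalXidsToIgnore would replace reading config.flags.additionalXidsToIgnore directly.
func WithAdditionalXidsToIgnore(xids string) Option {
	return func(m *nvmlDeviceHealthMonitor) {
		m.skippedXids = xidsToSkip(xids)
	}
}

func NewDeviceHealthMonitor(allocatable AllocatableDevices, nvdevlib *deviceLib, opts ...Option) (DeviceHealthMonitor, error) {
	m := &nvmlDeviceHealthMonitor{
		nvmllib:   nvdevlib.nvmllib,
		unhealthy: make(chan *AllocatableDevice, len(allocatable)),
	}
	for _, opt := range opts {
		opt(m)
	}
	return m, nil
}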
state                   *DeviceState
pulock                  *flock.Flock
healthcheck             *healthcheck
nvmlDeviceHealthMonitor *nvmlDeviceHealthMonitor
Since we know that we will add new health monitors, let's rename the member to be more generic.
deviceHealthMonitor *nvmlDeviceHealthMonitor
func newNvmlDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*nvmlDeviceHealthMonitor, error) {
	if nvdevlib.nvmllib == nil {
		return nil, fmt.Errorf("nvml library is nil")
	}

	ctx, cancel := context.WithCancel(ctx)

	m := &nvmlDeviceHealthMonitor{
		nvmllib: nvdevlib.nvmllib,
		unhealthy: make(chan *AllocatableDevice, len(allocatable)),
		cancelContext: cancel,
	}

	if ret := m.nvmllib.Init(); ret != nvml.SUCCESS {
		cancel()
		return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
	}

	klog.V(6).Info("creating NVML events for device health monitor")
	eventSet, ret := m.nvmllib.EventSetCreate()
	if ret != nvml.SUCCESS {
		_ = m.nvmllib.Shutdown()
		cancel()
		return nil, fmt.Errorf("failed to create event set: %w", ret)
	}
	m.eventSet = eventSet

	m.uuidToDeviceMap = getUUIDToDeviceMap(allocatable)

	m.getDeviceByParentGiCiMap = getDeviceByParentGiCiMap(allocatable)

	klog.V(6).Info("registering NVML events for device health monitor")
	m.registerEventsForDevices()

	skippedXids := m.xidsToSkip(config.flags.additionalXidsToIgnore)
	klog.V(6).Info("started device health monitoring")
	m.wg.Add(1)
	go m.run(ctx, skippedXids)

	return m, nil
}
As mentioned at the call site, I think it simplifies the implementation if we split the construction of a monitor from actually starting it. What about updating this to:
func newNvmlDeviceHealthMonitor(config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*nvmlDeviceHealthMonitor, error) {
	if nvdevlib.nvmllib == nil {
		return nil, fmt.Errorf("nvml library is nil")
	}
	if ret := nvdevlib.nvmllib.Init(); ret != nvml.SUCCESS {
		return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	defer func() {
		_ = nvdevlib.nvmllib.Shutdown()
	}()
	m := &nvmlDeviceHealthMonitor{
		nvmllib: nvdevlib.nvmllib,
		unhealthy: make(chan *AllocatableDevice, len(allocatable)),
		uuidToDeviceMap: getUUIDToDeviceMap(allocatable),
		getDeviceByParentGiCiMap: getDeviceByParentGiCiMap(allocatable),
		skippedXids: xidsToSkip(config.flags.additionalXidsToIgnore),
	}
	return m, nil
}

func (m *nvmlDeviceHealthMonitor) Start(ctx context.Context) (rerr error) {
	if ret := m.nvmllib.Init(); ret != nvml.SUCCESS {
		return fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	// We shutdown nvml if this function returns with an error.
	defer func() {
		if rerr != nil {
			_ = m.nvmllib.Shutdown()
		}
	}()
	klog.V(6).Info("creating NVML events for device health monitor")
	eventSet, ret := m.nvmllib.EventSetCreate()
	if ret != nvml.SUCCESS {
		return fmt.Errorf("failed to create event set: %w", ret)
	}
	ctx, cancel := context.WithCancel(ctx)
	m.cancelContext = cancel
	m.eventSet = eventSet
	klog.V(6).Info("registering NVML events for device health monitor")
	m.registerEventsForDevices()
	klog.V(6).Info("started device health monitoring")
	m.wg.Add(1)
	go m.run(ctx, m.skippedXids)
	return nil
}
Note that we now cleanly separate nvml errors that occur during setup and those that we have to handle while waiting for events.
unhealthy                chan *AllocatableDevice
cancelContext            context.CancelFunc
uuidToDeviceMap          map[string]*AllocatableDevice
getDeviceByParentGiCiMap map[string]map[uint32]map[uint32]*AllocatableDevice
Let's rename this member since it's not a function. I would even go so far as to add a type:
type placementToAllocatableDeviceMap map[string]map[uint32]map[uint32]*AllocatableDevice
that we can attach functions to (get(string, uint32, uint32), update(string, uint32, uint32, *AllocatableDevice)) to simplify the implementations below.
This would mean that we update the member definition to something like:
deviceByParentGiCiMap placementToAllocatableDeviceMap
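The corresponding get accessor could then be a one-liner (a hypothetical sketch; indexing missing keys in nested Go maps safely returns the zero value):

// get returns nil when any level of the placement map is missing.
func (p placementToAllocatableDeviceMap) get(uuid string, gi, ci uint32) *AllocatableDevice {
	return p[uuid][gi][ci]
}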
	wg sync.WaitGroup
}

func newNvmlDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*nvmlDeviceHealthMonitor, error) {
Related to my other comment(s), what about introducing a top-level factory method where we can add additional construction logic? Something like:
func NewDeviceHealthMonitor(config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (DeviceHealthMonitor, error) {
	return newNvmlDeviceHealthMonitor(config, allocatable, nvdevlib)
}
This may not look too important, but we could even add logic to instantiate a mock monitor based on an envvar (or move the feature flag logic from driver.go here and return a NULL (no-op) handler in the case where the feature is not enabled). This has the advantage of simplifying the call site.
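For instance, a no-op implementation of the DeviceHealthMonitor interface sketched above could serve as that NULL handler (a hypothetical sketch, not part of the PR):

// nullDeviceHealthMonitor never reports anything; it could be returned by the
// factory when the DeviceHealthCheck feature gate is disabled.
type nullDeviceHealthMonitor struct{}

func (nullDeviceHealthMonitor) Start(ctx context.Context) error { return nil }
func (nullDeviceHealthMonitor) Stop() error                     { return nil }
func (nullDeviceHealthMonitor) Unhealthy() <-chan *AllocatableDevice {
	// A nil channel blocks forever, so a select on it simply never fires.
	return nil
}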
eventSet nvml.EventSet
unhealthy chan *AllocatableDevice
cancelContext context.CancelFunc
uuidToDeviceMap map[string]*AllocatableDevice
Do we need two maps? Are the entries in the more complete map below not a subset of this?
m.wg.Wait()

_ = m.eventSet.Free()
I know this is probably like this in the device plugin too, but should we at least log an error here?
	continue
}

if event.EventType != nvml.EventTypeXidCriticalError {
(also for the device plugin) Can we track the follow-up action of checking why we don't check other supported types? Do we have any indication of whether we ever see the log message below?
var affectedDevice *AllocatableDevice
pMap, ok1 := m.getDeviceByParentGiCiMap[eventUUID]
if ok1 {
	giMap, ok2 := pMap[event.GpuInstanceId]
	if ok2 {
		affectedDevice = giMap[event.ComputeInstanceId]
	}
}
Assuming we define a type for this map, we could simplify this as:
affectedDevice := m.getDeviceByParentGiCiMap.get(
	eventUUID,
	event.GpuInstanceId,
	event.ComputeInstanceId,
)
alternatively getting an element from an initialized map is "safe", so the following could also work:
affectedDevice := m.getDeviceByParentGiCiMap[eventUUID][event.GpuInstanceId][event.ComputeInstanceId]
if _, ok := deviceByParentGiCiMap[parentUUID]; !ok {
	deviceByParentGiCiMap[parentUUID] = make(map[uint32]map[uint32]*AllocatableDevice)
}
if _, ok := deviceByParentGiCiMap[parentUUID][giID]; !ok {
	deviceByParentGiCiMap[parentUUID][giID] = make(map[uint32]*AllocatableDevice)
}
deviceByParentGiCiMap[parentUUID][giID][ciID] = d
Assuming we define a type for this map we could factor this into:
func (p placementToAllocatableDeviceMap) put(uuid string, gi uint32, ci uint32, d *AllocatableDevice) {
	if _, ok := p[uuid]; !ok {
		p[uuid] = make(map[uint32]map[uint32]*AllocatableDevice)
	}
	if _, ok := p[uuid][gi]; !ok {
		p[uuid][gi] = make(map[uint32]*AllocatableDevice)
	}
	p[uuid][gi][ci] = d
}
and then replace the implementation here with:
deviceByParentGiCiMap.put(parentUUID, giID, ciID, d)
for _, d := range allocatable {
	var parentUUID string
	var giID, ciID uint32

	switch d.Type() {
	case GpuDeviceType:
		parentUUID = d.UUID()
		if parentUUID == "" {
			continue
		}
		giID = FullGPUInstanceID
		ciID = FullGPUInstanceID
	case MigDeviceType:
		parentUUID = d.Mig.parent.UUID
		if parentUUID == "" {
			continue
		}
		giID = d.Mig.giInfo.Id
		ciID = d.Mig.ciInfo.Id
We could also move the put function below to the individual case statements:
for _, d := range allocatable {
	switch d.Type() {
	case GpuDeviceType:
		uuid := d.UUID()
		if uuid == "" {
			continue
		}
		deviceByParentGiCiMap.put(uuid, FullGPUInstanceID, FullGPUInstanceID, d)
	case MigDeviceType:
		uuid := d.Mig.parent.UUID
		if uuid == "" {
			continue
		}
		deviceByParentGiCiMap.put(uuid, d.Mig.giInfo.Id, d.Mig.ciInfo.Id, d)
(we could even rename the put to something more meaningful and add the uuid == "" check there)
// Add the list of hardcoded disabled (ignored) XIDs:
// http://docs.nvidia.com/deploy/xid-errors/index.html#topic_4
// Application errors: the GPU should still be healthy.
ignoredXids := []uint64{
	13,  // Graphics Engine Exception
	31,  // GPU memory page fault
	43,  // GPU stopped processing
	45,  // Preemptive cleanup, due to previous errors
	68,  // Video processor exception
	109, // Context Switch Timeout Error
}
I know that this list is taken from the device plugin, but handling these explicitly at such a low level is quite difficult to customize. Does it make sense not to port this logic over, and instead define these as the default value for the envvar where we expose this to the user?
(I'm happy to do this as a follow-up though.)
Just as a note for completeness: the GKE device plugin doesn't use this strategy for XIDs. The only element in their list is XID 48 (see https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/0509b1f9f4b9a357b44ba65e7b508ded8bd5ecf0/pkg/gpu/nvidia/health_check/health_checker.go#L59).
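For illustration, exposing the currently hard-coded list as the flag's default could look roughly like this (hypothetical wiring; the existing flag and envvar names are reused, only the default Value changes):

// Sketch: make today's hard-coded application-error XIDs the default value of
// the flag, so users can override the whole list.
&cli.StringFlag{
	Name:        "additional-xids-to-ignore",
	Usage:       "A comma-separated list of XIDs to ignore.",
	Value:       "13,31,43,45,68,109", // current hard-coded defaults
	Destination: &flags.additionalXidsToIgnore,
	EnvVars:     []string{"ADDITIONAL_XIDs_TO_IGNORE"},
},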
Addressing #360 to add a preliminary health check similar to https://github.com/NVIDIA/k8s-device-plugin.
Test logs:
TLDR: this device had an event and is not added: MIG-4d806f22-346a-5a1d-ac01-86b505cdf485. The device is picked back up when the driver is restarted.