Pci device plugin fix orphan goroutine and remove unused channel #67
I traced the original The heartBeat will trigger KubeVirt to refresh the permitted devices. In device_plugin part, it uses The structure is one
@Yu-Jack, thanks for your explanation. IIUC, in the current implementation IMHO, since this PR doesn't introduce any structural or control flow change, I'd like to remove the unused code path. WDYT? cc @bk201 @ibrokethecloud, thanks.
Just leave a link to the original comment #66 (comment) if we don't want to remove the existing code.
Yu-Jack
left a comment
I prefer not to delete the stop channel. It is a control flag for the device plugin. For example, if we remove it, the health check function can't return. Instead, we can remove the done channel in this case.
Looks like one channel is enough. We could use either done or stop to trigger cleanup, and that channel should be used in the health check. This is legacy stuff which we copied over when we tried to set up the device plugins. Thanks for the PR @WebberHuang1118 to help clean this up.
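A minimal sketch of the "one channel is enough" idea, under stated assumptions: the `plugin` type and its `start`, `stop`, and `stopped` methods are hypothetical names for illustration, not the project's actual device plugin API. A single `done` channel is closed once to trigger cleanup, and the background worker watches it in its select loop.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// plugin is a hypothetical stand-in for a device plugin; it is not the
// project's actual type.
type plugin struct {
	done     chan struct{}
	stopOnce sync.Once
	wg       sync.WaitGroup
}

func newPlugin() *plugin {
	return &plugin{done: make(chan struct{})}
}

// start launches a background worker that exits when done is closed.
func (p *plugin) start() {
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		ticker := time.NewTicker(10 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-p.done:
				return
			case <-ticker.C:
				// Periodic work (for example a health probe) would go here.
			}
		}
	}()
}

// stop closes done exactly once and waits for the worker to exit.
func (p *plugin) stop() {
	p.stopOnce.Do(func() { close(p.done) })
	p.wg.Wait()
}

// stopped reports whether done has been closed.
func (p *plugin) stopped() bool {
	select {
	case <-p.done:
		return true
	default:
		return false
	}
}

func main() {
	p := newPlugin()
	p.start()
	p.stop()
	p.stop() // Safe: the close is guarded by sync.Once.
	fmt.Println("stopped:", p.stopped())
}
```

Closing a channel wakes every receiver at once, so one channel can signal both the serving loop and the health check; `sync.Once` guards against a double close, which would otherwise panic.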
Signed-off-by: Webber Huang <[email protected]>
Force-pushed from 10da8fc to e414b9b
Pull request overview
This pull request fixes an orphan goroutine issue in the PCI device plugin lifecycle management. Previously, when a PCI device plugin was disabled, a goroutine would remain blocked on an unused stop channel that was never closed, causing a goroutine leak.
Changes:
- Removed the unused `stop` channel parameter from device plugin lifecycle methods
- Removed blocking on the `stop` channel in the device plugin startup goroutine
- Cleaned up import ordering in vgpu_controller.go
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/deviceplugins/device_manager.go | Removed stop channel field and updated Start() and SetStarted() method signatures to not accept stop channel parameter; removed dp.stop case from ListAndWatch select statement |
| pkg/controller/pcideviceclaim/pcideviceclaim_controller.go | Removed creation of unused stop channel and blocking <-stop statement that caused orphan goroutine |
| pkg/controller/gpudevice/vgpu_controller.go | Reordered imports to follow standard grouping convention |
Comments suppressed due to low confidence (1)
pkg/deviceplugins/device_manager.go:372
- The healthCheck() method is missing a case for the `dp.done` channel in its select statement. When stopDevicePlugin() closes `dp.done`, the healthCheck goroutine will not exit properly, which can lead to goroutine leaks. This is critical because healthCheck is spawned in ListAndWatch() and needs to terminate when the device plugin stops.
  The select statement should include a case for <-dp.done that returns nil, similar to how VGPUDevicePlugin.performCheck() handles both dp.stop and dp.done in vgpu_device_manager.go:324-325.
    for {
        select {
        case err := <-watcher.Errors:
            logger.Reason(err).Errorf("error watching devices and device plugin directory")
        case event := <-watcher.Events:
            logger.V(4).Infof("health Event: %v", event)
            if monDevID, exist := monitoredDevices[event.Name]; exist {
                // Health in this case is if the device path actually exists
                if event.Op == fsnotify.Create {
                    logger.Infof("monitored device %s appeared", dp.resourceName)
                    dp.health <- deviceHealth{
                        DevID:  monDevID,
                        Health: pluginapi.Healthy,
                    }
                } else if (event.Op == fsnotify.Remove) || (event.Op == fsnotify.Rename) {
                    logger.Infof("monitored device %s disappeared", dp.resourceName)
                    dp.health <- deviceHealth{
                        DevID:  monDevID,
                        Health: pluginapi.Unhealthy,
                    }
                }
            } else if event.Name == dp.socketPath && event.Op == fsnotify.Remove {
                logger.Infof("device socket file for device %s was removed, kubelet probably restarted.", dp.resourceName)
                return nil
            }
        }
    }
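The suggestion above could look roughly like the following sketch. It is not the project's code: `event` and `healthLoop` are hypothetical names, plain channels stand in for the fsnotify watcher, and the function returns a reason string (instead of nil) only so the exit path is visible.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a file system watcher event.
type event struct {
	name    string
	removed bool
}

// healthLoop mirrors the shape of the loop above, using plain channels.
// It returns a short reason string so the caller can see why it exited.
func healthLoop(events <-chan event, errs <-chan error, done <-chan struct{}, socketPath string) string {
	for {
		select {
		case <-done:
			// Assumed addition: exit as soon as the plugin is stopped.
			return "done closed"
		case err := <-errs:
			fmt.Println("watch error:", err)
		case ev := <-events:
			if ev.name == socketPath && ev.removed {
				return "socket removed"
			}
		}
	}
}

func main() {
	done := make(chan struct{})
	close(done) // Simulates the plugin being stopped.

	// Nil channels never become ready, so only the done case can fire.
	fmt.Println(healthLoop(nil, nil, done, "/tmp/plugin.sock"))
}
```

Because a receive from a closed channel never blocks, the `done` case fires on the next pass through the select once the plugin is stopped, even if no watcher event ever arrives.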
Signed-off-by: Webber Huang <[email protected]>
Problem:
After a PCI device plugin is disabled, a goroutine (started when the device plugin was started) remains blocked on chan stop. Because chan stop is never closed or written to, every enable-then-disable cycle of a PCI device leaves an orphan goroutine behind.
Solution:
Fix the orphan goroutine in the PCI device plugin and remove the unused channel.
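To illustrate the problem and the fix in isolation, here is a small sketch. All names (`startLeaky`, `startClean`, `goroutinesAfter`) are hypothetical and unrelated to the project's code; it only assumes that a goroutine blocked on a channel nobody closes stays alive for the life of the process.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// startLeaky mimics the problem: the goroutine waits on a stop channel
// that nothing ever closes or sends to.
func startLeaky() {
	stop := make(chan struct{})
	go func() {
		<-stop // Blocks forever.
	}()
}

// startClean mimics the fix: the goroutine does its startup work and
// returns, with no wait on an unused channel.
func startClean() {
	go func() {
		// Startup work would go here.
	}()
}

// goroutinesAfter calls f n times, gives the goroutines time to finish,
// and returns the growth in the live goroutine count.
func goroutinesAfter(n int, f func()) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		f()
	}
	time.Sleep(50 * time.Millisecond)
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("leaked with unused stop channel:", goroutinesAfter(5, startLeaky))
	fmt.Println("leaked after removing the wait:", goroutinesAfter(5, startClean))
}
```

Each call to `startLeaky` corresponds to one enable-then-disable cycle in the description above, so the goroutine count grows with the number of cycles.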
Related Issue:
harvester/harvester#5179
Additional Context:
The CodeFactor `Complex Method` error could be fixed once PR #66 is merged.