fix: prevent panic in device manager when allocating devices in parallel #136022
base: master
Conversation
Adding the do-not-merge/release-note-label-needed label because no release-note block was detected. Please follow our release note process to remove it.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Welcome @ERJavier!
Hi @ERJavier. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ERJavier. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Fix race condition in Allocate() where m.devicesToReuse[podUID] could be deleted by another goroutine while still being accessed after the RPC call.

The issue occurs when:
- Pod1 starts allocation and sets m.devicesToReuse[pod1.uid]
- Pod1 makes an RPC call (which can take time)
- Pod2 starts allocation and deletes m.devicesToReuse[pod1.uid]
- Pod1 tries to access m.devicesToReuse[pod1.uid] after the RPC, causing a panic

The fix saves a local reference to devicesToReuse before the RPC call and uses it throughout the function, preventing the panic. Added TestDeviceAllocateParallel to reproduce and verify the fix.
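A minimal, self-contained sketch of the pattern the commit message describes, using simplified stand-in types rather than the real ManagerImpl, podDevices, or device plugin RPC (all names below are illustrative assumptions, not the actual kubelet code):

```go
// Minimal sketch of the fix pattern, not the actual kubelet code: take a
// local reference to the per-pod reuse map before the slow "RPC" and keep
// using it, even if a parallel allocation deletes the outer map entry.
package main

import (
	"fmt"
	"sync"
	"time"
)

type manager struct {
	mu             sync.Mutex
	devicesToReuse map[string]map[string][]string // podUID -> resource -> device IDs
}

func (m *manager) allocate(podUID string) {
	m.mu.Lock()
	if _, ok := m.devicesToReuse[podUID]; !ok {
		m.devicesToReuse[podUID] = make(map[string][]string)
	}
	// Entries belonging to other pods are dropped, which is what clobbers a
	// still-running allocation for that pod in the buggy version.
	for uid := range m.devicesToReuse {
		if uid != podUID {
			delete(m.devicesToReuse, uid)
		}
	}
	// The fix: capture the reference now instead of looking it up again later.
	reuse := m.devicesToReuse[podUID]
	m.mu.Unlock()

	time.Sleep(10 * time.Millisecond) // stands in for the device plugin RPC

	m.mu.Lock()
	defer m.mu.Unlock()
	// Safe even if m.devicesToReuse[podUID] was deleted while we were waiting.
	reuse["example.com/gpu"] = []string{"dev-0"}
}

func main() {
	m := &manager{devicesToReuse: map[string]map[string][]string{}}
	var wg sync.WaitGroup
	for _, uid := range []string{"pod1", "pod2"} {
		wg.Add(1)
		go func(uid string) {
			defer wg.Done()
			m.allocate(uid)
		}(uid)
	}
	wg.Wait()
	fmt.Println("parallel allocations completed without panic")
}
```

The key point is that a Go map value held in a local variable stays usable after its key is removed from the outer map; only a fresh lookup performed after the deletion can come back nil.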
Force-pushed from 59227c7 to 6d73c62 (compare)
```diff
-	m.devicesToReuse[string(pod.UID)] = make(map[string]sets.Set[string])
+	podUID := string(pod.UID)
+	if _, ok := m.devicesToReuse[podUID]; !ok {
+		m.devicesToReuse[podUID] = make(map[string]sets.Set[string])
+	}
```
This modification is unnecessary.
I apologize for my misunderstanding; I will do better next time.
What this PR does

Fixes a race condition in the device manager's Allocate() function that causes a panic when multiple pods allocate devices in parallel.

Problem

When allocating devices in parallel:
- Pod1 starts allocation, sets m.devicesToReuse[pod1.uid], and makes an RPC call
- Pod2 starts allocation and deletes m.devicesToReuse[pod1.uid] while Pod1's RPC is still running
- Pod1 accesses m.devicesToReuse[pod1.uid] after the RPC returns, causing a nil pointer panic
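For illustration, here is a small self-contained sketch of the failure mode in the list above, using simplified stand-in types and explicit per-pod goroutines rather than the actual Allocate() code; the exact panic site in the real device manager may differ, but a fresh lookup after the RPC can yield a nil inner map, and writing to it panics:

```go
// Sketch of the race, with simplified types and explicit pod goroutines;
// not the real Allocate() code. Pod1 looks devicesToReuse["pod1"] up again
// after its "RPC", but Pod2 has deleted that entry, so the lookup yields a
// nil inner map and the write panics ("assignment to entry in nil map").
package main

import (
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	devicesToReuse := map[string]map[string][]string{} // podUID -> resource -> device IDs

	var wg sync.WaitGroup
	wg.Add(2)

	// Pod1: create its entry, make the slow "RPC", then look the entry up again.
	go func() {
		defer wg.Done()
		mu.Lock()
		devicesToReuse["pod1"] = map[string][]string{}
		mu.Unlock()

		time.Sleep(20 * time.Millisecond) // stands in for the device plugin RPC

		mu.Lock()
		defer mu.Unlock()
		reuse := devicesToReuse["pod1"]              // nil: Pod2 removed the entry
		reuse["example.com/gpu"] = []string{"dev-0"} // panics on the nil map
	}()

	// Pod2: starts a little later and clears entries that belong to other pods,
	// while Pod1 is still inside its RPC.
	go func() {
		defer wg.Done()
		time.Sleep(5 * time.Millisecond)
		mu.Lock()
		defer mu.Unlock()
		delete(devicesToReuse, "pod1")
		devicesToReuse["pod2"] = map[string][]string{}
	}()

	wg.Wait()
}
```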
Solution

Save a local reference to devicesToReuse for the current pod before the RPC call and use it throughout the function, preventing the race condition.

Testing

- Added TestDeviceAllocateParallel test case that reproduces the issue
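The PR's actual TestDeviceAllocateParallel is not reproduced here; as a rough sketch under the assumption of a simplified in-memory manager (the fakeManager type and its allocate method are hypothetical stand-ins), a parallel-allocation regression test can take this shape:

```go
// Rough sketch of a parallel-allocation regression test; the fakeManager type
// and its allocate method are illustrative stand-ins, not the PR's actual
// TestDeviceAllocateParallel or the real ManagerImpl.
package main

import (
	"fmt"
	"sync"
	"testing"
	"time"
)

type fakeManager struct {
	mu             sync.Mutex
	devicesToReuse map[string]map[string][]string // podUID -> resource -> device IDs
}

func (m *fakeManager) allocate(podUID string) {
	m.mu.Lock()
	if _, ok := m.devicesToReuse[podUID]; !ok {
		m.devicesToReuse[podUID] = make(map[string][]string)
	}
	for uid := range m.devicesToReuse {
		if uid != podUID {
			delete(m.devicesToReuse, uid)
		}
	}
	reuse := m.devicesToReuse[podUID] // local reference, mirroring the fix
	m.mu.Unlock()

	time.Sleep(time.Millisecond) // stands in for the device plugin RPC

	m.mu.Lock()
	defer m.mu.Unlock()
	reuse["example.com/gpu"] = []string{"dev-0"}
}

func TestDeviceAllocateParallelSketch(t *testing.T) {
	m := &fakeManager{devicesToReuse: map[string]map[string][]string{}}

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			defer func() {
				if r := recover(); r != nil {
					t.Errorf("allocation for pod-%d panicked: %v", i, r)
				}
			}()
			m.allocate(fmt.Sprintf("pod-%d", i))
		}(i)
	}
	wg.Wait()
}
```

Running such a test with go test -race also surfaces unsynchronized map accesses that do not happen to panic.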
Release note:
Fix race condition in device manager that caused panic when multiple pods allocate devices in parallel

Fixes #136021 (kubelet panic when allocating GPUs in parallel)
/sig node
/kind bug