fix: prevent scheduler crash when PodGroup deleted during session (#4993)
atharrva01 wants to merge 1 commit into volcano-sh:master from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. No approvers have signed off yet; approval is needed from an approver for each of the affected files, indicated by a comment.
Welcome @atharrva01! It looks like this is your first PR to volcano-sh/volcano 🎉 |
Summary of Changes

Hello @atharrva01, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the scheduler's robustness by resolving a race condition: previously, the deletion of a PodGroup during an active scheduling session could crash the scheduler.
Code Review
This pull request addresses a critical race condition that could lead to a scheduler crash. The addition of the nil check for jobInCache.PodGroup is a good defensive measure.
I've added one comment regarding another potential nil pointer dereference in the same line of code that could still cause a panic. By addressing this as well, we can make the scheduler more robust against this class of issues.
Overall, this is a valuable fix for scheduler stability.
pkg/scheduler/cache/cache.go (outdated)

```go
		job.Namespace, job.Name)
	return
}
jobInCache.PodGroup.GetAnnotations()[schedulingapi.JobAllocatedHyperNode] = job.PodGroup.GetAnnotations()[schedulingapi.JobAllocatedHyperNode]
```
While the nil check for jobInCache.PodGroup is a great addition, there's another potential nil pointer dereference here. The GetAnnotations() method can return a nil map if no annotations are present.

- Right side: job.PodGroup.GetAnnotations() could be nil; in Go, reading from a nil map safely yields the zero value, so the copied annotation would simply be empty.
- Left side: jobInCache.PodGroup.GetAnnotations() could also be nil, causing a panic on assignment.

This could lead to the same scheduler crash this PR aims to fix. We can use metav1.SetMetaDataAnnotation for a safer update and add a nil check for the source.
```go
var annotationValue string
if annotations := job.PodGroup.GetAnnotations(); annotations != nil {
	annotationValue = annotations[schedulingapi.JobAllocatedHyperNode]
}
metav1.SetMetaDataAnnotation(&jobInCache.PodGroup.ObjectMeta, schedulingapi.JobAllocatedHyperNode, annotationValue)
```
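For context, a minimal standalone sketch of the map semantics this suggestion relies on, using plain Go maps rather than the real Kubernetes types (setAnnotation is a hypothetical stand-in mimicking the initialize-before-write behavior of metav1.SetMetaDataAnnotation, not the actual API):

```go
// setAnnotation allocates the map if it is nil before writing, so a nil
// annotations map never causes a panic on assignment.
func setAnnotation(annotations map[string]string, key, value string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[key] = value
	return annotations
}

// readAnnotation shows that reading from a nil map is safe in Go:
// it simply yields the zero value ("").
func readAnnotation(annotations map[string]string, key string) string {
	return annotations[key]
}
```

The asymmetry is the key point: in Go, only the write side of the original one-liner can panic on a nil map, which is why the fix initializes the destination before assigning.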
Gemini is right here, please add this too!
Hi @hajnalmt, this update guards against nil annotations when syncing JobAllocatedHyperNode, preventing a potential scheduler crash and ensuring safe metadata updates.
hajnalmt left a comment
Thanks for the change!
We have a lot of nil pointer crash errors lately. Can you add Gemini's suggestions too, please?
Signed-off-by: atharrva01 <atharvaborade568@gmail.com>
force-pushed from 3524d9b to 2227404
Pull request overview
This pull request fixes a critical race condition that can cause the scheduler to panic when a PodGroup is deleted during an active scheduling session. The issue occurs when a session clones jobs with valid PodGroups, but the PodGroup is deleted by a user or controller before the session completes. When the session attempts to update job annotations, it crashes trying to access the now-nil PodGroup in the cache.
Changes:
- Added a defensive nil check in updateJobAnnotations to skip annotation updates when the PodGroup has been deleted
- Changed from direct map assignment to the safer metav1.SetMetaDataAnnotation API
- Added a nil check for the annotations map when reading from the source job
```go
	klog.V(3).Infof("Skip updating annotations for job <%s/%s>: PodGroup was deleted",
		job.Namespace, job.Name)
	return
}
```
If job.PodGroup is nil, calling job.PodGroup.GetAnnotations() will panic. While sessions only clone jobs with non-nil PodGroups (see line 1508), adding a defensive nil check before accessing GetAnnotations() would improve robustness. Consider adding: if job.PodGroup == nil { return } before line 1666.
```go
if job.PodGroup == nil {
	klog.V(3).Infof("Skip updating annotations for job <%s/%s>: PodGroup is nil on job object",
		job.Namespace, job.Name)
	return
}
```
Impact

I found a race condition that can crash the scheduler with a nil pointer dereference when a PodGroup is deleted while a scheduling session is still active. This leads to a full scheduler panic, halting all scheduling until the process is restarted. In production clusters (especially ML/HPC workloads using network topology), this can cause repeated outages under normal job churn.

How to Reproduce

1. A scheduling session opens and clones a job (with a valid PodGroup)
2. The PodGroup is deleted by a user or controller, clearing jobInCache.PodGroup
3. updateJobAnnotations() is called
4. The call to jobInCache.PodGroup.GetAnnotations() panics with a nil pointer dereference

This is a legitimate race between scheduler sessions and PodGroup lifecycle events.
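The steps above can be reproduced in miniature. This is a hedged sketch with simplified stand-in types (podGroup and jobInfo are illustrative, not the real scheduler structures): calling a method whose body dereferences a nil pointer receiver panics, which is what happens when the cached PodGroup disappears mid-session.

```go
type podGroup struct {
	annotations map[string]string
}

// GetAnnotations dereferences its receiver, so calling it on a nil
// *podGroup panics, mirroring jobInCache.PodGroup.GetAnnotations().
func (pg *podGroup) GetAnnotations() map[string]string {
	return pg.annotations
}

type jobInfo struct {
	PodGroup *podGroup
}

// updateUnsafe reproduces the pre-fix flow: no nil check before the
// read-modify-write on the cached job's PodGroup annotations.
func updateUnsafe(cached, session *jobInfo, key string) {
	cached.PodGroup.GetAnnotations()[key] = session.PodGroup.GetAnnotations()[key]
}
```

With cached.PodGroup set to nil, updateUnsafe panics exactly as described in step 4.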
Root Cause

updateJobAnnotations() assumes that jobInCache.PodGroup is always non-nil. However, the cache can legally delete the PodGroup during an active scheduling session, leaving the cached job in a state where PodGroup == nil while the session still attempts to update annotations.

Fix

I added a defensive nil check before accessing jobInCache.PodGroup. If the PodGroup has already been deleted, the annotation update is skipped: it is no longer meaningful, and skipping it avoids crashing the scheduler.

This keeps scheduler behavior safe without changing scheduling semantics.
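Putting the PR's guards together, a minimal sketch of the fixed flow (again with illustrative stand-in types, not the real cache code): return early when either PodGroup is gone, read the source annotations defensively, and allocate the destination map before writing.

```go
type podGroup struct {
	annotations map[string]string
}

func (pg *podGroup) GetAnnotations() map[string]string {
	return pg.annotations
}

type jobInfo struct {
	PodGroup *podGroup
}

// updateSafe sketches the post-fix flow: it skips the update when either
// PodGroup has been deleted, tolerates a nil source annotation map, and
// allocates the destination map before the write so nothing panics.
func updateSafe(cached, session *jobInfo, key string) {
	if cached.PodGroup == nil || session.PodGroup == nil {
		return // PodGroup deleted mid-session: the update is no longer meaningful
	}
	var value string
	if ann := session.PodGroup.GetAnnotations(); ann != nil {
		value = ann[key]
	}
	if cached.PodGroup.annotations == nil {
		cached.PodGroup.annotations = map[string]string{}
	}
	cached.PodGroup.annotations[key] = value
}
```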
Why This Fix Is Safe

The guard only skips an annotation update that is no longer meaningful once the PodGroup is gone; it does not alter scheduling decisions or any other cache state.
Test Results
No new tests were added since this is a defensive guard against a cross-component race that is difficult to reliably unit-test.