fix: prevent scheduler crash when PodGroup deleted during session (#4993)
atharrva01 wants to merge 1 commit into volcano-sh:master from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. No approvers have signed off yet; approval is needed from an approver for each of the affected files, indicated by a comment.
Welcome @atharrva01! It looks like this is your first PR to volcano-sh/volcano 🎉 |
Summary of Changes

Hello @atharrva01, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the scheduler's robustness by resolving a race condition: previously, the deletion of a PodGroup during an active scheduling session could crash the scheduler.
Code Review
This pull request addresses a critical race condition that could lead to a scheduler crash. The addition of the nil check for jobInCache.PodGroup is a good defensive measure.
I've added one comment regarding another potential nil pointer dereference in the same line of code that could still cause a panic. By addressing this as well, we can make the scheduler more robust against this class of issues.
Overall, this is a valuable fix for scheduler stability.
pkg/scheduler/cache/cache.go (outdated)

```go
		job.Namespace, job.Name)
	return
}
jobInCache.PodGroup.GetAnnotations()[schedulingapi.JobAllocatedHyperNode] = job.PodGroup.GetAnnotations()[schedulingapi.JobAllocatedHyperNode]
```
While the nil check for jobInCache.PodGroup is a great addition, there's another potential nil pointer dereference here. The GetAnnotations() method can return a nil map if no annotations are present.

- Right side: job.PodGroup.GetAnnotations() could be nil; in Go, reading from a nil map safely yields the zero value, so the copied annotation would simply be empty.
- Left side: jobInCache.PodGroup.GetAnnotations() could also be nil, causing a panic on assignment.

This could lead to the same scheduler crash this PR aims to fix. We can use metav1.SetMetaDataAnnotation for a safer update and add a nil check for the source.
```go
var annotationValue string
if annotations := job.PodGroup.GetAnnotations(); annotations != nil {
	annotationValue = annotations[schedulingapi.JobAllocatedHyperNode]
}
metav1.SetMetaDataAnnotation(&jobInCache.PodGroup.ObjectMeta, schedulingapi.JobAllocatedHyperNode, annotationValue)
```
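For context, a minimal standalone sketch of the map semantics this suggestion relies on, using plain Go maps rather than the real Kubernetes types (setAnnotation is a hypothetical stand-in mimicking the initialize-before-write behavior of metav1.SetMetaDataAnnotation, not the actual API):

```go
// setAnnotation allocates the map if it is nil before writing, so a nil
// annotations map never causes a panic on assignment.
func setAnnotation(annotations map[string]string, key, value string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[key] = value
	return annotations
}

// readAnnotation shows that reading from a nil map is safe in Go:
// it simply yields the zero value ("").
func readAnnotation(annotations map[string]string, key string) string {
	return annotations[key]
}
```

The asymmetry is the key point: in Go, only the write side of the original one-liner can panic on a nil map, which is why the fix initializes the destination before assigning.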
Gemini is right here, please add this too!
Hi @hajnalmt, this update guards against nil annotations when syncing JobAllocatedHyperNode, preventing a potential scheduler crash and ensuring safe metadata updates.
hajnalmt left a comment
Thanks for the change!
We have a lot of nil pointer crash errors lately. Can you add Gemini's suggestions too, please?
Signed-off-by: atharrva01 <atharvaborade568@gmail.com>
force-pushed from 3524d9b to 2227404
Pull request overview
This pull request fixes a critical race condition that can cause the scheduler to panic when a PodGroup is deleted during an active scheduling session. The issue occurs when a session clones jobs with valid PodGroups, but the PodGroup is deleted by a user or controller before the session completes. When the session attempts to update job annotations, it crashes trying to access the now-nil PodGroup in the cache.
Changes:
- Added a defensive nil check in updateJobAnnotations to skip annotation updates when the PodGroup has been deleted
- Changed from direct map assignment to the safer metav1.SetMetaDataAnnotation API
- Added a nil check for the annotations map when reading from the source job
```go
	klog.V(3).Infof("Skip updating annotations for job <%s/%s>: PodGroup was deleted",
		job.Namespace, job.Name)
	return
}
```
If job.PodGroup is nil, calling job.PodGroup.GetAnnotations() will panic. While sessions only clone jobs with non-nil PodGroups (see line 1508), adding a defensive nil check before accessing GetAnnotations() would improve robustness. Consider adding: if job.PodGroup == nil { return } before line 1666.
```go
if job.PodGroup == nil {
	klog.V(3).Infof("Skip updating annotations for job <%s/%s>: PodGroup is nil on job object",
		job.Namespace, job.Name)
	return
}
```
Impact

I found a race condition that can crash the scheduler with a nil pointer dereference when a PodGroup is deleted while a scheduling session is still active. This leads to a full scheduler panic, halting all scheduling until the process is restarted. In production clusters (especially ML/HPC workloads using network topology), this can cause repeated outages under normal job churn.

How to Reproduce

1. A scheduling session opens and clones a job (with a valid PodGroup)
2. The PodGroup is deleted by a user or controller, clearing jobInCache.PodGroup
3. updateJobAnnotations() is called
4. The call to jobInCache.PodGroup.GetAnnotations() panics with a nil pointer dereference

This is a legitimate race between scheduler sessions and PodGroup lifecycle events.
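The steps above can be reproduced in miniature. This is a hedged sketch with simplified stand-in types (podGroup and jobInfo are illustrative, not the real scheduler structures): calling a method whose body dereferences a nil pointer receiver panics, which is what happens when the cached PodGroup disappears mid-session.

```go
type podGroup struct {
	annotations map[string]string
}

// GetAnnotations dereferences its receiver, so calling it on a nil
// *podGroup panics, mirroring jobInCache.PodGroup.GetAnnotations().
func (pg *podGroup) GetAnnotations() map[string]string {
	return pg.annotations
}

type jobInfo struct {
	PodGroup *podGroup
}

// updateUnsafe reproduces the pre-fix flow: no nil check before the
// read-modify-write on the cached job's PodGroup annotations.
func updateUnsafe(cached, session *jobInfo, key string) {
	cached.PodGroup.GetAnnotations()[key] = session.PodGroup.GetAnnotations()[key]
}
```

With cached.PodGroup set to nil, updateUnsafe panics exactly as described in step 4.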
Root Cause

updateJobAnnotations() assumes that jobInCache.PodGroup is always non-nil. However, the cache can legally delete the PodGroup during an active scheduling session, leaving the cached job in a state where PodGroup == nil while the session still attempts to update annotations.

Fix

I added a defensive nil check before accessing jobInCache.PodGroup. If the PodGroup has already been deleted, the annotation update is skipped: it is no longer meaningful, and skipping it avoids crashing the scheduler.

This keeps scheduler behavior safe without changing scheduling semantics.
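Putting the PR's guards together, a minimal sketch of the fixed flow (again with illustrative stand-in types, not the real cache code): return early when either PodGroup is gone, read the source annotations defensively, and allocate the destination map before writing.

```go
type podGroup struct {
	annotations map[string]string
}

func (pg *podGroup) GetAnnotations() map[string]string {
	return pg.annotations
}

type jobInfo struct {
	PodGroup *podGroup
}

// updateSafe sketches the post-fix flow: it skips the update when either
// PodGroup has been deleted, tolerates a nil source annotation map, and
// allocates the destination map before the write so nothing panics.
func updateSafe(cached, session *jobInfo, key string) {
	if cached.PodGroup == nil || session.PodGroup == nil {
		return // PodGroup deleted mid-session: the update is no longer meaningful
	}
	var value string
	if ann := session.PodGroup.GetAnnotations(); ann != nil {
		value = ann[key]
	}
	if cached.PodGroup.annotations == nil {
		cached.PodGroup.annotations = map[string]string{}
	}
	cached.PodGroup.annotations[key] = value
}
```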
Why This Fix Is Safe

The guard only skips an annotation update that is no longer meaningful once the PodGroup is gone; it does not alter scheduling decisions or any other cache state.
Test Results
No new tests were added since this is a defensive guard against a cross-component race that is difficult to reliably unit-test.