
fix: prevent scheduler crash when PodGroup deleted during session#4993

Open
atharrva01 wants to merge 1 commit into volcano-sh:master from atharrva01:fix/nil-podgroup-crash

Conversation

@atharrva01

Impact

I found a race condition that can crash the scheduler with a nil pointer dereference when a PodGroup is deleted while a scheduling session is still active. This leads to a full scheduler panic, halting all scheduling until the process is restarted. In production clusters (especially ML/HPC workloads using network topology), this can cause repeated outages under normal job churn.


How to Reproduce

  1. A scheduling session starts and clones jobs from the cache (job has a non-nil PodGroup)
  2. While the session is still running, the PodGroup is deleted by a user or controller
  3. Cache processes the delete event and unsets jobInCache.PodGroup
  4. The session continues using its cloned copy and updates job annotations
  5. Session closes and updateJobAnnotations() is called
  6. Scheduler panics when accessing jobInCache.PodGroup.GetAnnotations()

This is a legitimate race between scheduler sessions and PodGroup lifecycle events.


Root Cause

updateJobAnnotations() assumes that jobInCache.PodGroup is always non-nil.
However, the cache can legally delete the PodGroup during an active scheduling session, leaving the cached job in a state where PodGroup == nil while the session still attempts to update annotations.


Fix

I added a defensive nil check before accessing jobInCache.PodGroup.
If the PodGroup has already been deleted, the annotation update is skipped: it is no longer meaningful, and skipping it avoids crashing the scheduler.

if jobInCache.PodGroup == nil {
    klog.V(3).Infof(
        "Skip updating annotations for job <%s/%s>: PodGroup was deleted",
        job.Namespace, job.Name,
    )
    return
}

This keeps scheduler behavior safe without changing scheduling semantics.


Why This Fix Is Safe

  • The PodGroup no longer exists, so updating its annotations has no effect
  • Behavior remains unchanged for valid jobs
  • Prevents an unrecoverable scheduler panic
  • Minimal and localized change (~5 lines)

Test Results


No new tests were added since this is a defensive guard against a cross-component race that is difficult to reliably unit-test.

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign thor-wl for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot
Contributor

Welcome @atharrva01! It looks like this is your first PR to volcano-sh/volcano 🎉

@gemini-code-assist

Summary of Changes

Hello @atharrva01, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the scheduler's robustness by resolving a race condition. Previously, the deletion of a PodGroup during an active scheduling session could cause a nil pointer dereference and a subsequent scheduler crash. The implemented fix adds a crucial nil check, ensuring the scheduler can gracefully handle such concurrent events without panicking, thereby improving overall system stability.

Highlights

  • Scheduler Stability Improvement: Addresses a critical race condition that could lead to a scheduler panic (nil pointer dereference) if a PodGroup is deleted while a scheduling session is active, ensuring the scheduler remains operational.
  • Defensive Nil Check Implementation: Introduces a defensive nil check for jobInCache.PodGroup within the updateJobAnnotations function, preventing crashes when the associated PodGroup has been concurrently deleted.


@volcano-sh-bot volcano-sh-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jan 25, 2026
@atharrva01
Author

hi @hajnalmt @hwdef , This adds a defensive nil check to avoid a scheduler panic when a PodGroup is deleted during an active scheduling session.


@gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical race condition that could lead to a scheduler crash. The addition of the nil check for jobInCache.PodGroup is a good defensive measure.

I've added one comment regarding another potential nil pointer dereference in the same line of code that could still cause a panic. By addressing this as well, we can make the scheduler more robust against this class of issues.

Overall, this is a valuable fix for scheduler stability.

job.Namespace, job.Name)
return
}
jobInCache.PodGroup.GetAnnotations()[schedulingapi.JobAllocatedHyperNode] = job.PodGroup.GetAnnotations()[schedulingapi.JobAllocatedHyperNode]


Severity: high

While the nil check for jobInCache.PodGroup is a great addition, there is another potential panic on this line: GetAnnotations() can return a nil map when no annotations are present.

  1. Right side: if job.PodGroup is nil, calling job.PodGroup.GetAnnotations() panics; note that merely reading from the returned nil map is safe in Go and yields the zero value.
  2. Left side: assigning into a nil jobInCache.PodGroup.GetAnnotations() map panics with "assignment to entry in nil map".

This could lead to the same scheduler crash this PR aims to fix. We can use metav1.SetMetaDataAnnotation for a safer update and add a nil check for the source.

		var annotationValue string
		if annotations := job.PodGroup.GetAnnotations(); annotations != nil {
			annotationValue = annotations[schedulingapi.JobAllocatedHyperNode]
		}
		metav1.SetMetaDataAnnotation(&jobInCache.PodGroup.ObjectMeta, schedulingapi.JobAllocatedHyperNode, annotationValue)

Contributor


Gemini is right here, please add this too!

Author


hi @hajnalmt This update guards against nil annotations when syncing JobAllocatedHyperNode, preventing a potential scheduler crash and ensuring safe metadata updates.

Contributor

@hajnalmt hajnalmt left a comment


Thanks for the change!
We have seen a lot of nil pointer crash errors lately. Can you add Gemini's suggestions too, please?


Signed-off-by: atharrva01 <atharvaborade568@gmail.com>
@atharrva01 force-pushed the fix/nil-podgroup-crash branch from 3524d9b to 2227404 on February 20, 2026 21:12
Copilot AI review requested due to automatic review settings February 20, 2026 21:12
@volcano-sh-bot volcano-sh-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 20, 2026
Contributor

Copilot AI left a comment


Pull request overview

This pull request fixes a critical race condition that can cause the scheduler to panic when a PodGroup is deleted during an active scheduling session. The issue occurs when a session clones jobs with valid PodGroups, but the PodGroup is deleted by a user or controller before the session completes. When the session attempts to update job annotations, it crashes trying to access the now-nil PodGroup in the cache.

Changes:

  • Added defensive nil check in updateJobAnnotations to skip annotation updates when PodGroup has been deleted
  • Changed from direct map assignment to safer metav1.SetMetaDataAnnotation API
  • Added nil check for annotations map when reading from source job


klog.V(3).Infof("Skip updating annotations for job <%s/%s>: PodGroup was deleted",
job.Namespace, job.Name)
return
}

Copilot AI Feb 20, 2026


If job.PodGroup is nil, calling job.PodGroup.GetAnnotations() will panic. While sessions only clone jobs with non-nil PodGroups (see line 1508), adding a defensive nil check before accessing GetAnnotations() would improve robustness. Consider adding: if job.PodGroup == nil { return } before line 1666.

Suggested change:

    if job.PodGroup == nil {
        klog.V(3).Infof("Skip updating annotations for job <%s/%s>: PodGroup is nil on job object",
            job.Namespace, job.Name)
        return
    }
