
Fix JobSet webhook rejection on TrainJob suspend by using merge patch #28

Closed
abhijeet-dhumal wants to merge 2 commits into opendatahub-io:main from abhijeet-dhumal:fix-jobset-webhook-err

Conversation

@abhijeet-dhumal
Member

@abhijeet-dhumal abhijeet-dhumal commented Nov 26, 2025

When attempting to suspend or resume a running TrainJob, the JobSet webhook rejects the update with:

admission webhook "vjobset.kb.io" denied the request:
spec.replicatedJobs: Invalid value: []v1alpha2.ReplicatedJob(nil): field is immutable

TrainJob suspend/resume blocked by JobSet webhook validation

Root cause:

  • Controller uses Server-Side Apply (SSA) to update JobSet
  • SSA sends ApplyConfiguration with all fields (including immutable ones)
  • JobSet webhook validates immutable fields haven't changed
  • Webhook rejects because it can't distinguish "unchanged" from "changed"
  • TrainJob suspend never propagates to JobSet → pods keep running

Solution:
When only suspend changes, use a JSON merge patch (client.MergeFrom) instead of SSA:

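  // Snapshot the current JobSet as the patch base, then change only suspend.
  patch := client.MergeFrom(oldJobSet.DeepCopy())
  oldJobSet.Spec.Suspend = ptr.To(newSuspend)
  if err := c.Patch(ctx, oldJobSet, patch); err != nil { // c: the controller's client.Client (name assumed)
      return err
  }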

The patch body contains only the suspend field, so the immutable fields are never resent and the webhook's immutability check passes.
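To illustrate why the merge patch carries only suspend, here is a minimal standard-library sketch of an RFC 7386 style diff. It is a simplified stand-in for what a merge-patch client computes from the base and modified objects, not the controller-runtime implementation. The names mergeDiff and suspendPatch and the map-based JobSet-like object are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// mergeDiff returns a JSON merge patch (RFC 7386 style) that turns oldObj
// into newObj: unchanged keys are omitted, removed keys are set to null.
// Hypothetical helper for illustration only.
func mergeDiff(oldObj, newObj map[string]any) map[string]any {
	patch := map[string]any{}
	for k, nv := range newObj {
		ov, ok := oldObj[k]
		if !ok {
			patch[k] = nv
			continue
		}
		om, oldIsMap := ov.(map[string]any)
		nm, newIsMap := nv.(map[string]any)
		if oldIsMap && newIsMap {
			// Recurse into nested objects and keep only non-empty diffs.
			if sub := mergeDiff(om, nm); len(sub) > 0 {
				patch[k] = sub
			}
			continue
		}
		if !reflect.DeepEqual(ov, nv) {
			patch[k] = nv
		}
	}
	for k := range oldObj {
		if _, ok := newObj[k]; !ok {
			patch[k] = nil
		}
	}
	return patch
}

// suspendPatch copies obj, flips spec.suspend on the copy, and returns the
// merge patch between the original and the copy as a JSON string.
func suspendPatch(obj map[string]any, suspend bool) (string, error) {
	raw, err := json.Marshal(obj)
	if err != nil {
		return "", err
	}
	// Round-trip through JSON so both sides use the same decoded types.
	var original, modified map[string]any
	if err := json.Unmarshal(raw, &original); err != nil {
		return "", err
	}
	if err := json.Unmarshal(raw, &modified); err != nil {
		return "", err
	}
	spec, ok := modified["spec"].(map[string]any)
	if !ok {
		return "", fmt.Errorf("object has no spec map")
	}
	spec["suspend"] = suspend
	out, err := json.Marshal(mergeDiff(original, modified))
	return string(out), err
}

func main() {
	jobSet := map[string]any{
		"metadata": map[string]any{"name": "demo"},
		"spec": map[string]any{
			"suspend": false,
			"replicatedJobs": []any{
				map[string]any{"name": "node", "replicas": 1},
			},
		},
	}
	p, err := suspendPatch(jobSet, true)
	if err != nil {
		panic(err)
	}
	fmt.Println(p) // prints {"spec":{"suspend":true}}
}
```

The replicatedJobs entry is identical on both sides of the diff, so it does not appear in the patch body at all.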

  • Skips update when both old and new JobSets are unsuspended (running).
  • Reduces unnecessary reconciliation for active TrainJobs.
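The update rules described above can be sketched as a small decision function. This is a hypothetical outline and not the PR's code: the names action and decide are made up, and falling back to Server-Side Apply when the JobSet does not exist yet, or when both old and new are suspended, is an assumption.

```go
package main

import "fmt"

// action names the update path the controller would take (hypothetical type).
type action string

const (
	actionSkip         action = "skip"
	actionPatchSuspend action = "merge-patch-suspend"
	actionApply        action = "server-side-apply"
)

// isSuspended treats a nil pointer as "not suspended".
func isSuspended(s *bool) bool { return s != nil && *s }

// decide picks an update path from the old and new suspend values.
func decide(exists bool, oldSuspend, newSuspend *bool) action {
	switch {
	case !exists:
		// Assumption: creation still goes through Server-Side Apply.
		return actionApply
	case isSuspended(oldSuspend) != isSuspended(newSuspend):
		// Suspend or resume: send only the suspend field.
		return actionPatchSuspend
	case !isSuspended(oldSuspend):
		// Both old and new are running: nothing to update.
		return actionSkip
	default:
		// Assumption: both suspended, so a full apply is still allowed.
		return actionApply
	}
}

func ptr(b bool) *bool { return &b }

func main() {
	fmt.Println(decide(true, ptr(false), ptr(true))) // merge-patch-suspend
	fmt.Println(decide(true, ptr(false), nil))       // skip
	fmt.Println(decide(true, ptr(true), ptr(true)))  // server-side-apply
	fmt.Println(decide(false, nil, ptr(false)))      // server-side-apply
}
```

Keeping the decision separate from the client calls makes each branch easy to cover with a table-driven unit test.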

Tested with:

  • Suspend running TrainJob - pods terminate
  • Resume suspended TrainJob - pods recreate
  • Multiple suspend/resume cycles
  • Integration with Kueue workload suspension

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

…patch

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@coderabbitai

coderabbitai Bot commented Nov 26, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@sutaakar
Collaborator

@abhijeet-dhumal Can you please provide unit tests for your change?

@robert-bell
Collaborator

@abhijeet-dhumal does this need fixing upstream too?

I'll let @astefanutti review the approach. I don't feel I understand this well enough.

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
@sutaakar
Collaborator

Looks OK to me, though I have very limited knowledge of the context.

@astefanutti

I haven't been able to reproduce this with Trainer 2.1.0, on either OpenShift or KinD.

@abhijeet-dhumal
Member Author

Thanks a lot @astefanutti for testing this.
I tried reproducing the same issue on the main branch as well as on the upstream release-2.1 version.
I'm not sure which version I built this operator image from: quay.io/abdhumal/trainer:v2.1.0-rhai-progression-25nov
but with this specific image the error shows up repeatedly. 👀

Since we couldn't reproduce this issue reliably, I'm closing this PR. Thanks!


4 participants