
feat: support multiple replicas for non-trainer replicatedJobs#3284

Open
krishdef7 wants to merge 1 commit into kubeflow:master from krishdef7:feature/multi-replica-replicatedjobs

Conversation

@krishdef7
Contributor

Description

Resolves #2318 (partial).

Currently, newRuntimeInfo() in trainingruntime.go derives pod count only from Template.Spec.Parallelism, effectively assuming replicas=1. This PR reads .replicas from each replicatedJob and uses it to correctly compute pod counts and configure the JobSet.

Changes

  • trainingruntime.go: Read replicas from each replicatedJob and multiply it by parallelism for the PodGroup MinMember count. The trainer ancestor case uses NumNodes directly and is unaffected.
  • builder.go: Preserve the Replicas field from the runtime template instead of unconditionally overwriting it with 1.
  • jobset.go: Split the Parallelism/Completions assignment in Build(): the trainer uses the count directly, while non-trainer jobs divide by replicas to recover the per-replica value.

Testing

Added two test cases to TestTrainingRuntimeNewObjects:

  • Non-trainer replicatedJob with replicas=3: verifies pod count multiplication, correct Parallelism=1, Replicas=3, and MinMember=14
  • Trainer replicatedJob with replicas=4 and NumNodes=5: verifies NumNodes takes precedence (MinMember=7, not 5*4=20)

Notes

Endpoint generation in IdentifyPodNetwork for multi-replica non-trainer jobs is not addressed here. Initializer jobs do not participate in training network topology, so this is currently harmless and tracked by the existing TODO comment referencing #2318.

Copilot AI review requested due to automatic review settings (March 6, 2026 16:48)
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

github-actions bot commented Mar 6, 2026

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@krishdef7 krishdef7 force-pushed the feature/multi-replica-replicatedjobs branch from 1fcc355 to 4ac0c79 on March 6, 2026 16:49
@krishdef7
Contributor Author

/cc @andreyvelich @tenzen-y


Copilot AI left a comment


Pull request overview

This PR partially implements issue #2318, adding support for multiple replicas in non-trainer replicatedJobs of a TrainingRuntime. It fixes a bug where pod counts were computed assuming replicas=1, allowing, for example, a DatasetInitializer job with replicas=3 to correctly scale the PodGroup MinMember and set per-replica Parallelism/Completions.

Changes:

  • trainingruntime.go: Compute count = parallelism × replicas for non-trainer jobs; trainer uses NumNodes directly and ignores the replicas multiplier.
  • jobset.go: In Build(), split Parallelism/Completions assignment: trainer sets them directly from ps.Count; non-trainer divides ps.Count by replicas to recover the per-replica value.
  • builder.go: Change unconditional Replicas = 1 overwrite to a nil-guard (if Replicas == nil { Replicas = 1 }), preserving any replicas already set from the runtime template.
  • trainingruntime_test.go: Two new test cases validating the multi-replica non-trainer flow and the trainer-ignores-replicas behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Reviewed files:

  • pkg/runtime/core/trainingruntime.go: Multiply parallelism × replicas for the pod count; the trainer case still overrides it to NumNodes
  • pkg/runtime/framework/plugins/jobset/jobset.go: Split trainer/non-trainer Parallelism assignment; non-trainer divides the total count back by replicas
  • pkg/runtime/framework/plugins/jobset/builder.go: Preserve existing Replicas from the template instead of unconditionally overwriting with 1
  • pkg/runtime/core/trainingruntime_test.go: New tests for the multi-replica DatasetInitializer and trainer-replicas-ignored scenarios

@krishdef7 krishdef7 force-pushed the feature/multi-replica-replicatedjobs branch 2 times, most recently from abcf46b to 83489fb on March 6, 2026 17:32
@krishdef7
Contributor Author

@andreyvelich @tenzen-y, wanted to flag this for your review when you get a chance. The core change is in three files: trainingruntime.go (multiply parallelism × replicas for non-trainer jobs), builder.go (nil-guard instead of unconditional Replicas=1), and jobset.go (split trainer/non-trainer Parallelism assignment). Two new test cases cover the multi-replica and trainer-ignores-replicas scenarios.

Support .template.spec.replicatedJobs[*].replicas > 1 to allow
multiple replicated Jobs instead of a single Job with thousands of
completions, which causes kube-controller-manager memory leaks and
reconciliation delays at scale.

Changes:
- trainingruntime.go: read replicas from each replicatedJob and
  multiply with parallelism for PodGroup MinMember count; trainer
  ancestor uses NumNodes directly and is unaffected
- builder.go: preserve Replicas field from runtime template instead
  of unconditionally overwriting with 1
- jobset.go: split Parallelism/Completions assignment in Build() —
  trainer uses count directly, non-trainer divides by replicas to
  get per-replica value

Note: endpoint generation in IdentifyPodNetwork for multi-replica
non-trainer jobs is tracked separately; initializer jobs do not
participate in training network topology so this is currently harmless.

Fixes kubeflow#2318

Signed-off-by: krishdef7 <gargkrish06@gmail.com>
@krishdef7 krishdef7 force-pushed the feature/multi-replica-replicatedjobs branch from 83489fb to 44d118d on March 14, 2026 05:29

Successfully merging this pull request may close these issues.

KEP-2170: Support hundreds and thousands worker nodes for a single training Job