
feat: support multiple replicas for non-trainer replicatedJobs#3284

Open
krishdef7 wants to merge 1 commit into kubeflow:master from krishdef7:feature/multi-replica-replicatedjobs

Conversation

@krishdef7
Contributor

Description

Resolves #2318 (partial).

Currently, newRuntimeInfo() in trainingruntime.go derives pod count only from Template.Spec.Parallelism, effectively assuming replicas=1. This PR reads .replicas from each replicatedJob and uses it to correctly compute pod counts and configure the JobSet.

Changes

  • trainingruntime.go: Read replicas from each replicatedJob and multiply it by parallelism for the PodGroup MinMember count. The trainer ancestor case uses NumNodes directly and is unaffected.
  • builder.go: Preserve the Replicas field from the runtime template instead of unconditionally overwriting it with 1.
  • jobset.go: Split the Parallelism/Completions assignment in Build(): the trainer uses the count directly, while non-trainer jobs divide by replicas to recover the per-replica value.

Testing

Added two test cases to TestTrainingRuntimeNewObjects:

  • Non-trainer replicatedJob with replicas=3: verifies pod count multiplication, correct Parallelism=1, Replicas=3, and MinMember=14
  • Trainer replicatedJob with replicas=4 and NumNodes=5: verifies NumNodes takes precedence (MinMember=7, not 5*4=20)

Notes

Endpoint generation in IdentifyPodNetwork for multi-replica non-trainer jobs is not addressed here. Initializer jobs do not participate in training network topology, so this is currently harmless and tracked by the existing TODO comment referencing #2318.

Copilot AI review requested due to automatic review settings (March 6, 2026 16:48)
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

github-actions bot commented Mar 6, 2026

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@krishdef7 krishdef7 force-pushed the feature/multi-replica-replicatedjobs branch from 1fcc355 to 4ac0c79 on March 6, 2026 16:49
@krishdef7
Contributor Author

/cc @andreyvelich @tenzen-y


Copilot AI left a comment


Pull request overview

This PR partially implements issue #2318, adding support for multiple replicas in non-trainer replicatedJobs of a TrainingRuntime. It fixes a bug where pod counts were computed assuming replicas=1, allowing, for example, a DatasetInitializer job with replicas=3 to correctly scale the PodGroup MinMember and set per-replica Parallelism/Completions.

Changes:

  • trainingruntime.go: Compute count = parallelism × replicas for non-trainer jobs; trainer uses NumNodes directly and ignores the replicas multiplier.
  • jobset.go: In Build(), split Parallelism/Completions assignment: trainer sets them directly from ps.Count; non-trainer divides ps.Count by replicas to recover the per-replica value.
  • builder.go: Change unconditional Replicas = 1 overwrite to a nil-guard (if Replicas == nil { Replicas = 1 }), preserving any replicas already set from the runtime template.
  • trainingruntime_test.go: Two new test cases validating the multi-replica non-trainer flow and the trainer-ignores-replicas behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Reviewed files:

  • pkg/runtime/core/trainingruntime.go: Multiply parallelism × replicas for the pod count; the trainer case still overrides it to NumNodes
  • pkg/runtime/framework/plugins/jobset/jobset.go: Split trainer/non-trainer Parallelism assignment; non-trainer divides the total count back by replicas
  • pkg/runtime/framework/plugins/jobset/builder.go: Preserve existing Replicas from the template instead of unconditionally overwriting with 1
  • pkg/runtime/core/trainingruntime_test.go: New tests for the multi-replica DatasetInitializer and trainer-replicas-ignored scenarios

@krishdef7 krishdef7 force-pushed the feature/multi-replica-replicatedjobs branch 2 times, most recently from abcf46b to 83489fb on March 6, 2026 17:32
@krishdef7
Contributor Author

@andreyvelich @tenzen-y, wanted to flag this for your review when you get a chance. The core change is in three files: trainingruntime.go (multiply parallelism × replicas for non-trainer jobs), builder.go (nil-guard instead of unconditional Replicas=1), and jobset.go (split trainer/non-trainer Parallelism assignment). Two new test cases cover the multi-replica and trainer-ignores-replicas scenarios.

Support .template.spec.replicatedJobs[*].replicas > 1 to allow
multiple replicated Jobs instead of a single Job with thousands of
completions, which causes kube-controller-manager memory leaks and
reconciliation delays at scale.

Changes:
- trainingruntime.go: read replicas from each replicatedJob and
  multiply with parallelism for PodGroup MinMember count; trainer
  ancestor uses NumNodes directly and is unaffected
- builder.go: preserve Replicas field from runtime template instead
  of unconditionally overwriting with 1
- jobset.go: split Parallelism/Completions assignment in Build() —
  trainer uses count directly, non-trainer divides by replicas to
  get per-replica value

Note: endpoint generation in IdentifyPodNetwork for multi-replica
non-trainer jobs is tracked separately; initializer jobs do not
participate in training network topology so this is currently harmless.

Fixes kubeflow#2318

Signed-off-by: krishdef7 <gargkrish06@gmail.com>
@krishdef7 krishdef7 force-pushed the feature/multi-replica-replicatedjobs branch from 83489fb to 44d118d on March 14, 2026 05:29

Successfully merging this pull request may close these issues.

KEP-2170: Support hundreds and thousands worker nodes for a single training Job