feat: support multiple replicas for non-trainer replicatedJobs #3284
krishdef7 wants to merge 1 commit into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has not yet been approved by an approver; approval is needed from an approver in each of the affected files. The full list of commands accepted by this bot can be found here.
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
krishdef7 force-pushed from 1fcc355 to 4ac0c79
Pull request overview
This PR partially implements issue #2318, adding support for multiple replicas in non-trainer replicatedJobs of a TrainingRuntime. It fixes a bug where pod counts were computed assuming replicas=1, allowing e.g. a DatasetInitializer job with replicas=3 to correctly scale the PodGroup MinMember and set per-replica Parallelism/Completions.
Changes:
- `trainingruntime.go`: compute `count = parallelism × replicas` for non-trainer jobs; the trainer uses `NumNodes` directly and ignores the `replicas` multiplier.
- `jobset.go`: in `Build()`, split the Parallelism/Completions assignment: the trainer sets them directly from `ps.Count`; non-trainer jobs divide `ps.Count` by `replicas` to recover the per-replica value.
- `builder.go`: change the unconditional `Replicas = 1` overwrite to a nil-guard (`if Replicas == nil { Replicas = 1 }`), preserving any replicas already set from the runtime template.
- `trainingruntime_test.go`: two new test cases validating the multi-replica non-trainer flow and the trainer-ignores-replicas behavior.
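The count rule described above can be sketched as a small Go function. This is an illustrative simplification, not the actual `trainingruntime.go` code; the names `podCount` and `isTrainer` are hypothetical stand-ins:

```go
package main

import "fmt"

// podCount sketches the pod-count rule this PR introduces: a non-trainer
// replicatedJob contributes parallelism * replicas pods, while the trainer
// job's count comes from NumNodes and ignores the replicas multiplier.
func podCount(isTrainer bool, parallelism, replicas, numNodes int32) int32 {
	if isTrainer {
		return numNodes // trainer: replicas multiplier is ignored
	}
	return parallelism * replicas // non-trainer: scale by replicas
}

func main() {
	// A DatasetInitializer with parallelism=1 and replicas=3 now counts 3 pods.
	fmt.Println(podCount(false, 1, 3, 0)) // 3
	// A trainer with NumNodes=5 counts 5 pods regardless of replicas.
	fmt.Println(podCount(true, 1, 4, 5)) // 5
}
```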
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/runtime/core/trainingruntime.go | Multiply parallelism × replicas for the pod count; trainer case still overrides to NumNodes |
| pkg/runtime/framework/plugins/jobset/jobset.go | Split trainer/non-trainer Parallelism assignment; non-trainer divides total count back by replicas |
| pkg/runtime/framework/plugins/jobset/builder.go | Preserve existing Replicas from template instead of unconditionally overwriting with 1 |
| pkg/runtime/core/trainingruntime_test.go | New tests for multi-replica DatasetInitializer and trainer-replicas-ignored scenario |
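The `jobset.go` change summarized in the table, dividing the total count back down to a per-replica value, can be sketched as follows. This is a hypothetical simplification; `perReplicaParallelism` is not the actual plugin code:

```go
package main

import "fmt"

// perReplicaParallelism sketches the Build() split: the trainer job takes the
// total count as-is, while a non-trainer job's count is a total across all
// replicas, so dividing by replicas recovers the per-replica
// Parallelism/Completions value to set on each replicated Job.
func perReplicaParallelism(isTrainer bool, totalCount, replicas int32) int32 {
	if isTrainer || replicas == 0 {
		return totalCount
	}
	return totalCount / replicas
}

func main() {
	// Non-trainer: total count 3 across 3 replicas -> Parallelism=1 per replica.
	fmt.Println(perReplicaParallelism(false, 3, 3)) // 1
	// Trainer: count is used directly.
	fmt.Println(perReplicaParallelism(true, 5, 4)) // 5
}
```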
krishdef7 force-pushed from abcf46b to 83489fb
@andreyvelich @tenzen-y, wanted to flag this for your review when you get a chance. The core change is in three files: trainingruntime.go (multiply parallelism × replicas for non-trainer jobs), builder.go (nil-guard instead of unconditional Replicas=1), and jobset.go (split trainer/non-trainer Parallelism assignment). Two new test cases cover the multi-replica and trainer-ignores-replicas scenarios.
Support `.template.spec.replicatedJobs[*].replicas > 1` to allow multiple replicated Jobs instead of a single Job with thousands of completions, which causes kube-controller-manager memory leaks and reconciliation delays at scale.

Changes:
- trainingruntime.go: read replicas from each replicatedJob and multiply with parallelism for the PodGroup MinMember count; the trainer ancestor uses NumNodes directly and is unaffected
- builder.go: preserve the Replicas field from the runtime template instead of unconditionally overwriting it with 1
- jobset.go: split the Parallelism/Completions assignment in Build(): the trainer uses count directly, non-trainer jobs divide by replicas to get the per-replica value

Note: endpoint generation in IdentifyPodNetwork for multi-replica non-trainer jobs is tracked separately; initializer jobs do not participate in the training network topology, so this is currently harmless.

Fixes kubeflow#2318

Signed-off-by: krishdef7 <gargkrish06@gmail.com>
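The builder.go change in the commit message, defaulting Replicas only when unset, amounts to replacing an unconditional overwrite with a nil-guard. A minimal sketch, using an illustrative `replicatedJob` struct rather than the real JobSet API type:

```go
package main

import "fmt"

// replicatedJob is a stand-in for the JobSet ReplicatedJob spec, where
// Replicas is a pointer so "unset" (nil) can be distinguished from "set to 1".
type replicatedJob struct {
	Replicas *int32
}

// defaultReplicas sketches the nil-guard: only default Replicas to 1 when the
// runtime template left it unset, instead of always overwriting it.
func defaultReplicas(rj *replicatedJob) {
	if rj.Replicas == nil {
		one := int32(1)
		rj.Replicas = &one
	}
}

func main() {
	three := int32(3)
	fromTemplate := &replicatedJob{Replicas: &three} // template set replicas=3
	unset := &replicatedJob{}                        // template left it unset

	defaultReplicas(fromTemplate)
	defaultReplicas(unset)

	fmt.Println(*fromTemplate.Replicas, *unset.Replicas) // 3 1
}
```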
krishdef7 force-pushed from 83489fb to 44d118d
Description
Resolves #2318 (partial).
Currently, `newRuntimeInfo()` in `trainingruntime.go` derives the pod count only from `Template.Spec.Parallelism`, effectively assuming `replicas=1`. This PR reads `.replicas` from each replicatedJob and uses it to correctly compute pod counts and configure the JobSet.

Changes
- `trainingruntime.go`: read `replicas` from each replicatedJob and multiply with `parallelism` for the PodGroup `MinMember` count. The trainer ancestor case uses `NumNodes` directly and is unaffected.
- `builder.go`: preserve the `Replicas` field from the runtime template instead of unconditionally overwriting it with `1`.
- `jobset.go`: split the `Parallelism`/`Completions` assignment in `Build()`: the trainer uses `count` directly, non-trainer jobs divide by `replicas` to get the per-replica value.

Testing
Added two test cases to `TestTrainingRuntimeNewObjects`:
- `replicas=3`: verifies the pod count multiplication, with correct `Parallelism=1`, `Replicas=3`, and `MinMember=14`
- `replicas=4` and `NumNodes=5`: verifies `NumNodes` takes precedence (`MinMember=7`, not `5*4=20`)

Notes
Endpoint generation in `IdentifyPodNetwork` for multi-replica non-trainer jobs is not addressed here. Initializer jobs do not participate in the training network topology, so this is currently harmless and tracked by the existing TODO comment referencing #2318.