Add Scheduler Backend framework #293
Conversation
Not sure I quite understand the goals of this; it's already possible to support different schedulers via the pod specs (though KAI is the only gang scheduler currently working). I'd suggest kicking off work like this with a GitHub issue with plenty of detail, and a Discord discussion as well.
Sure, let me create a new issue to introduce this. I can give some background here. This is a real request from one of our customers: we have some schedulers that want to integrate with Grove, and it would be great to have a unified scheduler backend. That way, we can support other schedulers easily. We need to support multiple schedulers as backends, and in particular the Kubernetes 1.34 Workload API. Once we have this backend framework, we can easily add support for new schedulers such as the default kube-scheduler or Koordinator. This PR only introduces the scheduler backend framework. For the KAI scheduler backend, I won't change the current workflow, which means KAI will still handle the PodGang and create PodGroups/pods.
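To make the idea concrete, here is a minimal sketch of what such a scheduler backend interface could look like; the interface name, method set, and import path are illustrative assumptions, not the PR's actual API.

package backend

import (
    "context"

    groveschedulerv1alpha1 "github.com/NVIDIA/grove/scheduler/api/core/v1alpha1" // import path assumed for illustration
)

// SchedulerBackend abstracts the scheduler-specific resources (e.g. KAI PodGroups,
// or kube-scheduler Workload objects) that Grove would manage for a PodGang.
type SchedulerBackend interface {
    // Name returns the backend identifier, e.g. "KAI-Scheduler".
    Name() string
    // Sync creates or updates the scheduler-specific resources for the given PodGang.
    Sync(ctx context.Context, podGang *groveschedulerv1alpha1.PodGang) error
    // Delete removes the scheduler-specific resources for the given PodGang.
    Delete(ctx context.Context, podGang *groveschedulerv1alpha1.PodGang) error
}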
Ronkahn21 left a comment
Overall looks great! A few architectural points to consider:
- Controller Responsibility: I don't think the pcs-controller should be updating the PodGang status. Ideally, it should only handle the creation, leaving the podGang-controller to manage its own status.
- Scaling & Performance: We should discuss the PodGang pod reference fields. Adding this to the pcs-controller increases its complexity. For better scalability, it might be better to let the PodGroup own the pod status before we move toward creating the backend API.
Since the API changes are currently out of scope, we can sync on this later. Amazing job overall, thanks!
- // checkAndRemovePodSchedulingGates removes scheduling gates from pods when their dependencies are satisfied
+ // checkAndRemovePodSchedulingGates removes scheduling gates from pods when PodGang is initialized
Can you explain what it means for a PodGang to be initialized, and why it is a requirement for lifting the scheduling gate?
@Ronkahn21, I added some background and design details to the PR description. Because we create an empty PodGang first, we want to reduce the number of PCLQ syncs. We add a new PodGangInitialized condition to indicate that the scheduling gate can be removed once it is True.
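For illustration, a minimal sketch of how gate removal could be keyed on that condition, assuming the PodGang exposes a standard []metav1.Condition in its status; the condition type string and gate handling shown here are assumptions, not the PR's exact code.

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/meta"
)

// isPodGangInitialized reports whether the (assumed) PodGangInitialized condition is True.
func isPodGangInitialized(podGang *groveschedulerv1alpha1.PodGang) bool {
    return meta.IsStatusConditionTrue(podGang.Status.Conditions, "PodGangInitialized")
}

// removeSchedulingGate drops the named scheduling gate from the pod spec in place.
func removeSchedulingGate(pod *corev1.Pod, gateName string) {
    kept := pod.Spec.SchedulingGates[:0]
    for _, gate := range pod.Spec.SchedulingGates {
        if gate.Name != gateName {
            kept = append(kept, gate)
        }
    }
    pod.Spec.SchedulingGates = kept
}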
// Also verify that all PodGroups have enough podReferences to meet minReplicas
for _, pg := range podGang.Spec.PodGroups {
    if int32(len(pg.PodReferences)) < pg.MinReplicas {
        return false
    }
}
Can you explain why it is a requirement for filtering the events? I am afraid we will be missing event handling.
For reconcilers that are registered to react to PodGang create/update events, we should now consider looking at the PodGangInitialized condition with its value set to True before allowing the event to be enqueued for the reconciler. I added it because I only want to handle events when we have a stable status (enough replicas). If PodGangInitialized is False, it means the PodGang hasn't been completely created yet (it has no PodReferences), and in that case I think we don't need to handle the event.
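A rough sketch of such a filter using controller-runtime predicates; the condition type string and the assumption that PodGang.Status.Conditions is a []metav1.Condition are mine, not the PR's.

import (
    "k8s.io/apimachinery/pkg/api/meta"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/event"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

// podGangInitializedPredicate only lets PodGang events through once the
// PodGangInitialized condition is True, so reconcilers see a stable PodGang.
func podGangInitializedPredicate() predicate.Predicate {
    initialized := func(obj client.Object) bool {
        podGang, ok := obj.(*groveschedulerv1alpha1.PodGang)
        if !ok {
            return false
        }
        return meta.IsStatusConditionTrue(podGang.Status.Conditions, "PodGangInitialized")
    }
    return predicate.Funcs{
        CreateFunc: func(e event.CreateEvent) bool { return initialized(e.Object) },
        UpdateFunc: func(e event.UpdateEvent) bool { return initialized(e.ObjectNew) },
    }
}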
// updatePodGangWithPodReferences updates a PodGang with pod references and sets Initialized condition.
func (r _resource) updatePodGangWithPodReferences(sc *syncContext, podGang *groveschedulerv1alpha1.PodGang) error {
Can you split the function into smaller functions? It is too big, with a lot of logic.
Yes, I will do that
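Not the actual refactor, just a sketch of one possible split; the helper names below are hypothetical placeholders for where the existing logic could move, keeping the behavior described in this PR.

// updatePodGangWithPodReferences orchestrates the smaller steps.
func (r _resource) updatePodGangWithPodReferences(sc *syncContext, podGang *groveschedulerv1alpha1.PodGang) error {
    podReferences, err := r.collectPodReferences(sc, podGang) // gather (and sort) pod references per PodGroup
    if err != nil {
        return err
    }
    if !r.applyPodReferences(podGang, podReferences) { // returns false when nothing changed
        return nil
    }
    return r.patchInitializedStatus(sc, podGang) // set the Initialized condition and Phase, then patch the status
}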
sort.Slice(podReferences, func(i, j int) bool {
    return podReferences[i].Name < podReferences[j].Name
})
We might need to change the API field to ignore ordering, to avoid sorting a big pod list on every sync.
That's a good idea, since we have customers with over 500 pods. But in that case we won't preserve the order of the pods; I am not sure whether that is acceptable.
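A possible direction, sketched with kubebuilder-style list-type markers; the element type and field layout here are assumptions based on the snippets above, not the real API definition.

// Treating PodReferences as a map list keyed by "name" tells the API server the
// order is not significant, so producers would not need to sort it on every sync.
type PodGroup struct {
    // MinReplicas is the minimum number of pods required for this group to be schedulable.
    MinReplicas int32 `json:"minReplicas"`

    // PodReferences lists the pods belonging to this group.
    // +listType=map
    // +listMapKey=name
    PodReferences []PodReference `json:"podReferences,omitempty"`
}

// PodReference is a hypothetical stand-in for the real reference type.
type PodReference struct {
    Name      string `json:"name"`
    Namespace string `json:"namespace,omitempty"`
}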
original := updatedPodGang.DeepCopy()
setInitializedCondition(updatedPodGang, metav1.ConditionTrue, "AllPodsCreated", "All pods have been created and references populated")
updatedPodGang.Status.Phase = groveschedulerv1alpha1.PodGangPhasePending // Set phase to Pending when initialized

if err := r.client.Status().Patch(sc.ctx, updatedPodGang, client.MergeFrom(original)); err != nil {
    return groveerr.WrapError(err,
        errCodeUpdatePodGang,
        component.OperationSync,
        fmt.Sprintf("Failed to update PodGang %s status", latestPodGang.Name),
    )
}
I don't think it is the PCS's job to update the PodGang status; the status should reflect the PodGang spec's current state.
Yes, I know we shouldn't update the PodGang status. However, PodGang status is a required field; if I don't set it here, I can't even set the condition. And I use the condition to indicate whether the PodGang has all required fields, especially the pod references.
Right now the creation/update/deletion of PodGangs is done by the PCS reconciler. You are right that a single reconciler should have a primary resource which it reconciles and whose status it updates. We have in the past considered having a separate reconciler for PodGang, and back then we decided against it (diverging from the idiomatic responsibility set for a reconciler) due to the increased collaboration it would require between reconcilers.
In any case, I would say it is out of scope of this PR to consider creating a separate reconciler for PodGang. Maybe we can discuss this outside this PR.
// Update status with Initialized condition and Phase (only during initial setup)
original := updatedPodGang.DeepCopy()
setInitializedCondition(updatedPodGang, metav1.ConditionTrue, "AllPodsCreated", "All pods have been created and references populated")
updatedPodGang.Status.Phase = groveschedulerv1alpha1.PodGangPhasePending // Set phase to Pending when initialized
I think this should be defined by the podGang controller
Do we really have a PodGang controller now? I think the PCS controller currently handles the PodGang's lifecycle management.
// updatePodGangsWithPodReferences updates PodGangs with pod references after Pods are created.
// Sets PodGangInitialized condition to True when all pods are present.
func (r _resource) updatePodGangsWithPodReferences(sc *syncContext) error {
Maybe I am missing some point: instead of creating the PodGang and then updating it with PodGangInitialized when the pods are created, we could simply create it only once all the pods are created; they could be protected by the scheduling gate. What is the motivation for the current behavior?
I added some background in the PR's description. Other schedulers may need to create a CR like Workload based on the PodGang, and Workload creation doesn't need to wait until the PodGang has all the Pods' references, since the Workload doesn't need the Pods' references. So creating an empty PodGang without the Pods' references lets the Workload be created in time.
component.KindPodGang,               // Create/Update PodGang (handles both initial creation and updating with pod references)
component.KindPodClique,             // Create PodClique (which creates Pods)
component.KindPodCliqueScalingGroup, // Create PCSG
Is there a reason for changing the order?
Because the PodGang doesn't need to rely on pod creation anymore, we can create the PodGang first. This way, the customer's CR can also be created in time based on the PodGang.
const (
    // BackendName is the name of the KAI backend
    BackendName = "kai"
KAI

Suggested change:
-    BackendName = "kai"
+    BackendName = "KAI-Scheduler"
Got it
What type of PR is this?
In order to support different schedulers as backends, we modify Grove and introduce a scheduler backend interface.
What this PR does / why we need it:
In the current PodGang component's sync flow we do the following: we first create the PodClique and PodCliqueScalingGroup resources (which create the Pods), and only then create the PodGang resource. So you can see the PodGang will be created after the pods. However, there is a problem with the upcoming Workload API support and the kube-scheduler backend.

We don't want to break the current PodGang working flow. We introduce this scheduler backend framework to leave the Workload management work to a scheduler backend inside Grove. For other schedulers, the scheduler backend in Grove may manage a different CR based on the PodGang (just like KAI, which creates PodGroups; in the future, we will move this management from the KAI scheduler into the Grove scheduler backend).

To create a Workload object, you will need to create a PodGang resource. The PodGang resource cannot be created before the Pods have been created and have a back reference to the PodGang. The issue is that only after the Workload object is created will the kube-scheduler run its scoring/filtering plugins to reserve node capacity to schedule this workload's PodGroups. The Pods also need to have a reference to the Workload object in their spec.

So to accommodate the Workload API, the flow in the PodGang component needs to be changed as below (a short sketch of the Initialized condition follows this list):
- Create the PodGang with PodGroups (having empty PodReferences, as none will exist at this point) and the Initialized condition set to False.
- The creation of the PodGang will trigger the creation of the Workload object in the scheduler backend reconciler, which will use the kube scheduler backend.
- Once the Pods are created, update the PodGang with their pod references and set the Initialized condition to True.
- Remove the pod scheduling gates once the PodGang has the Initialized condition set to True - done in the PCLQ reconciler.
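As referenced in the list above, here is a minimal sketch of setting the Initialized condition on a newly created PodGang, assuming a standard []metav1.Condition in the PodGang status; the condition type, reason, and message strings are illustrative, not the PR's exact values.

import (
    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markPodGangUninitialized marks a freshly created PodGang as not yet initialized;
// the PCS reconciler later flips this to True once all pod references are populated.
func markPodGangUninitialized(podGang *groveschedulerv1alpha1.PodGang) {
    meta.SetStatusCondition(&podGang.Status.Conditions, metav1.Condition{
        Type:    "PodGangInitialized", // assumed condition type
        Status:  metav1.ConditionFalse,
        Reason:  "PodReferencesPending", // assumed reason
        Message: "PodGroups created with empty PodReferences; waiting for Pods to be created",
    })
}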
Which issue(s) this PR fixes:
Fixes #275
Special notes for your reviewer:
Does this PR introduce an API change?
Yes. We will introduce a new API: SchedulerBackend

Additional documentation e.g., enhancement proposals, usage docs, etc.: