Open
Labels
enhancement: New feature or request
Description
What you would like to be added?
Grove in its current form does not support training jobs through the PodCliqueSet API. It inherently only supports inference workloads, since the PodClique controller is not designed to handle Pods that successfully exit after their workload finishes.
We need to enhance Grove in the following areas to support training:
- Add a `state` field to the `PodCliqueSetStatus` which embeds the phase of the workload, e.g. `Pending`, `Running`, `Completed`, `Failed`.
- Enhance the `PodClique` controller to be aware of the lifecycle of job-like workloads, not just long-running inference applications. Pods that exit successfully should not cause the controller to recreate them.
- Specify a `maxRuntime` field in a relevant place in the `PodCliqueSet` to indicate how long the user expects their training workload to take. The semantics of this field can be discussed further: if the workload does not finish within this duration, we could fire alerts to inform the user, and/or mark the workload as `Failed`.
- Support a `maxRestarts` field in a relevant place in the `PodCliqueSet` to indicate the number of pod restarts the operator must tolerate before it marks the workload as `Failed`.
- Disable rolling updates for workloads that are jobs, since rolling pods makes little sense in such a scenario.
- Disable the autoscaling that is supported at all levels, since the consumer should, and will, know their workload and capacity before they start training.
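
To make the first four points concrete, here is a rough sketch of how the phase could be derived. None of this is existing Grove API: `WorkloadPhase`, `JobPolicy`, `PodObservation`, `ComputePhase` and `ShouldRecreatePod` are hypothetical names, `maxRestarts` is interpreted as a per-pod limit, and exceeding `maxRuntime` is mapped to `Failed`, which is only one of the options mentioned above.

```go
package main

import "time"

// WorkloadPhase is a hypothetical type for the proposed PodCliqueSetStatus state field.
type WorkloadPhase string

const (
	PhasePending   WorkloadPhase = "Pending"
	PhaseRunning   WorkloadPhase = "Running"
	PhaseCompleted WorkloadPhase = "Completed"
	PhaseFailed    WorkloadPhase = "Failed"
)

// JobPolicy is a hypothetical holder for the proposed maxRuntime and
// maxRestarts fields. Where they live in the PodCliqueSet is still open.
type JobPolicy struct {
	MaxRuntime  time.Duration // zero means "no limit" in this sketch
	MaxRestarts int           // restarts of a single pod that are tolerated
}

// PodObservation is a hypothetical summary of the pods owned by the workload.
type PodObservation struct {
	Total          int // pods the workload expects
	Running        int // pods currently running
	Succeeded      int // pods that exited successfully
	MaxPodRestarts int // highest restart count seen on any single pod
}

// ComputePhase derives the workload phase from the policy, the observed pods
// and the time elapsed since the workload started.
func ComputePhase(p JobPolicy, o PodObservation, elapsed time.Duration) WorkloadPhase {
	switch {
	case o.Total > 0 && o.Succeeded == o.Total:
		// Every pod finished successfully: the job is done.
		return PhaseCompleted
	case o.MaxPodRestarts > p.MaxRestarts:
		// A pod restarted more often than the operator must tolerate.
		return PhaseFailed
	case p.MaxRuntime > 0 && elapsed > p.MaxRuntime:
		// Assumption: overrunning maxRuntime fails the workload.
		// Firing an alert instead is the other option under discussion.
		return PhaseFailed
	case o.Running > 0 || o.Succeeded > 0:
		return PhaseRunning
	default:
		return PhasePending
	}
}

// ShouldRecreatePod captures the controller change: a pod that exited
// successfully is recreated for long-running workloads, but not for jobs.
func ShouldRecreatePod(isJob, podSucceeded bool) bool {
	return !(isJob && podSucceeded)
}
```

With this shape, the `PodClique` controller would only need to consult `ShouldRecreatePod` before replacing a terminated pod, and the status update path would write the result of `ComputePhase` into the new `state` field.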
Why is this needed?
Interest in Grove and its adoption are growing in the community. Several features are unique to Grove, and the community would want to use them not just for inference workloads but also for training workloads.