Enhance grove to support training jobs #285

@renormalize

Description

What you would like to be added?

Grove in its current form does not support training jobs through the PodCliqueSet API. It inherently only supports inference workloads, since the PodClique controller is not designed to handle Pods that successfully exit after their workload finishes.

We need to enhance Grove in the following areas to support training:

  • Add a state field to the PodCliqueSetStatus that records the phase of the workload, e.g. Pending, Running, Completed, Failed.
  • Enhance the PodClique controller to be aware of the lifecycle of job-like workloads, not just long-running inference applications. Pods that exit successfully should not cause the controller to recreate them.
  • Add a maxRuntime field in a relevant place in the PodCliqueSet to specify how long the user expects their training workload to take. The semantics of this field can be discussed further: if the workload does not finish within this duration, we could fire alerts to inform the user and/or mark the workload as Failed.
  • Support a maxRestarts field in a relevant place in the PodCliqueSet to indicate the number of pod restarts the operator must tolerate before it marks the workload as Failed.
  • Disable rolling updates for workloads that are jobs, since rolling the pods of a running job does not make much sense.
  • Disable autoscaling, which is currently supported at all levels, for such workloads, since the consumer should know their workload and capacity before they start training.

Why is this needed?

Interest in Grove, and its adoption, is growing in the community. Grove has several unique features that the community would want to use not just for inference workloads, but also for training workloads.

Labels: enhancement (New feature or request)
