Description
What would you like to be added:
A new `executionPolicy` API that allows replicated Jobs to be submitted in order.
Once the first replicated Jobs have reached the required condition, the next replicated Jobs are created.
Note: Complex DAG workflow capability is out of scope for this API, since we don't want to implement workflow functionality as part of this KEP. Users who need it should consider using Argo Workflows or Tekton Pipelines.
The initial API design:
type JobSetSpec struct {
	ExecutionPolicy *ExecutionPolicy `json:"executionPolicy,omitempty"`
}

type ExecutionPolicy struct {
	// Order in which Jobs will be created. The default is AnyOrder.
	ExecutionPolicyOrder ExecutionPolicyOption `json:"executionPolicyOrder"`
	// After all replicated Jobs reach this status, the JobSet will create the next replicated Jobs.
	ReplicatedJobsStatus ReplicatedJobsStatusOption `json:"replicatedJobsStatus"`
}

type ExecutionPolicyOption string

const (
	AnyOrder ExecutionPolicyOption = "AnyOrder"
	InOrder  ExecutionPolicyOption = "InOrder"
)

type ReplicatedJobsStatusOption string

// We don't add a Ready condition here, since users can use the `startupPolicy` API for that.
const (
	SucceededStatus ReplicatedJobsStatusOption = "Succeeded"
	FailedStatus    ReplicatedJobsStatusOption = "Failed"
	ActiveStatus    ReplicatedJobsStatusOption = "Active"
	SuspendedStatus ReplicatedJobsStatusOption = "Suspended"
)
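To make the proposed shape concrete, here is a minimal sketch of how the `executionPolicy` field might serialize, using only the JSON tags from the design above. The `renderSpec` helper is hypothetical and exists only for illustration, and the constant for the "Succeeded" value is named `SucceededStatus` in this sketch. This is not the final API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type ExecutionPolicyOption string

const (
	AnyOrder ExecutionPolicyOption = "AnyOrder"
	InOrder  ExecutionPolicyOption = "InOrder"
)

type ReplicatedJobsStatusOption string

// Assumption: constant name chosen for this sketch; the value comes from the proposal.
const SucceededStatus ReplicatedJobsStatusOption = "Succeeded"

type ExecutionPolicy struct {
	ExecutionPolicyOrder ExecutionPolicyOption      `json:"executionPolicyOrder"`
	ReplicatedJobsStatus ReplicatedJobsStatusOption `json:"replicatedJobsStatus"`
}

// JobSetSpec is reduced to the one proposed field; all other fields are left out.
type JobSetSpec struct {
	ExecutionPolicy *ExecutionPolicy `json:"executionPolicy,omitempty"`
}

// renderSpec is a hypothetical helper that returns the JSON form of a spec,
// or an empty string if marshaling fails.
func renderSpec(spec JobSetSpec) string {
	out, err := json.Marshal(spec)
	if err != nil {
		return ""
	}
	return string(out)
}

func main() {
	// With no policy set, the field is omitted entirely because of omitempty.
	fmt.Println(renderSpec(JobSetSpec{}))

	// An in-order policy that waits for the previous replicated Jobs to succeed.
	fmt.Println(renderSpec(JobSetSpec{
		ExecutionPolicy: &ExecutionPolicy{
			ExecutionPolicyOrder: InOrder,
			ReplicatedJobsStatus: SucceededStatus,
		},
	}))
}
```

Because the pointer field uses `omitempty`, existing JobSets that never set the policy would serialize unchanged, which is one reason an optional pointer is a reasonable choice here.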
Why is this needed:
More context in this Kubernetes wg-batch thread: https://kubernetes.slack.com/archives/C032ZE66A2X/p1725400839102729
As part of the Kubeflow Training V2 APIs, we want to implement LLM runtimes for LLM fine-tuning: kubeflow/training-operator#2170
That will require JobSet to orchestrate a sequence of 2-3 Jobs: Initializer -> Trainer -> Post-Processor.
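The Initializer -> Trainer -> Post-Processor sequence could be gated roughly as in the sketch below. It assumes, as a simplification, that each replicated Job reports a single status and that a replicated Job missing from the status map has not been created yet. The `nextToCreate` function and the lowercase Job names are hypothetical and are not part of JobSet.

```go
package main

import "fmt"

type ReplicatedJobsStatusOption string

const (
	SucceededStatus ReplicatedJobsStatusOption = "Succeeded"
	ActiveStatus    ReplicatedJobsStatusOption = "Active"
)

// nextToCreate is a hypothetical helper for the InOrder case. It walks the
// replicated Jobs in their declared order and returns the first one that has
// not been created yet, provided every earlier one has reached the required
// status. It returns "" when an earlier Job is still pending or when all Jobs
// have already been created.
func nextToCreate(order []string, statuses map[string]ReplicatedJobsStatusOption, required ReplicatedJobsStatusOption) string {
	for _, name := range order {
		status, created := statuses[name]
		if !created {
			return name
		}
		if status != required {
			// An earlier replicated Job has not reached the required status yet.
			return ""
		}
	}
	return ""
}

func main() {
	order := []string{"initializer", "trainer", "post-processor"}

	// Nothing exists yet, so the first replicated Job is created.
	fmt.Println(nextToCreate(order, map[string]ReplicatedJobsStatusOption{}, SucceededStatus))

	// The initializer is still running, so nothing new is created.
	fmt.Println(nextToCreate(order, map[string]ReplicatedJobsStatusOption{
		"initializer": ActiveStatus,
	}, SucceededStatus))

	// The initializer succeeded, so the trainer is next.
	fmt.Println(nextToCreate(order, map[string]ReplicatedJobsStatusOption{
		"initializer": SucceededStatus,
	}, SucceededStatus))
}
```

A real controller would derive these statuses from the per-replica counts of each replicated Job rather than from a single value; the sketch only illustrates the ordering rule.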
Capacity for such a workload should be allocated for all Jobs combined and controlled by Kueue.
When a `TrainJob` is suspended, we will suspend all underlying Jobs.
I think we might have more use cases from the HPC side. Any thoughts, @vsoch @alculquicondor?
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.
cc @tenzen-y @kannon92 @ahg-g @johnugeorge @akshaychitneni @shravan-achar