Support for the execution policy API in JobSet #672

Open
@andreyvelich

Description

What would you like to be added:

A new executionPolicy API that allows replicated Jobs to be submitted in order.
Once the first replicated Jobs reach the required condition, the next replicated Jobs are created.

Note: Complex DAG workflow capability is out of scope for this API, since we don't want to implement workflow functionality as part of this KEP. Users who need it should consider Argo Workflows or Tekton Pipelines.

The initial API design:

type JobSetSpec struct {
	ExecutionPolicy *ExecutionPolicy `json:"executionPolicy,omitempty"`
}

type ExecutionPolicy struct {
	// Order in which Jobs will be created. The default is AnyOrder.
	ExecutionPolicyOrder ExecutionPolicyOption `json:"executionPolicyOrder"`

	// After all replicated Jobs reach this status, the JobSet will create the next replicated Jobs.
	ReplicatedJobsStatus ReplicatedJobsStatusOption `json:"replicatedJobsStatus"`
}

type ExecutionPolicyOption string

const (
	AnyOrder ExecutionPolicyOption = "AnyOrder"

	InOrder ExecutionPolicyOption = "InOrder"
)

type ReplicatedJobsStatusOption string

// We don't add Ready condition here, since users can use the `startupPolicy` API for that.
const (
	SucceededStatus ReplicatedJobsStatusOption = "Succeeded"

	FailedStatus ReplicatedJobsStatusOption = "Failed"

	ActiveStatus ReplicatedJobsStatusOption = "Active"

	SuspendedStatus ReplicatedJobsStatusOption = "Suspended"
)
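
To make the intended InOrder behavior concrete, here is a minimal sketch of the gating check. It is not the JobSet controller code: the `replicatedJob` struct and the `jobsToCreate` helper are hypothetical names introduced only for illustration, and the sketch assumes each replicated Job group can be summarized by a single observed status.

```go
package main

import "fmt"

// The option types repeat the proposal above so the sketch compiles on its own.
type ExecutionPolicyOption string

const (
	AnyOrder ExecutionPolicyOption = "AnyOrder"
	InOrder  ExecutionPolicyOption = "InOrder"
)

type ReplicatedJobsStatusOption string

type ExecutionPolicy struct {
	ExecutionPolicyOrder ExecutionPolicyOption
	ReplicatedJobsStatus ReplicatedJobsStatusOption
}

// replicatedJob is a hypothetical, simplified view of one replicated Job
// group: its name, whether its Jobs exist yet, and the status they reached.
type replicatedJob struct {
	Name    string
	Created bool
	Status  ReplicatedJobsStatusOption
}

// jobsToCreate (hypothetical helper) returns the names of the replicated Jobs
// that may be created now. With InOrder, a group is eligible only after the
// group listed before it was created and reached the required status.
// With AnyOrder or no policy, every group not yet created is eligible.
func jobsToCreate(policy *ExecutionPolicy, jobs []replicatedJob) []string {
	inOrder := policy != nil && policy.ExecutionPolicyOrder == InOrder
	var eligible []string
	for i, rj := range jobs {
		if inOrder && i > 0 {
			prev := jobs[i-1]
			if !prev.Created || prev.Status != policy.ReplicatedJobsStatus {
				break // an earlier group has not reached the required status
			}
		}
		if !rj.Created {
			eligible = append(eligible, rj.Name)
		}
	}
	return eligible
}

func main() {
	policy := &ExecutionPolicy{
		ExecutionPolicyOrder: InOrder,
		ReplicatedJobsStatus: "Succeeded",
	}
	jobs := []replicatedJob{
		{Name: "initializer", Created: true, Status: "Succeeded"},
		{Name: "trainer"},
		{Name: "post-processor"},
	}
	// Only the trainer is eligible: the post-processor waits for the trainer.
	fmt.Println(jobsToCreate(policy, jobs))
}
```

The check is deliberately linear (each group waits only on the one listed before it), which matches the note above that DAG-style dependencies are out of scope.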

Why is this needed:

More context in this Kubernetes wg-batch thread: https://kubernetes.slack.com/archives/C032ZE66A2X/p1725400839102729

As part of the Kubeflow Training V2 APIs, we want to implement LLM runtimes for LLM fine-tuning: kubeflow/training-operator#2170
That will require JobSet to orchestrate a sequence of 2-3 Jobs: Initializer -> Trainer -> Post-Processor.
Capacity for such a workload should be allocated for all Jobs combined and managed by Kueue.
When TrainJob is suspended, we will suspend all underlying Jobs.

I think we might have more use cases from the HPC side. Any thoughts, @vsoch @alculquicondor?

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

cc @tenzen-y @kannon92 @ahg-g @johnugeorge @akshaychitneni @shravan-achar

Labels

kind/api-change, kind/feature
