
Support @parallel on Kubernetes with jobsets #1744

Closed
wants to merge 1 commit

Conversation

shrinandj
Contributor

This commit adds support for @parallel when flows are run `--with kubernetes`. Support for Argo workflows will follow in a separate commit.

A user can run a flow with the following:

    @step
    def start(self):
        self.next(self.parallel_step, num_parallel=3)

    @kubernetes(cpu=1, memory=512)
    @parallel
    @step
    def parallel_step(self):
        ...

Testing Done:

  • Ran a flow with @parallel on Kubernetes and verified that it works correctly.
  • Ran a flow without @parallel on Kubernetes and verified that it works as expected.
  • Verified that a jobset-based @parallel step is scaled down if the user kills the run with Ctrl-C.
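The Ctrl-C behavior in the last test can be sketched as follows. This is a hypothetical illustration, not Metaflow's actual code: `wait_for_jobset` and `delete_jobset` stand in for the real Kubernetes API calls.

```python
# Hypothetical sketch of the Ctrl-C scale-down verified above: the local
# runtime traps KeyboardInterrupt while waiting on the jobset and deletes
# it before exiting, so no orphaned pods keep running.
def run_with_cleanup(wait_for_jobset, delete_jobset, name):
    """Wait on a jobset; on Ctrl-C, tear the jobset down before re-raising."""
    try:
        return wait_for_jobset(name)
    except KeyboardInterrupt:
        delete_jobset(name)  # scale down / remove the jobset
        raise

deleted = []

def fake_wait(name):
    raise KeyboardInterrupt  # simulate the user pressing Ctrl-C mid-run

try:
    run_with_cleanup(fake_wait, deleted.append, "mf-jobset-123")
except KeyboardInterrupt:
    pass
print(deleted)  # ['mf-jobset-123']
```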

@shrinandj shrinandj closed this Feb 19, 2024
@shrinandj shrinandj force-pushed the shri/support-jobsets branch from f391613 to 986e10e on February 19, 2024 18:07
@shrinandj shrinandj reopened this Feb 19, 2024
@romain-intel
Contributor

Mergeable anytime from my end -- no impact on core.

@shrinandj shrinandj force-pushed the shri/support-jobsets branch from 76b6310 to 00720cc on March 4, 2024 16:38
@shrinandj
Contributor Author

More testing will be done on this PR once #1745 is merged; distributed training workloads communicate with each other over an open port.

valayDave pushed a commit to valayDave/metaflow that referenced this pull request Mar 27, 2024
Linked to upstream : [Netflix#1744]

valayDave pushed a commit to valayDave/metaflow that referenced this pull request Apr 18, 2024
Implementation originates from [Netflix#1744]

Changes to original Implementation:
- Pass down the ports of jobsets
- Ensured that `ubf_context` is set correctly
- Ensured that `split-index` is set correctly based on the type of task (control vs worker)
- Fixed a bug in the RANK setting. The earlier implementation set `parallelism` on the created `replicatedJobs`.
    - In this implementation, we create a separate copy of the job for each replicated worker.
    - So retrieving the rank from the Kubernetes V1EnvVar.valueFrom (metadata.annotations['batch.kubernetes.io/job-completion-index']) won't work,
    - since `job-completion-index` relies on `parallelism` being set on the `job_spec`.
    Instead, we now set `RANK` statically from the index in the iterator that defines the jobs.

Link To Upstream : [COMING SOON!]
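The static RANK assignment described above can be illustrated with a small sketch. All names here are hypothetical, not Metaflow's actual internals: since each worker is its own replicated job, RANK comes from the enumeration index while the job specs are built, not from `job-completion-index`.

```python
# Illustrative sketch of the static RANK assignment: RANK is baked into each
# job's environment from the loop index, per the fix described above.
def build_job_specs(num_parallel):
    """One spec per task; in this sketch, rank 0 plays the control role."""
    specs = []
    for rank in range(num_parallel):
        specs.append({
            "name": f"job-{rank}",
            "env": {"RANK": str(rank)},  # set statically at spec-build time
        })
    return specs

ranks = [s["env"]["RANK"] for s in build_job_specs(3)]
print(ranks)  # ['0', '1', '2']
```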
valayDave pushed a commit to valayDave/metaflow that referenced this pull request Apr 18, 2024
valayDave added a commit to valayDave/metaflow that referenced this pull request Apr 18, 2024
@shrinandj shrinandj closed this Apr 18, 2024
valayDave pushed a commit to valayDave/metaflow that referenced this pull request May 8, 2024
valayDave pushed a commit to valayDave/metaflow that referenced this pull request May 8, 2024
valayDave pushed a commit to valayDave/metaflow that referenced this pull request May 10, 2024
valayDave added a commit to valayDave/metaflow that referenced this pull request May 13, 2024

Some notes about the implementation:

- No task-id annotations on pods, since we cannot construct the task-id dynamically at K8s container runtime.
- @catch is currently not supported with @parallel on Kubernetes.
- The jobset name is recorded in the task metadata.
- The jobset contains two job definitions: one for the control task and one for workers.
- The worker job is created with n-1 replicas.
- We construct the worker task-id deterministically using naming conventions and shell hacking.
- The jobset is considered running if at least one of its jobs is running.
- @retry works with jobsets.
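Two of the notes above can be sketched in a few lines. Both functions and the naming convention are hypothetical illustrations, not Metaflow's actual scheme.

```python
# Sketch of the deterministic worker task-id and the "running if any job
# runs" rule described above. Names and the id format are assumptions.
def worker_task_id(control_task_id, worker_index):
    # Derived purely from the control task-id and the worker's index, so a
    # pod can compute its own task-id in shell without a metadata lookup.
    return f"{control_task_id}-worker-{worker_index}"

def jobset_is_running(job_statuses):
    # The jobset counts as running if at least one of its jobs is running.
    return any(status == "running" for status in job_statuses)

print(worker_task_id("t-42", 2))                          # t-42-worker-2
print(jobset_is_running(["pending", "running", "failed"]))  # True
```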
valayDave added a commit to valayDave/metaflow that referenced this pull request May 20, 2024
- num_parallel <= 1 will NOT be supported to start with:
    - One core reason is that jobsets don't allow setting replicas to 0;
    - the jobset controller will mutate a jobset with replicas set to 0 so that replicas becomes 1.
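The guard implied above can be written as a small validation step. This is a hypothetical sketch, not Metaflow's actual validation code:

```python
# Since the jobset controller rewrites replicas=0 to replicas=1, values of
# num_parallel <= 1 are rejected up front rather than silently mis-scheduled.
def validate_num_parallel(num_parallel):
    if num_parallel <= 1:
        raise ValueError("@parallel on Kubernetes requires num_parallel > 1")
    return num_parallel
```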
valayDave added a commit to valayDave/metaflow that referenced this pull request May 20, 2024
- The implementation accounts for the Jobset CRD schema from v0.2.0.
    - The Jobset team changed the schema (renaming values) after v0.3.0.
    - Certain fields were added to the replicated-jobs status, and `ReplicatedJobsStatus` was renamed to `replicatedJobsStatus`.
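Tolerating the rename described above amounts to accepting either key when reading the status. A minimal sketch, assuming plain-dict status objects (the helper name is illustrative):

```python
# Read replicated-job status across JobSet CRD versions: pre-rename objects
# expose `ReplicatedJobsStatus`, post-rename ones use `replicatedJobsStatus`.
def replicated_jobs_status(jobset_status):
    for key in ("replicatedJobsStatus", "ReplicatedJobsStatus"):
        if key in jobset_status:
            return jobset_status[key]
    return []  # no status reported yet

old_style = {"ReplicatedJobsStatus": [{"name": "worker", "ready": 2}]}
new_style = {"replicatedJobsStatus": [{"name": "worker", "ready": 2}]}
print(replicated_jobs_status(old_style) == replicated_jobs_status(new_style))  # True
```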
savingoyal pushed a commit that referenced this pull request May 20, 2024