|
---
layout: blog
title: "Introducing JobSet"
date: 2025-02-03
slug: introducing-jobset
draft: true
---

**Authors**: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat)

In this article, we introduce [JobSet](https://jobset.sigs.k8s.io/), an open source API for
representing distributed jobs. The goal of JobSet is to provide a unified API for distributed ML
training and HPC workloads on Kubernetes.

## Why JobSet?

The Kubernetes community’s recent enhancements to the batch ecosystem have attracted ML engineers,
who have found Kubernetes to be a natural fit for the requirements of running distributed training
workloads.

Large ML models (particularly LLMs) which cannot fit into the memory of the GPU or TPU chips on a
single host are often distributed across tens of thousands of accelerator chips, which in turn may
span thousands of hosts.

As such, the model training code is often containerized and executed simultaneously on all these
hosts, sharding the model parameters and/or the training dataset across the target accelerator
chips and using collective communication primitives like all-gather and all-reduce to perform
distributed computations and synchronize gradients between hosts.

These characteristics make Kubernetes a great fit for this type of workload, as efficiently
scheduling and managing the lifecycle of containerized applications across a cluster of compute
resources is an area where it shines.

It is also very extensible, allowing developers to define their own Kubernetes APIs, objects, and
controllers which manage the behavior and life cycle of these objects, so engineers can develop
custom distributed training orchestration solutions to fit their needs.

However, as distributed ML training techniques continue to evolve, existing Kubernetes primitives
alone no longer model them adequately.

Furthermore, the landscape of Kubernetes distributed training orchestration APIs has become
fragmented, and each of the existing solutions has certain limitations that make it suboptimal
for distributed ML training.

For example, the Kubeflow training operator defines custom APIs for different frameworks (e.g.
PyTorchJob, TFJob, MPIJob, etc.); however, each of these job types is in fact a solution tailored
specifically to the target framework, each with different semantics and behavior.

On the other hand, the Job API fixed many gaps for running batch workloads, including Indexed
completion mode, higher scalability, Pod failure policies, and Pod backoff policy, to mention a few
of the most recent enhancements. However, running ML training and HPC workloads using the upstream
Job API requires extra orchestration to fill the following gaps:

Multi-template Pods
: Most HPC or ML training jobs include more than one type of Pod. The different Pods are part of
  the same workload, but they need to run a different container, request different resources, or
  have different failure policies. A common example is the driver-worker pattern.

Job groups
: Large scale training workloads span multiple network topologies, running across multiple racks
  for example. Such workloads are network latency sensitive, and aim to localize communication and
  minimize traffic crossing the higher-latency network links. To facilitate this, the workload
  needs to be split into groups of Pods, each assigned to a network topology.

Inter-Pod communication
: Create and manage the resources (e.g. [headless
  Services](/docs/concepts/services-networking/service/#headless-services)) necessary to establish
  communication between the Pods of a job.

Startup sequencing
: Some jobs require a specific start sequence of pods; sometimes the driver is expected to start
  first (like Ray or Spark), in other cases the workers are expected to be ready before starting
  the driver (like MPI).

JobSet aims to address those gaps using the Job API as a building block to build a richer API for
large-scale distributed HPC and ML use cases.

## How JobSet works

JobSet models a distributed batch workload as a group of Kubernetes Jobs. This allows a user to
easily specify different pod templates for distinct groups of pods (e.g. a leader, workers,
parameter servers, etc.).

It uses the abstraction of a ReplicatedJob to manage child Jobs, where a ReplicatedJob is
essentially a Job template with some desired number of Job replicas specified. This provides a
declarative way to easily create identical child Jobs to run on different islands of accelerators,
without resorting to scripting or Helm charts to generate many versions of the same job but with
different names.

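For illustration, a minimal sketch of this structure is shown below: one ReplicatedJob for a single
driver and another stamped out four times for the workers. The names, replica counts, and container
image here are hypothetical placeholders; the full, runnable example appears later in this post.

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-jobset            # hypothetical name, for illustration only
spec:
  replicatedJobs:
  - name: driver                   # one ReplicatedJob for the single driver pod
    replicas: 1
    template:                      # an ordinary batch/v1 Job template
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: registry.example.com/trainer:latest   # placeholder image
  - name: workers                  # an identical child Job is created per accelerator island
    replicas: 4
    template:
      spec:
        parallelism: 2
        completions: 2
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: registry.example.com/trainer:latest   # placeholder image
```
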
{{< figure src="jobset_diagram.svg" alt="JobSet Architecture" class="diagram-large" clicktozoom="true" >}}

Some other key JobSet features which address the problems described above include:

Replicated Jobs
: In modern data centers, hardware accelerators like GPUs and TPUs are allocated in islands of
  homogeneous accelerators connected via specialized, high-bandwidth network links. For example, a
  user might provision nodes containing a group of hosts co-located on a rack, each with H100
  GPUs, where GPU chips within each host are connected via NVLink, with an NVLink Switch connecting
  the multiple NVLinks. TPU Pods are another example of this: TPU ViperLitePods consist of 64
  hosts, each with 4 TPU v5e chips attached, all connected via an ICI mesh. When running a
  distributed training job across multiple of these islands, we often want to partition the
  workload into a group of smaller identical jobs, 1 per island, where each pod primarily
  communicates with the pods within the same island to do segments of distributed computation,
  keeping the gradient synchronization over DCN (data center network, which is lower bandwidth
  than ICI) to a bare minimum.

Automatic headless service creation, configuration, and lifecycle management
: Pod-to-pod communication via pod hostname is enabled by default, with automatic configuration
  and lifecycle management of the headless service enabling this.

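If you need to adjust this behavior, the JobSet spec also exposes a network section. The fragment
below is a sketch based on the v1alpha2 API; the `enableDNSHostnames` and `subdomain` field names
and the default subdomain are worth verifying against the current API reference.

```yaml
spec:
  network:
    enableDNSHostnames: true    # create the headless Service and publish per-pod hostnames
    subdomain: training-net     # optional custom subdomain; by default the JobSet name is used
```
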
Configurable success policies
: JobSet has configurable success policies which target specific ReplicatedJobs, with operators to
  target “Any” or “All” of their child jobs. For example, you can configure the JobSet to be marked
  complete if and only if all pods that are part of the “worker” ReplicatedJob are completed.

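A sketch of that configuration (field names per the v1alpha2 API), using the "workers"
ReplicatedJob name from the example later in this post:

```yaml
spec:
  successPolicy:
    operator: All               # every targeted child Job must complete
    targetReplicatedJobs:
    - workers                   # only Jobs of the "workers" ReplicatedJob count toward success
```
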
Configurable failure policies
: JobSet has configurable failure policies which allow the user to specify a maximum number of
  times the JobSet should be restarted in the event of a failure. If any job is marked failed, the
  entire JobSet will be recreated, allowing the workload to resume from the last checkpoint. When
  no failure policy is specified, if any job fails, the JobSet simply fails.

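A minimal sketch; the same field appears in the full example later in this post:

```yaml
spec:
  failurePolicy:
    maxRestarts: 3   # recreate the child Jobs up to 3 times before marking the JobSet failed
```
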
Exclusive placement per topology domain
: JobSet allows users to express that child jobs have 1:1 exclusive assignment to a topology
  domain, typically an accelerator island like a rack. For example, if the JobSet creates two child
  jobs, then this feature will enforce that the pods of each child job will be co-located on the
  same island, and that only one child job is allowed to schedule per island. This is useful for
  scenarios where we want to use a distributed data parallel (DDP) training strategy to train a
  model using multiple islands of compute resources (GPU racks or TPU slices), running 1 model
  replica in each accelerator island, ensuring the forward and backward passes within a single
  model replica occur over the high-bandwidth interconnect linking the accelerator chips within the
  island, and only the gradient synchronization between model replicas occurs across accelerator
  islands over the lower-bandwidth data center network.

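This is expressed as an annotation naming the node label that defines the topology domain, as in
the sketch below (the same annotation appears in the full example later in this post). Using
`cloud.google.com/gke-nodepool` assumes a GKE cluster where each node pool corresponds to one
accelerator island.

```yaml
metadata:
  annotations:
    # 1:1 mapping between child Jobs and topology domains, keyed by this node label
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
```
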
Integration with Kueue
: Users can submit JobSets via [Kueue](https://kueue.sigs.k8s.io/) to oversubscribe their clusters,
  queue workloads to run as capacity becomes available, prevent partial scheduling and deadlocks,
  enable multi-tenancy, and more.

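Submitting a JobSet through Kueue is typically just a matter of labeling it with the target queue.
The sketch below assumes a Kueue LocalQueue named `user-queue` already exists in the namespace
(a hypothetical name), and leaves the rest of the JobSet spec unchanged:

```yaml
metadata:
  name: multislice
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # hypothetical LocalQueue; Kueue admits the JobSet when quota allows
```
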
## Example use case

### Distributed ML training on multiple TPU slices with Jax

The following example is a JobSet spec for running a TPU Multislice workload on 4 TPU v5e
[slices](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#slices). To learn more about
TPU concepts and terminology, please refer to these
[docs](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm).

This example uses [Jax](https://jax.readthedocs.io/en/latest/quickstart.html), an ML framework with
native support for Just-In-Time (JIT) compilation targeting TPU chips via
[OpenXLA](https://github.com/openxla). However, you can also use
[PyTorch/XLA](https://pytorch.org/xla/release/2.3/index.html) to do ML training on TPUs.

This example makes use of several JobSet features (both explicitly and implicitly) to support the
unique scheduling requirements of TPU multislice training out-of-the-box with very little
configuration required by the user.

```yaml
# Run a simple Jax workload on multiple TPU slices
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice
  annotations:
    # Give each child Job exclusive usage of a TPU slice
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: workers
    replicas: 4 # Set to number of TPU slices
    template:
      spec:
        parallelism: 2 # Set to number of VMs per TPU slice
        completions: 2 # Set to number of VMs per TPU slice
        backoffLimit: 0
        template:
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
            containers:
            - name: jax-tpu
              image: python:3.8
              ports:
              - containerPort: 8471
              - containerPort: 8080
              securityContext:
                privileged: true
              command:
              - bash
              - -c
              - |
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count())'
                sleep 60
              resources:
                limits:
                  google.com/tpu: 4
```

## Future work and getting involved

We have a number of features planned for development this year, which can be found in the
[JobSet roadmap](https://github.com/kubernetes-sigs/jobset?tab=readme-ov-file#roadmap).

Please feel free to reach out with feedback of any kind. We’re also open to additional
contributors, whether to report and fix bugs, help add new features, or write documentation.

You can get in touch with us via our [repo](http://sigs.k8s.io/jobset), [mailing
list](https://groups.google.com/a/kubernetes.io/g/wg-batch) or on
[Slack](https://kubernetes.slack.com/messages/wg-batch).

Last but not least, thanks to all [our
contributors](https://github.com/kubernetes-sigs/jobset/graphs/contributors) who made this project
possible!