Commit fea8d97 (2 parents 5f71889 + 4651b33)
Merge pull request #45759 from danielvegamyhre/jobset: Blog Post: Introducing JobSet
2 files changed: +209 -0 lines
---
layout: blog
title: "Introducing JobSet"
date: 2025-02-03
slug: introducing-jobset
draft: true
---

**Authors**: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat)

In this article, we introduce [JobSet](https://jobset.sigs.k8s.io/), an open source API for representing distributed jobs. The goal of JobSet is to provide a unified API for distributed ML training and HPC workloads on Kubernetes.

## Why JobSet?

The Kubernetes community’s recent enhancements to the batch ecosystem have attracted ML engineers, who have found it to be a natural fit for the requirements of running distributed training workloads.

Large ML models (particularly LLMs) which cannot fit into the memory of the GPU or TPU chips on a single host are often distributed across tens of thousands of accelerator chips, which in turn may span thousands of hosts.

As such, the model training code is often containerized and executed simultaneously on all these hosts, performing distributed computations which shard the model parameters and/or the training dataset across the target accelerator chips, using collective communication primitives like all-gather and all-reduce to synchronize gradients between hosts.
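As a toy illustration of the all-reduce collective mentioned above (plain Python, no ML framework; the helper name `all_reduce_sum` is ours, not part of any library):

```python
def all_reduce_sum(shards):
    """Simulate an all-reduce: every participant ends up with the global sum."""
    total = sum(shards)           # reduction step: combine all partial values
    return [total] * len(shards)  # broadcast step: each host receives the result

# Four hosts each hold a partial gradient value
partial_grads = [1.0, 2.0, 3.0, 4.0]
print(all_reduce_sum(partial_grads))  # [10.0, 10.0, 10.0, 10.0]
```

In real training, frameworks perform this collective over the accelerator interconnect rather than in Python, but the data flow is the same.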

These characteristics make Kubernetes a great fit for this type of workload, as efficiently scheduling and managing the lifecycle of containerized applications across a cluster of compute resources is an area where it shines.

It is also very extensible: developers can define their own Kubernetes APIs, objects, and controllers which manage the behavior and life cycle of these objects, allowing engineers to develop custom distributed training orchestration solutions that fit their needs.

However, as distributed ML training techniques continue to evolve, existing Kubernetes primitives can no longer adequately model them on their own.

Furthermore, the landscape of Kubernetes distributed training orchestration APIs has become fragmented, and each of the existing solutions in this fragmented landscape has certain limitations that make it suboptimal for distributed ML training.

For example, the Kubeflow training operator defines custom APIs for different frameworks (e.g. PyTorchJob, TFJob, MPIJob, etc.); however, each of these job types is in fact a solution tailored specifically to the target framework, each with different semantics and behavior.

On the other hand, the Job API fixed many gaps for running batch workloads, including Indexed completion mode, higher scalability, Pod failure policies, and Pod backoff policy, to mention a few of the most recent enhancements. However, running ML training and HPC workloads using the upstream Job API requires extra orchestration to fill the following gaps:

**Multi-template Pods**: Most HPC or ML training jobs include more than one type of Pod. The different Pods are part of the same workload, but they need to run a different container, request different resources, or have different failure policies. A common example is the driver-worker pattern.

**Job groups**: Large scale training workloads span multiple network topologies, running across multiple racks for example. Such workloads are network latency sensitive, and aim to localize communication and minimize traffic crossing the higher-latency network links. To facilitate this, the workload needs to be split into groups of Pods, each assigned to a network topology.

**Inter-Pod communication**: Create and manage the resources (e.g. [headless Services](/docs/concepts/services-networking/service/#headless-services)) necessary to establish communication between the Pods of a job.

**Startup sequencing**: Some jobs require a specific start sequence of pods; sometimes the driver is expected to start first (like Ray or Spark), in other cases the workers are expected to be ready before starting the driver (like MPI).

JobSet aims to address those gaps using the Job API as a building block to build a richer API for large-scale distributed HPC and ML use cases.

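To make the driver-worker gap above concrete, here is a minimal, hypothetical sketch of how JobSet models it (the names `driver` and `workers`, the `busybox` image, and the replica counts are illustrative, not from the original post):

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: driver-workers        # hypothetical name
spec:
  replicatedJobs:
  - name: driver              # one Pod template for the driver
    replicas: 1
    template:
      spec:
        parallelism: 1
        completions: 1
        template:
          spec:
            containers:
            - name: driver
              image: busybox  # placeholder image
  - name: workers             # a different Pod template for the workers
    replicas: 1
    template:
      spec:
        parallelism: 4
        completions: 4
        template:
          spec:
            containers:
            - name: worker
              image: busybox  # placeholder image
```

Each ReplicatedJob can request different resources and carry different failure policies, which is exactly what the multi-template gap calls for.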
## How JobSet Works

JobSet models a distributed batch workload as a group of Kubernetes Jobs. This allows a user to easily specify different pod templates for distinct groups of pods (e.g. a leader, workers, parameter servers, etc.).

It uses the abstraction of a ReplicatedJob to manage child Jobs, where a ReplicatedJob is essentially a Job template with some desired number of Job replicas specified. This provides a declarative way to easily create identical child Jobs to run on different islands of accelerators, without resorting to scripting or Helm charts to generate many versions of the same job but with different names.

{{< figure src="jobset_diagram.svg" alt="JobSet Architecture" class="diagram-large" clicktozoom="true" >}}

Some other key JobSet features which address the problems described above include:

**Replicated Jobs**: In modern data centers, hardware accelerators like GPUs and TPUs are allocated in islands of homogeneous accelerators connected via specialized, high-bandwidth network links. For example, a user might provision nodes containing a group of hosts co-located on a rack, each with H100 GPUs, where GPU chips within each host are connected via NVLink, with an NVLink Switch connecting the multiple NVLinks. TPU Pods are another example of this: TPU ViperLitePods consist of 64 hosts, each with 4 TPU v5e chips attached, all connected via an ICI mesh. When running a distributed training job across multiple of these islands, we often want to partition the workload into a group of smaller identical jobs, one per island, where each pod primarily communicates with the pods within the same island to do segments of distributed computation, keeping gradient synchronization over DCN (data center network, which is lower bandwidth than ICI) to a bare minimum.

**Automatic headless service creation, configuration, and lifecycle management**: Pod-to-pod communication via pod hostname is enabled by default, with automatic configuration and lifecycle management of the headless service that enables this.

**Configurable success policies**: JobSet has configurable success policies which target specific ReplicatedJobs, with operators to target “Any” or “All” of their child jobs. For example, you can configure the JobSet to be marked complete if and only if all pods that are part of the “worker” ReplicatedJob are completed.

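The success policy just described could be expressed roughly as follows (a sketch based on our reading of the v1alpha2 API; the ReplicatedJob name `worker` is taken from the example above):

```yaml
spec:
  successPolicy:
    operator: All            # all child jobs of the targeted ReplicatedJobs must complete
    targetReplicatedJobs:
    - worker                 # only the "worker" ReplicatedJob counts toward JobSet success
```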
**Configurable failure policies**: JobSet has configurable failure policies which allow the user to specify a maximum number of times the JobSet should be restarted in the event of a failure. If any job is marked failed, the entire JobSet will be recreated, allowing the workload to resume from the last checkpoint. When no failure policy is specified, if any job fails, the JobSet simply fails.

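A minimal sketch of such a failure policy (the `maxRestarts` value is illustrative):

```yaml
spec:
  failurePolicy:
    maxRestarts: 3   # recreate the entire JobSet up to 3 times before marking it failed
```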
**Exclusive placement per topology domain**: JobSet allows users to express that child jobs have 1:1 exclusive assignment to a topology domain, typically an accelerator island like a rack. For example, if the JobSet creates two child jobs, then this feature will enforce that the pods of each child job are co-located on the same island, and that only one child job is allowed to schedule per island. This is useful for scenarios where we want to use a distributed data parallel (DDP) training strategy to train a model using multiple islands of compute resources (GPU racks or TPU slices), running one model replica in each accelerator island. This ensures that the forward and backward passes within a single model replica occur over the high bandwidth interconnect linking the accelerator chips within the island, and that only the gradient synchronization between model replicas occurs across accelerator islands, over the lower bandwidth data center network.

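Concretely, this 1:1 assignment is requested via an annotation on the JobSet (the same alpha annotation used in the full example later in this post); the annotation value names the node label that defines an island:

```yaml
metadata:
  annotations:
    # schedule at most one child Job per topology domain identified by this node label
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
```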
**Integration with Kueue**: Users can submit JobSets via [Kueue](https://kueue.sigs.k8s.io/) to oversubscribe their clusters, queue workloads to run as capacity becomes available, prevent partial scheduling and deadlocks, enable multi-tenancy, and more.

## Example use case

### Distributed ML training on multiple TPU slices with Jax

The following example is a JobSet spec for running a TPU Multislice workload on 4 TPU v5e [slices](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#slices). To learn more about TPU concepts and terminology, please refer to these [docs](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm).

This example uses [Jax](https://jax.readthedocs.io/en/latest/quickstart.html), an ML framework with native support for Just-In-Time (JIT) compilation targeting TPU chips via [OpenXLA](https://github.com/openxla). However, you can also use [PyTorch/XLA](https://pytorch.org/xla/release/2.3/index.html) to do ML training on TPUs.

This example makes use of several JobSet features (both explicitly and implicitly) to support the unique scheduling requirements of TPU multislice training out-of-the-box, with very little configuration required by the user.

```yaml
# Run a simple Jax workload on
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice
  annotations:
    # Give each child Job exclusive usage of a TPU slice
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: workers
    replicas: 4 # Set to number of TPU slices
    template:
      spec:
        parallelism: 2 # Set to number of VMs per TPU slice
        completions: 2 # Set to number of VMs per TPU slice
        backoffLimit: 0
        template:
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
            containers:
            - name: jax-tpu
              image: python:3.8
              ports:
              - containerPort: 8471
              - containerPort: 8080
              securityContext:
                privileged: true
              command:
              - bash
              - -c
              - |
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count())'
                sleep 60
              resources:
                limits:
                  google.com/tpu: 4
```
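As a sanity check on the numbers in the spec above (our arithmetic, not from the original post), the expected global device count follows directly from the replica, VM, and chip counts:

```python
# Values taken from the JobSet spec above
replicas = 4        # number of TPU slices (one child Job per slice)
vms_per_slice = 2   # parallelism / completions per child Job
chips_per_vm = 4    # google.com/tpu resource limit per pod

global_devices = replicas * vms_per_slice * chips_per_vm
print("Expected jax.device_count():", global_devices)  # 32
```

Once all slices have joined the multislice training job, each worker's `jax.device_count()` call should report this global total.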

## Future work and getting involved

We have a number of features planned for development this year, which can be found in the [JobSet roadmap](https://github.com/kubernetes-sigs/jobset?tab=readme-ov-file#roadmap).

Please feel free to reach out with feedback of any kind. We’re also open to additional contributors, whether it is to fix or report bugs, or to help add new features or write documentation.

You can get in touch with us via our [repo](http://sigs.k8s.io/jobset), [mailing list](https://groups.google.com/a/kubernetes.io/g/wg-batch) or on [Slack](https://kubernetes.slack.com/messages/wg-batch).

Last but not least, thanks to all [our contributors](https://github.com/kubernetes-sigs/jobset/graphs/contributors) who made this project possible!
