Commit df87009
Add KubeCon EU 2025 tutorial (#137)
1 parent 9342205 commit df87009

1 file changed: setup.KubeConEU25/README.md (+317 -0)

# MLBatch Tutorial

In this tutorial, we walk through all the steps necessary to set up MLBatch on a
Kubernetes cluster and run a few example workloads. Prior to the [cluster
setup](../setup.k8s/CLUSTER-SETUP.md), we will configure storage classes and
Prometheus. We will configure team `blue` with user `alice` and team `red` with
user `bob` following the [team setup](../setup.k8s/TEAM-SETUP.md).

## Cluster Characteristics

Our target cluster comprises three control-plane nodes and three worker nodes
running Kubernetes 1.29 (from OpenShift 4.16.36).
```sh
kubectl get nodes
```
```
NAME               STATUS   ROLES                  AGE     VERSION
pokprod-b93r38s3   Ready    worker                 5d13h   v1.29.11+148a389
pokprod-b93r39s2   Ready    worker                 5d12h   v1.29.11+148a389
pokprod-b93r44s0   Ready    worker                 5d13h   v1.29.11+148a389
pokprod002ctrl0    Ready    control-plane,master   5d15h   v1.29.11+148a389
pokprod002ctrl1    Ready    control-plane,master   5d15h   v1.29.11+148a389
pokprod002ctrl2    Ready    control-plane,master   5d15h   v1.29.11+148a389
```
Each worker node is equipped with eight NVIDIA H100 GPUs.
```sh
kubectl describe node pokprod-b93r38s3
```
```
Name:               pokprod-b93r38s3
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    ...
                    nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3
                    ...
                    nvidia.com/gpu.count=8
                    ...
Capacity:
  cpu:                224
  ephemeral-storage:  1873933640Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             2113411308Ki
  nvidia.com/gpu:     8
  openshift.io/p0_storage_sriov_nodepolicy:  8
  pods:               250
  rdma/roce_gdr:      0
...
```
For this tutorial, we assume the [NVIDIA GPU
operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html)
is already
[installed](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html)
on the cluster. While this cluster is capable of [GPU-direct RDMA (GDR) with
ROCE (RDMA over Converged
Ethernet)](https://medium.com/@sunyanan.choochotkaew1/unlocking-gpudirect-rdma-on-roce-in-kubernetes-based-cluster-on-cloud-through-multi-nic-cni-1e69ffb96296),
we will not cover advanced networking topics in this tutorial and disable this
feature.

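As a quick sanity check (not part of the original instructions), the GPU capacity
advertised by each node can be listed with a `custom-columns` query; the column
names are arbitrary:
```sh
# Show the number of allocatable NVIDIA GPUs per node
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu"
```
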
## Storage Setup

We assume storage is available by means of preconfigured
[NFS](https://en.wikipedia.org/wiki/Network_File_System) servers. We configure
two storage classes using the [NFS Subdir External
Provisioner](https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner).
```sh
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner
helm repo update
```
```sh
helm install -n nfs-provisioner simplenfs nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --create-namespace \
  --set nfs.server=192.168.95.253 --set nfs.path=/var/repo/root/nfs \
  --set storageClass.name=nfs-client-simplenfs --set storageClass.provisionerName=k8s-sigs.io/simplenfs-nfs-subdir-external-provisioner

helm install -n nfs-provisioner pokprod nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --create-namespace \
  --set nfs.server=192.168.98.96 --set nfs.path=/gpfs/fs_ec/pokprod002 \
  --set storageClass.name=nfs-client-pokprod --set storageClass.provisionerName=k8s-sigs.io/pokprod-nfs-subdir-external-provisioner
```
Make sure to replace the server IPs and paths above with the right ones for your
environment. While we make use of both storage classes in the remainder of the
tutorial for the sake of demonstration, everything could be done with a single
class.
```sh
kubectl get storageclasses
```
```
NAME                   PROVISIONER                                              RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
nfs-client-pokprod     k8s-sigs.io/pokprod-nfs-subdir-external-provisioner     Delete          Immediate           true                   11s
nfs-client-simplenfs   k8s-sigs.io/simplenfs-nfs-subdir-external-provisioner   Delete          Immediate           true                   15s
```

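As an illustration of how these classes are consumed (not part of the original
setup), a PersistentVolumeClaim can reference either class by name; the claim
name `scratch` and the 100Gi size below are placeholders:
```sh
kubectl apply -f- << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scratch
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: nfs-client-pokprod
  resources:
    requests:
      storage: 100Gi
EOF
```
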
## Prometheus Setup

TODO

## MLBatch Cluster Setup

We follow the instructions from [CLUSTER-SETUP.md](../setup.k8s/CLUSTER-SETUP.md).

```sh
# Clone MLBatch repository
git clone --recursive https://github.com/project-codeflare/mlbatch.git
cd mlbatch

# Setup priority classes
kubectl apply -f setup.k8s/mlbatch-priorities.yaml

# Deploy Coscheduler
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]'

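# Note: the pluginConfig above configures the NodeResourcesFit plugin with a
# RequestedToCapacityRatio scoring strategy over nvidia.com/gpu (score 0 at 0%
# utilization, 10 at 100%), i.e. GPU bin-packing, and gives the Coscheduling
# plugin a 300 second permit waiting time for gang scheduling.
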
# Wait for Coscheduler pods to be running
kubectl get pods -n scheduler-plugins

# Patch Coscheduler pod priorities
kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s/coscheduler-priority-patch.yaml scheduler-plugins-controller
kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s/coscheduler-priority-patch.yaml scheduler-plugins-scheduler

# Create mlbatch-system namespace
kubectl create namespace mlbatch-system

# Deploy Kubeflow training operator
kubectl apply --server-side -k setup.k8s/training-operator

# Deploy Kuberay
kubectl apply --server-side -k setup.k8s/kuberay

# Deploy Kueue
kubectl apply --server-side -k setup.k8s/kueue

# Wait for Kueue to be running
kubectl get pods -n kueue-system

# Deploy AppWrapper
kubectl apply --server-side -k setup.k8s/appwrapper

# Deploy Autopilot
helm repo add autopilot https://ibm.github.io/autopilot/
helm repo update

helm upgrade autopilot autopilot/autopilot --install -n autopilot --create-namespace

kubectl label servicemonitors -n autopilot autopilot-metrics-monitor release=kube-prometheus-stack --overwrite

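# Note: the release=kube-prometheus-stack label on Autopilot's ServiceMonitor is
# what lets a Prometheus instance deployed via the kube-prometheus-stack chart
# discover and scrape Autopilot's metrics (assuming that release name is used;
# see the Prometheus Setup section).
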
# Create Kueue's default flavor
kubectl apply -f setup.k8s/default-flavor.yaml

# Setup mlbatch-edit-role
kubectl apply -f setup.k8s/mlbatch-edit-role.yaml

# Create slack cluster queue with 8 GPUs
kubectl apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 224
      - name: "memory"
        nominalQuota: 2000G
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "pods"
        nominalQuota: 100
EOF
```
We reserve 8 GPUs out of 24 for MLBatch's slack queue.

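As an optional check (not part of the original instructions), the resulting
ClusterQueue and its registered quota can be inspected through Kueue's API:
```sh
kubectl get clusterqueues
kubectl describe clusterqueue slack-cluster-queue
```
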
## Autopilot Extended Setup

TODO

## MLBatch Teams Setup

We configure team `blue` with user `alice` and team `red` with user `bob` following
the [team setup](../setup.k8s/TEAM-SETUP.md). Each team has a nominal quota of
eight GPUs.
```sh
# Create namespaces
kubectl create ns blue
kubectl create ns red

kubectl label namespace blue mlbatch-team-namespace=true
kubectl label namespace red mlbatch-team-namespace=true

# Create queues
kubectl -n blue apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: blue-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 224
      - name: "memory"
        nominalQuota: 2000G
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "pods"
        nominalQuota: 100
EOF

kubectl apply -n blue -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
spec:
  clusterQueue: blue-cluster-queue
EOF

kubectl apply -n red -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: red-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 224
      - name: "memory"
        nominalQuota: 2000G
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "pods"
        nominalQuota: 100
EOF

kubectl apply -n red -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
spec:
  clusterQueue: red-cluster-queue
EOF

# Authorize alice and bob in their respective namespaces
kubectl -n blue apply -f- << EOF
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: alice
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: alice
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mlbatch-edit
EOF

kubectl -n red apply -f- << EOF
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: bob
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: bob
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mlbatch-edit
EOF
```
While we gave permissions to Kubernetes users `alice` and `bob`, we have not
tied these names to any identity provider, as the details of such a setup are
not portable. In this tutorial, we will rely on [user
impersonation](https://kubernetes.io/docs/reference/access-authn-authz/authentication/#user-impersonation)
with `kubectl` to run as a specific user.

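For example, assuming the current kubeconfig user is permitted to impersonate
`alice`, a command like the following runs as `alice` in team `blue`'s
namespace; the `appwrappers` resource is just an illustrative choice:
```sh
# List AppWrappers in the blue namespace as user alice
kubectl get appwrappers -n blue --as alice
```
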
## Batch Inference with vLLM

TODO

## Pre-Training with PyTorch

TODO
