Commit 79d10b1

Sakkara documentation in SETUP for non-RHOAI clusters (#138)
1 parent df87009 commit 79d10b1

17 files changed, +175 -17 lines changed

setup.RHOAI-v2.13/CLUSTER-SETUP.md

+8 -1
@@ -10,7 +10,12 @@ Create `default-priority`, `high-priority`, and `low-priority` priority classes:
 oc apply -f setup.RHOAI-v2.13/mlbatch-priorities.yaml
 ```

-## Coscheduler
+## Scheduler Plugins
+
+MLBatch utilizes Kubernetes Scheduler Plugins to ensure gang scheduling of
+multi-Pod workloads and to pack `Pods` onto `Nodes` to reduce GPU fragmentation.
+
+### Coscheduler

 Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
 ```sh
@@ -24,6 +29,8 @@ oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2
 oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.13/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
 ```

+
+
 ## Red Hat OpenShift AI

 Create the Red Hat OpenShift AI subscription:

setup.RHOAI-v2.16/CLUSTER-SETUP.md

+8 -1
@@ -10,7 +10,12 @@ Create `default-priority`, `high-priority`, and `low-priority` priority classes:
 oc apply -f setup.RHOAI-v2.16/mlbatch-priorities.yaml
 ```

-## Coscheduler
+## Scheduler Plugins
+
+MLBatch utilizes Kubernetes Scheduler Plugins to ensure gang scheduling of
+multi-Pod workloads and to pack `Pods` onto `Nodes` to reduce GPU fragmentation.
+
+### Coscheduler

 Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
 ```sh
@@ -24,6 +29,8 @@ oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2
 oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.16/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
 ```

+
+
 ## Red Hat OpenShift AI

 Create the Red Hat OpenShift AI subscription:

setup.RHOAI-v2.17/CLUSTER-SETUP.md

+8 -1
@@ -10,7 +10,12 @@ Create `default-priority`, `high-priority`, and `low-priority` priority classes:
 oc apply -f setup.RHOAI-v2.17/mlbatch-priorities.yaml
 ```

-## Coscheduler
+## Scheduler Plugins
+
+MLBatch utilizes Kubernetes Scheduler Plugins to ensure gang scheduling of
+multi-Pod workloads and to pack `Pods` onto `Nodes` to reduce GPU fragmentation.
+
+### Coscheduler

 Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
 ```sh
@@ -24,6 +29,8 @@ oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2
 oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.17/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
 ```

+
+
 ## Red Hat OpenShift AI

 Create the Red Hat OpenShift AI subscription:

setup.k8s/CLUSTER-SETUP.md

+35 -6
@@ -1,7 +1,7 @@
 # Cluster Setup

 The cluster setup installs and configures the following components:
-+ Coscheduler
++ Scheduler Plugins
 + Kubeflow Training Operator
 + KubeRay
 + Kueue
@@ -16,7 +16,13 @@ Create `default-priority`, `high-priority`, and `low-priority` priority classes:
 kubectl apply -f setup.k8s/mlbatch-priorities.yaml
 ```

-## Coscheduler
+## Scheduler Plugins
+
+MLBatch utilizes Kubernetes Scheduler Plugins to ensure gang scheduling of
+multi-Pod workloads and to pack `Pods` onto `Nodes` to reduce GPU fragmentation.
+Two options are described below: Coscheduler and Sakkara. You should pick and install one of them
+as a secondary scheduler for your cluster.
+### Coscheduler

 Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
 ```sh
@@ -30,6 +36,17 @@ kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s
 kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
 ```

+### Sakkara
+
+[Sakkara](https://github.com/atantawi/scheduler-plugins/tree/sakkara) is an experimental
+new scheduler plugin with advanced support for topology-aware scheduling.
+
+Install Sakkara as a secondary scheduler:
+```sh
+helm install sakkara-scheduler --namespace sakkara-scheduler --create-namespace mlbatch/sakkara-scheduler
+```
+Optionally, create a config map capturing your cluster's topology as described in the [Sakkara documentation](https://github.com/atantawi/sakkara-deploy/tree/main?tab=readme-ov-file#cluster-topology). This step is recommended for production clusters. If the config map is not present, Sakkara defaults to a single-level hierarchy containing the Nodes of the cluster.
+
 ## Install Operators

 Create the mlbatch-system namespace
@@ -38,8 +55,14 @@ kubectl create namespace mlbatch-system
 ```

 Install the Kubeflow Training Operator
+
+If you are using Coscheduler, run:
+```sh
+kubectl apply --server-side -k setup.k8s/training-operator/coscheduler
+```
+If you are using Sakkara, run:
 ```sh
-kubectl apply --server-side -k setup.k8s/training-operator
+kubectl apply --server-side -k setup.k8s/training-operator/sakkara
 ```

 Install the KubeRay Operator
@@ -53,13 +76,19 @@ kubectl apply --server-side -k setup.k8s/kueue
 ```

 Install the AppWrapper Operator
+If you are using Coscheduler, run:
 ```sh
-kubectl apply --server-side -k setup.k8s/appwrapper
+kubectl apply --server-side -k setup.k8s/appwrapper/coscheduler
 ```
+If you are using Sakkara, run:
+```sh
+kubectl apply --server-side -k setup.k8s/appwrapper/sakkara
+```
+
 The provided configuration differs from the default configuration of the
 operators as follows:
 - Kubeflow Training Operator:
-    - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`,
+    - `gang-scheduler-name` is set to either `scheduler-plugins-scheduler` or `sakkara-scheduler`,
 - Kueue:
     - `batch/job` integration is disabled,
     - `manageJobsWithoutQueueName` is enabled and configured via `managedJobsNamespaceSelector` to be
@@ -70,7 +99,7 @@ operators as follows:
     - `enableClusterQueueResources` metrics is enabled,
 - AppWrapper operator:
     - `userRBACAdmissionCheck` is disabled,
-    - `schedulerName` is set to `scheduler-plugins-scheduler`,
+    - `schedulerName` is set to `scheduler-plugins-scheduler` or `sakkara-scheduler`,
     - `queueName` is set to `default-queue`,
     - pod priorities, resource requests and limits have been adjusted.
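
The scheduler choice determines which kustomize overlay directory is applied for the Training Operator and the AppWrapper operator. A minimal sketch of that mapping follows. `overlay_path` is a hypothetical helper name that is not part of the MLBatch repository, and the `setup.k8s` prefix is assumed from the paths shown in this diff.

```sh
#!/bin/sh
# Hypothetical helper (not part of the MLBatch repository): map a component
# and a scheduler choice to the kustomize overlay directory named in the docs.
overlay_path() {
  component="$1"
  scheduler="$2"
  # Only these two components have scheduler-specific overlays in this diff.
  case "$component" in
    training-operator|appwrapper) ;;
    *) echo "no scheduler-specific overlay for: $component" >&2; return 1 ;;
  esac
  case "$scheduler" in
    coscheduler|sakkara) printf 'setup.k8s/%s/%s\n' "$component" "$scheduler" ;;
    *) echo "unknown scheduler: $scheduler" >&2; return 1 ;;
  esac
}
```

With the function sourced into the current shell, `kubectl apply --server-side -k "$(overlay_path appwrapper sakkara)"` would be equivalent to the Sakkara command in the documentation.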

setup.k8s/UNINSTALL.md

+4
@@ -20,4 +20,8 @@ kubectl delete clusterrole mlbatch-edit
 # Coscheduler uninstall
 helm uninstall -n scheduler-plugins scheduler-plugins
 kubectl delete namespace scheduler-plugins
+
+# Sakkara uninstall
+helm uninstall -n sakkara-scheduler sakkara-scheduler
+kubectl delete namespace sakkara-scheduler
 ```
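
Since only one of the two schedulers is expected to be installed, only the matching pair of uninstall commands applies. The sketch below prints the relevant commands without running them. `uninstall_cmds` is a hypothetical helper; the release and namespace names are taken from the lines in this diff.

```sh
#!/bin/sh
# Hypothetical helper: print (do not execute) the uninstall commands
# for the secondary scheduler that was installed on the cluster.
uninstall_cmds() {
  case "$1" in
    coscheduler) release=scheduler-plugins; ns=scheduler-plugins ;;
    sakkara)     release=sakkara-scheduler; ns=sakkara-scheduler ;;
    *) echo "unknown scheduler: $1" >&2; return 1 ;;
  esac
  printf 'helm uninstall -n %s %s\n' "$ns" "$release"
  printf 'kubectl delete namespace %s\n' "$ns"
}
```

Printing first makes it easy to review the commands before piping them to `sh`.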

setup.k8s/appwrapper/kustomization.yaml → setup.k8s/appwrapper/base/kustomization.yaml

-1
@@ -17,6 +17,5 @@ images:
   newTag: v0.30.0

 patches:
-- path: config_patch.yaml
 - path: manager_resources_patch.yaml
 - path: remove_default_namespace.yaml
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+namespace: mlbatch-system
+
+resources:
+- ../base
+
+patches:
+- path: config_patch.yaml
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
+kind: ConfigMap
+apiVersion: v1
+metadata:
+  name: appwrapper-operator-config
+  namespace: appwrapper-system
+data:
+  config.yaml: |
+    appwrapper:
+      enableKueueIntegrations: true
+      kueueJobReconciller:
+        manageJobsWithoutQueueName: true
+        waitForPodsReady:
+          enable: false
+      defaultQueueName: default-queue
+      schedulerName: sakkara-scheduler
+      slackQueueName: slack-cluster-queue
+      userRBACAdmissionCheck: false
+    controllerManager:
+      health:
+        bindAddress: ":8081"
+      metrics:
+        bindAddress: "127.0.0.1:8080"
+      leaderElection: true
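
Because the AppWrapper operator takes `schedulerName` from this config map, one quick sanity check is to confirm that the value matches the scheduler that was actually installed. The sketch below extracts the value from config text on stdin. `config_scheduler_name` is a hypothetical helper, and the simple `awk` match assumes the key appears once, on its own line.

```sh
#!/bin/sh
# Hypothetical helper: read YAML text on stdin and print the value of the
# first `schedulerName:` key. This is a plain text match, not a YAML parser.
config_scheduler_name() {
  awk '$1 == "schedulerName:" { print $2; exit }'
}
```

For example, the rendered overlay could be piped in with `kubectl kustomize setup.k8s/appwrapper/sakkara | config_scheduler_name`, which should print the name of the scheduler the overlay targets.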
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+namespace: mlbatch-system
+
+resources:
+- ../base
+
+patches:
+- path: config_patch.yaml

setup.k8s/training-operator/manager_resources_patch.yaml → setup.k8s/training-operator/base/manager_resources_patch.yaml

-1
@@ -10,7 +10,6 @@ spec:
       - name: training-operator
         args:
         - "--zap-log-level=2"
-        - "--gang-scheduler-name=scheduler-plugins-scheduler"
         resources:
           requests:
             cpu: 100m
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+namespace: mlbatch-system
+
+resources:
+- ../base
+
+patches:
+- target:
+    kind: Deployment
+    name: training-operator
+  patch: |
+    - op: add
+      path: /spec/template/spec/containers/0/args/-
+      value: "--gang-scheduler-name=scheduler-plugins-scheduler"
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+namespace: mlbatch-system
+
+resources:
+- ../base
+
+patches:
+- target:
+    kind: Deployment
+    name: training-operator
+  patch: |
+    - op: add
+      path: /spec/template/spec/containers/0/args/-
+      value: "--gang-scheduler-name=sakkara-scheduler"
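
The two overlays differ only in the value appended to the Training Operator's container arguments (the trailing `-` in the JSON Patch path appends to the `args` array). As a rough sketch, the mapping from scheduler choice to that argument can be written as below. Both function names are hypothetical and do not come from the MLBatch repository.

```sh
#!/bin/sh
# Hypothetical helpers: map the scheduler choice to the scheduler name
# and to the argument that the matching overlay appends.
scheduler_name() {
  case "$1" in
    coscheduler) echo scheduler-plugins-scheduler ;;
    sakkara)     echo sakkara-scheduler ;;
    *) echo "unknown scheduler: $1" >&2; return 1 ;;
  esac
}

gang_scheduler_arg() {
  # Fail if the scheduler choice is not recognized.
  name="$(scheduler_name "$1")" || return 1
  printf '%s%s\n' '--gang-scheduler-name=' "$name"
}
```

After applying an overlay, the printed argument can be compared by eye with the output of `kubectl get deployment training-operator -n mlbatch-system -o jsonpath='{.spec.template.spec.containers[0].args}'`, assuming the operator is deployed to the `mlbatch-system` namespace as the kustomization indicates.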

setup.tmpl/CLUSTER-SETUP.md.tmpl

+39 -6
@@ -6,7 +6,7 @@ cluster roles, and priority classes.
 
 {{- else -}}
 The cluster setup installs and configures the following components:
-+ Coscheduler
++ Scheduler Plugins
 + Kubeflow Training Operator
 + KubeRay
 + Kueue
@@ -23,7 +23,15 @@ Create `default-priority`, `high-priority`, and `low-priority` priority classes:
 {{ .KUBECTL }} apply -f setup.{{ .VERSION }}/mlbatch-priorities.yaml
 ```

-## Coscheduler
+## Scheduler Plugins
+
+MLBatch utilizes Kubernetes Scheduler Plugins to ensure gang scheduling of
+multi-Pod workloads and to pack `Pods` onto `Nodes` to reduce GPU fragmentation.
+{{ if not .OPENSHIFT -}}
+Two options are described below: Coscheduler and Sakkara. You should pick and install one of them
+as a secondary scheduler for your cluster.
+{{- end }}
+### Coscheduler

 Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
 ```sh
@@ -37,6 +45,19 @@ Patch Coscheduler pod priorities:
 {{ .KUBECTL }} patch deployment -n scheduler-plugins --type=json --patch-file setup.{{ .VERSION }}/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
 ```

+{{ if not .OPENSHIFT -}}
+### Sakkara
+
+[Sakkara](https://github.com/atantawi/scheduler-plugins/tree/sakkara) is an experimental
+new scheduler plugin with advanced support for topology-aware scheduling.
+
+Install Sakkara as a secondary scheduler:
+```sh
+helm install sakkara-scheduler --namespace sakkara-scheduler --create-namespace mlbatch/sakkara-scheduler
+```
+Optionally, create a config map capturing your cluster's topology as described in the [Sakkara documentation](https://github.com/atantawi/sakkara-deploy/tree/main?tab=readme-ov-file#cluster-topology). This step is recommended for production clusters. If the config map is not present, Sakkara defaults to a single-level hierarchy containing the Nodes of the cluster.
+{{- end }}
+
 {{ if .OPENSHIFT -}}
 ## Red Hat OpenShift AI

@@ -115,8 +136,14 @@ Create the mlbatch-system namespace
 ```

 Install the Kubeflow Training Operator
+
+If you are using Coscheduler, run:
+```sh
+{{ .KUBECTL }} apply --server-side -k setup.{{ .VERSION }}/training-operator/coscheduler
+```
+If you are using Sakkara, run:
 ```sh
-{{ .KUBECTL }} apply --server-side -k setup.{{ .VERSION }}/training-operator
+{{ .KUBECTL }} apply --server-side -k setup.{{ .VERSION }}/training-operator/sakkara
 ```

 Install the KubeRay Operator
@@ -130,13 +157,19 @@ Install Kueue
 ```

 Install the AppWrapper Operator
+If you are using Coscheduler, run:
 ```sh
-{{ .KUBECTL }} apply --server-side -k setup.{{ .VERSION }}/appwrapper
+{{ .KUBECTL }} apply --server-side -k setup.{{ .VERSION }}/appwrapper/coscheduler
 ```
+If you are using Sakkara, run:
+```sh
+{{ .KUBECTL }} apply --server-side -k setup.{{ .VERSION }}/appwrapper/sakkara
+```
+
 The provided configuration differs from the default configuration of the
 operators as follows:
 - Kubeflow Training Operator:
-    - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`,
+    - `gang-scheduler-name` is set to either `scheduler-plugins-scheduler` or `sakkara-scheduler`,
 - Kueue:
     - `batch/job` integration is disabled,
     - `manageJobsWithoutQueueName` is enabled and configured via `managedJobsNamespaceSelector` to be
@@ -149,7 +182,7 @@ operators as follows:
     - `enableClusterQueueResources` metrics is enabled,
 - AppWrapper operator:
     - `userRBACAdmissionCheck` is disabled,
-    - `schedulerName` is set to `scheduler-plugins-scheduler`,
+    - `schedulerName` is set to `scheduler-plugins-scheduler` or `sakkara-scheduler`,
     - `queueName` is set to `default-queue`,
     - pod priorities, resource requests and limits have been adjusted.
