@@ -91,7 +91,109 @@ oc adm policy add-scc-to-user hostmount-anyuid system:serviceaccount:nfs-provisi

### Prometheus Setup

- TODO
+ We follow the setup provided by the `prometheus-community/kube-prometheus-stack` Helm chart.
+
+ ```bash
+ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update
+ ```
+
+ The chart installs Prometheus, Grafana, Alertmanager, Prometheus Node Exporter, and kube-state-metrics. We configure the chart to:
+
+ - Enable persistent storage for Prometheus, Grafana, and Alertmanager;
+ - Override the Prometheus Node Exporter port;
+ - Disable CRD creation, as the CRDs are already present.
+
+ The last two overrides are only needed when deploying a separate Prometheus instance on OpenShift. On other clusters, you may leave CRD creation enabled and keep the default Node Exporter configuration.
+
+ ```bash
+ cat << EOF >> config.yaml
+ crds:
+   enabled: false
+
+ prometheus-node-exporter:
+   service:
+     port: 9110
+
+ alertmanager:
+   alertmanagerSpec:
+     persistentVolumeClaimRetentionPolicy:
+       whenDeleted: Retain
+       whenScaled: Retain
+     storage:
+       volumeClaimTemplate:
+         spec:
+           storageClassName: nfs-client-pokprod
+           accessModes: ["ReadWriteOnce"]
+           resources:
+             requests:
+               storage: 50Gi
+
+ prometheus:
+   prometheusSpec:
+     persistentVolumeClaimRetentionPolicy:
+       whenDeleted: Retain
+       whenScaled: Retain
+     storageSpec:
+       volumeClaimTemplate:
+         spec:
+           storageClassName: nfs-client-pokprod
+           accessModes: ["ReadWriteOnce"]
+           resources:
+             requests:
+               storage: 50Gi
+       emptyDir:
+         medium: Memory
+
+ grafana:
+   persistence:
+     enabled: true
+     type: sts
+     storageClassName: "nfs-client-pokprod"
+     accessModes:
+       - ReadWriteOnce
+     size: 20Gi
+     finalizers:
+       - kubernetes.io/pvc-protection
+ EOF
+
+ helm upgrade -i kube-prometheus-stack -n prometheus prometheus-community/kube-prometheus-stack --create-namespace -f config.yaml
+ ```
+
+ When deploying on OpenShift-based systems, assign the `privileged` security context constraint (SCC) to the service accounts created by the Helm chart.
+
+ ```bash
+ oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:kube-prometheus-stack-admission system:serviceaccount:prometheus:kube-prometheus-stack-alertmanager system:serviceaccount:prometheus:kube-prometheus-stack-grafana system:serviceaccount:prometheus:kube-prometheus-stack-kube-state-metrics system:serviceaccount:prometheus:kube-prometheus-stack-operator system:serviceaccount:prometheus:kube-prometheus-stack-prometheus system:serviceaccount:prometheus:kube-prometheus-stack-prometheus-node-exporter
+ ```
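The long argument list above can also be generated from the account name suffixes, which makes it easier to review or extend. The following is a minimal sketch, assuming the `prometheus` namespace and the `kube-prometheus-stack` release name used in this section; the helper name `scc_subjects` is hypothetical and is not part of `oc` or the chart.

```shell
# Sketch: build the service account subjects for `oc adm policy add-scc-to-user`.
# Assumes namespace "prometheus" and release "kube-prometheus-stack" as above.
# scc_subjects is a hypothetical helper name, not part of oc or the chart.
scc_subjects() {
  local namespace="$1" release="$2"
  local suffix
  for suffix in admission alertmanager grafana kube-state-metrics \
                operator prometheus prometheus-node-exporter; do
    printf 'system:serviceaccount:%s:%s-%s\n' "$namespace" "$release" "$suffix"
  done
}

# Print the subjects, one per line, so they can be reviewed before use.
scc_subjects prometheus kube-prometheus-stack
```

To apply the result, the output can be passed unquoted so that each line becomes one argument: `oc adm policy add-scc-to-user privileged $(scc_subjects prometheus kube-prometheus-stack)`.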
+
+ You should see pods similar to the following in the `prometheus` namespace:
+
+ ```bash
+ kubectl --namespace prometheus get pods
+ ```
+ ```bash
+ NAME                                                        READY   STATUS    RESTARTS   AGE
+ alertmanager-kube-prometheus-stack-alertmanager-0           2/2     Running   0          16m
+ kube-prometheus-stack-grafana-0                             3/3     Running   0          16m
+ kube-prometheus-stack-kube-state-metrics-6f76b98d89-pxs69   1/1     Running   0          16m
+ kube-prometheus-stack-operator-7fbfc985bb-mm9bk             1/1     Running   0          16m
+ kube-prometheus-stack-prometheus-node-exporter-44llp        1/1     Running   0          16m
+ kube-prometheus-stack-prometheus-node-exporter-95gp8        1/1     Running   0          16m
+ kube-prometheus-stack-prometheus-node-exporter-dxf5f        1/1     Running   0          16m
+ kube-prometheus-stack-prometheus-node-exporter-f45dx        1/1     Running   0          16m
+ kube-prometheus-stack-prometheus-node-exporter-pfrzk        1/1     Running   0          16m
+ kube-prometheus-stack-prometheus-node-exporter-zpfzb        1/1     Running   0          16m
+ prometheus-kube-prometheus-stack-prometheus-0               2/2     Running   0          16m
+ ```
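With many Node Exporter pods, it can be easier to filter the listing than to read it. The following is a minimal sketch that reads the default `kubectl get pods` table on standard input and prints any pod whose READY count is incomplete or whose STATUS is not `Running`. The helper name `not_ready_pods` is hypothetical, and the sample listing is illustrative rather than captured from a real cluster.

```shell
# Sketch: list pods that are not fully ready in `kubectl get pods` output.
# not_ready_pods is a hypothetical helper; it reads the table on stdin.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")   # READY column, e.g. "2/2"
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}

# Illustrative listing with two unhealthy pods.
sample='NAME READY STATUS RESTARTS AGE
alertmanager-kube-prometheus-stack-alertmanager-0 2/2 Running 0 16m
kube-prometheus-stack-grafana-0 2/3 Running 0 16m
prometheus-kube-prometheus-stack-prometheus-0 0/2 Pending 0 16m'
printf '%s\n' "$sample" | not_ready_pods
```

Against a live cluster this would be used as `kubectl --namespace prometheus get pods | not_ready_pods`; empty output means every pod is running and ready. Note that pods in a `Completed` state would also be reported by this filter.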
+
+ To access the Grafana dashboard at `localhost:3000`, retrieve the admin password and then forward the Grafana port:
+
+ ```bash
+ kubectl --namespace prometheus get secrets kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo
+ ```
+ ```bash
+ export POD_NAME=$(kubectl --namespace prometheus get pod -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=kube-prometheus-stack" -o name)
+ kubectl --namespace prometheus port-forward "$POD_NAME" 3000
+ ```

### MLBatch Cluster Setup

@@ -180,7 +282,23 @@ We reserve 8 GPUs out of 24 for MLBatch's slack queue.

### Autopilot Extended Setup

- TODO
+ Autopilot can be configured to periodically test PVC creation and deletion for a given storage class.
+
+ ```bash
+ cat << EOF >> autopilot-extended.yaml
+ env:
+   - name: "PERIODIC_CHECKS"
+     value: "pciebw,remapped,dcgm,ping,gpupower,pvc"
+   - name: "PVC_TEST_STORAGE_CLASS"
+     value: "nfs-client-pokprod"
+ EOF
+ ```
+
+ Then reapply the Helm chart; this triggers a rolling update.
+
+ ```bash
+ helm upgrade autopilot autopilot/autopilot --install --namespace=autopilot --create-namespace -f autopilot-extended.yaml
+ ```
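Before enabling the `pvc` check, it may help to confirm by hand that the storage class can provision a volume. The following is a minimal sketch that prints a small PVC manifest for that purpose; it does not reproduce Autopilot's own test. The helper name `pvc_manifest`, the claim name `pvc-smoke-test`, and the `1Gi` size are assumptions chosen for illustration.

```shell
# Sketch: emit a small PVC manifest for a manual provisioning check.
# pvc_manifest and the claim name below are hypothetical; this is not
# Autopilot's own test, only a way to exercise the storage class by hand.
pvc_manifest() {
  local name="$1" storage_class="$2"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
spec:
  storageClassName: ${storage_class}
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
EOF
}

# Print the manifest for the storage class used in this section.
pvc_manifest pvc-smoke-test nfs-client-pokprod
```

The output can be piped to `kubectl --namespace autopilot apply -f -`, inspected with `kubectl --namespace autopilot get pvc pvc-smoke-test`, and removed with `kubectl --namespace autopilot delete pvc pvc-smoke-test`. Depending on the storage class's volume binding mode, the claim may stay `Pending` until a pod mounts it.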

### MLBatch Teams Setup
