
Commit ccb04dc

Updates to readme for Prometheus and Autopilot (#147)
1 parent 6c04091 commit ccb04dc

1 file changed: +120 -2 lines

setup.KubeConEU25/README.md

@@ -91,7 +91,109 @@ oc adm policy add-scc-to-user hostmount-anyuid system:serviceaccount:nfs-provisi

### Prometheus Setup

We follow the setup provided by the `prometheus-community/kube-prometheus-stack` Helm chart.

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update
```

The chart installs Prometheus, Grafana, Alertmanager, Prometheus Node Exporter, and Kube State Metrics. We set up the chart with the following changes:

- Persistent storage for Prometheus, Grafana, and Alertmanager;
- An override of the Prometheus Node Exporter port;
- CRD creation disabled, since the CRDs are already present on the cluster.

You may leave CRD creation enabled and keep the default Node Exporter port; these overrides are only needed when deploying a separate Prometheus instance on OpenShift, where the monitoring CRDs and a node exporter are typically already present.
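
As a quick check before disabling CRD creation, you can list any Prometheus Operator CRDs already installed on the cluster (a minimal sketch; `monitoring.coreos.com` is the API group used by the Prometheus Operator CRDs):

```bash
# If this returns entries (prometheuses, alertmanagers, servicemonitors, ...),
# keep `crds.enabled: false` in the config below.
kubectl get crds | grep monitoring.coreos.com
```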

```bash
cat << EOF >> config.yaml
crds:
  enabled: false

prometheus-node-exporter:
  service:
    port: 9110

alertmanager:
  alertmanagerSpec:
    persistentVolumeClaimRetentionPolicy:
      whenDeleted: Retain
      whenScaled: Retain
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-client-pokprod
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

prometheus:
  prometheusSpec:
    persistentVolumeClaimRetentionPolicy:
      whenDeleted: Retain
      whenScaled: Retain
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-client-pokprod
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
      emptyDir:
        medium: Memory

grafana:
  persistence:
    enabled: true
    type: sts
    storageClassName: "nfs-client-pokprod"
    accessModes:
      - ReadWriteOnce
    size: 20Gi
    finalizers:
      - kubernetes.io/pvc-protection
EOF

helm upgrade -i kube-prometheus-stack -n prometheus prometheus-community/kube-prometheus-stack --create-namespace -f config.yaml
```

If deploying on an OpenShift-based system, you need to grant the privileged security context constraint (SCC) to the service accounts created by the Helm chart.

```bash
oc adm policy add-scc-to-user privileged \
  system:serviceaccount:prometheus:kube-prometheus-stack-admission \
  system:serviceaccount:prometheus:kube-prometheus-stack-alertmanager \
  system:serviceaccount:prometheus:kube-prometheus-stack-grafana \
  system:serviceaccount:prometheus:kube-prometheus-stack-kube-state-metrics \
  system:serviceaccount:prometheus:kube-prometheus-stack-operator \
  system:serviceaccount:prometheus:kube-prometheus-stack-prometheus \
  system:serviceaccount:prometheus:kube-prometheus-stack-prometheus-node-exporter
```
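
To confirm the grant, one quick check (a sketch, not the only way) is to look for the service accounts in the privileged SCC's user list:

```bash
# The kube-prometheus-stack service accounts should appear under `users:`
oc get scc privileged -o yaml | grep prometheus
```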

You should expect the following pods in the `prometheus` namespace:

```bash
kubectl get pods -n prometheus
```
```bash
NAME                                                         READY   STATUS    RESTARTS   AGE
alertmanager-kube-prometheus-stack-alertmanager-0            2/2     Running   0          16m
kube-prometheus-stack-grafana-0                              3/3     Running   0          16m
kube-prometheus-stack-kube-state-metrics-6f76b98d89-pxs69    1/1     Running   0          16m
kube-prometheus-stack-operator-7fbfc985bb-mm9bk              1/1     Running   0          16m
kube-prometheus-stack-prometheus-node-exporter-44llp         1/1     Running   0          16m
kube-prometheus-stack-prometheus-node-exporter-95gp8         1/1     Running   0          16m
kube-prometheus-stack-prometheus-node-exporter-dxf5f         1/1     Running   0          16m
kube-prometheus-stack-prometheus-node-exporter-f45dx         1/1     Running   0          16m
kube-prometheus-stack-prometheus-node-exporter-pfrzk         1/1     Running   0          16m
kube-prometheus-stack-prometheus-node-exporter-zpfzb         1/1     Running   0          16m
prometheus-kube-prometheus-stack-prometheus-0                2/2     Running   0          16m
```
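
Since persistent storage is enabled for Prometheus, Grafana, and Alertmanager, you can also check that their volume claims were bound by the `nfs-client-pokprod` storage class (a quick sketch):

```bash
# All PVCs in the namespace should report STATUS Bound
kubectl get pvc -n prometheus
```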

To access the Grafana dashboard on `localhost:3000`, first retrieve the `admin` password, then port-forward the Grafana pod:

```bash
kubectl --namespace prometheus get secrets kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo
```
```bash
export POD_NAME=$(kubectl --namespace prometheus get pod -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=kube-prometheus-stack" -oname)
kubectl --namespace prometheus port-forward $POD_NAME 3000
```
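
Similarly, the Prometheus UI can be reached on `localhost:9090` by port-forwarding its service (a sketch; the service name assumes the `kube-prometheus-stack` release name used above):

```bash
# Forward the Prometheus web UI created for this release
kubectl --namespace prometheus port-forward svc/kube-prometheus-stack-prometheus 9090
```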

### MLBatch Cluster Setup

@@ -180,7 +282,23 @@ We reserve 8 GPUs out of 24 for MLBatch's slack queue.

### Autopilot Extended Setup

It is possible to configure Autopilot so that it also tests PVC creation and deletion against a given storage class.

```bash
cat << EOF >> autopilot-extended.yaml
env:
  - name: "PERIODIC_CHECKS"
    value: "pciebw,remapped,dcgm,ping,gpupower,pvc"
  - name: "PVC_TEST_STORAGE_CLASS"
    value: "nfs-client-pokprod"
EOF
```

Then reapply the Helm chart; this will start a rolling update of the Autopilot pods.

```bash
helm upgrade autopilot autopilot/autopilot --install --namespace=autopilot --create-namespace -f autopilot-extended.yaml
```
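
To verify that the new configuration rolled out, one option (a minimal sketch; it assumes nothing beyond the `autopilot` namespace used above) is to watch the pods restart and confirm the rendered environment now includes the `pvc` check:

```bash
# Watch the Autopilot pods come back up with the new configuration
kubectl get pods -n autopilot -w

# PERIODIC_CHECKS should now list "pvc" among the enabled checks
kubectl get pods -n autopilot -o yaml | grep -A1 PERIODIC_CHECKS
```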

### MLBatch Teams Setup
