Commit 40d7341

fix: add kube-prometheus-stack as gpu-operator dependency
GPU operator with dcgmExporter.serviceMonitor.enabled=true tries to create a ServiceMonitor CR during reconciliation. If kube-prometheus-stack (which provides the ServiceMonitor CRD) is not yet installed, the operator reports ClusterPolicy as notReady with "couldn't find ServiceMonitor CRD".

Add kube-prometheus-stack as a dependency of gpu-operator in base.yaml to ensure the monitoring stack and its CRDs are installed first.

Fixes: #165
Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
1 parent f9f1ec0 commit 40d7341
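
The ordering effect described in the commit message can be illustrated with a small dependency sort. This is a minimal sketch, not the project's actual resolver: the `deployment_order` helper and the trimmed-down `components` mapping are hypothetical, and only the `gpu-operator` entry mirrors the `dependencyRefs` change in this commit.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def deployment_order(dependency_refs):
    """Return component names so that every dependency precedes its dependents.

    dependency_refs maps a component name to the list of components it depends on.
    Raises graphlib.CycleError if the dependencies form a cycle.
    """
    sorter = TopologicalSorter()
    for name, deps in dependency_refs.items():
        # add(node, *predecessors): predecessors are ordered before node
        sorter.add(name, *deps)
    return list(sorter.static_order())


# Hypothetical component graph; the real recipes contain more components.
components = {
    "cert-manager": [],
    "kube-prometheus-stack": [],
    # After this commit, gpu-operator depends on both components above.
    "gpu-operator": ["cert-manager", "kube-prometheus-stack"],
}

order = deployment_order(components)
```

With the extra dependency in place, any order the sorter produces installs `kube-prometheus-stack` (and therefore the ServiceMonitor CRD) before `gpu-operator`. The relative order of unrelated components is not specified by this sketch.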

5 files changed, with 6 additions and 3 deletions

examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml

Lines changed: 1 addition & 0 deletions
@@ -152,6 +152,7 @@ componentRefs:

 deploymentOrder:
 - cert-manager
+- kube-prometheus-stack
 - gpu-operator
 - nvidia-dra-driver-gpu
 - nvsentinel

examples/recipes/eks-training.yaml

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ componentRefs:
 valuesFile: components/skyhook-operator/values.yaml
 deploymentOrder:
 - cert-manager
+- kube-prometheus-stack
 - gpu-operator
 - nvsentinel
 - skyhook-operator

recipes/overlays/base.yaml

Lines changed: 1 addition & 0 deletions
@@ -50,6 +50,7 @@ spec:
 namespace: gpu-operator
 dependencyRefs:
 - cert-manager
+- kube-prometheus-stack

 - name: nvsentinel
 type: Helm

tests/chainsaw/ai-conformance/offline/assert-recipe.yaml

Lines changed: 2 additions & 2 deletions
@@ -57,11 +57,11 @@ deploymentOrder:
 - aws-efa
 - cert-manager
 - dynamo-crds
-- gpu-operator
-- kai-scheduler
 - kgateway-crds
 - kgateway
 - kube-prometheus-stack
+- gpu-operator
+- kai-scheduler
 - dynamo-platform
 - k8s-ephemeral-storage-metrics
 - nvidia-dra-driver-gpu

tests/chainsaw/cli/cuj1-training/assert-recipe.yaml

Lines changed: 1 addition & 1 deletion
@@ -52,9 +52,9 @@ deploymentOrder:
 - aws-ebs-csi-driver
 - aws-efa
 - cert-manager
+- kube-prometheus-stack
 - gpu-operator
 - kai-scheduler
-- kube-prometheus-stack
 - k8s-ephemeral-storage-metrics
 - kubeflow-trainer
 - nvidia-dra-driver-gpu
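
The expected `deploymentOrder` lists in the test fixtures can be checked against the new dependency. The sketch below is illustrative only: `order_violations` is a hypothetical helper rather than part of the test suite, and the `before`/`after` lists are limited to the entries visible in the cuj1-training hunk above.

```python
def order_violations(order, dependency_refs):
    """Return (dependency, dependent) pairs where the dependency is listed too late."""
    position = {name: index for index, name in enumerate(order)}
    violations = []
    for name, deps in dependency_refs.items():
        for dep in deps:
            # Only compare components that both appear in the given order.
            if name in position and dep in position and position[dep] > position[name]:
                violations.append((dep, name))
    return violations


# Dependency introduced by this commit (from recipes/overlays/base.yaml).
deps = {"gpu-operator": ["cert-manager", "kube-prometheus-stack"]}

# Entries visible in the cuj1-training hunk, before and after the change.
before = [
    "aws-ebs-csi-driver", "aws-efa", "cert-manager",
    "gpu-operator", "kai-scheduler", "kube-prometheus-stack",
    "k8s-ephemeral-storage-metrics", "kubeflow-trainer", "nvidia-dra-driver-gpu",
]
after = [
    "aws-ebs-csi-driver", "aws-efa", "cert-manager",
    "kube-prometheus-stack", "gpu-operator", "kai-scheduler",
    "k8s-ephemeral-storage-metrics", "kubeflow-trainer", "nvidia-dra-driver-gpu",
]

before_violations = order_violations(before, deps)
after_violations = order_violations(after, deps)
```

The old order places `kube-prometheus-stack` after `gpu-operator`, which is the situation the commit message describes. The updated order satisfies the dependency.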
