The monitor component is an optional feature of HAMi DRA Webhook that collects and exposes GPU resource metrics via Prometheus.
It watches Kubernetes ResourceSlice and ResourceClaim resources, maintains an in-memory cache of GPU device allocations, and exposes Prometheus metrics for monitoring GPU resource usage across the cluster.
The monitor component is enabled by default when installing the Helm chart. To disable it:
```bash
helm install hami-dra ./charts/hami-dra \
  --set monitor.enabled=false
```

Configure the monitor in `charts/hami-dra/values.yaml`:
```yaml
monitor:
  enabled: true
  replicas: 1
  logLevel: 2
  metricsBindAddress: ":8080"
  healthProbeBindAddress: ":8000"
  kubeAPIQPS: 40.0
  kubeAPIBurst: 60
  collectInterval: "30s"
```

Configuration Parameters:

- `enabled`: Enable or disable the monitor component (default: `true`)
- `replicas`: Number of monitor pod replicas (default: `1`)
- `logLevel`: Log verbosity level (default: `2`)
- `metricsBindAddress`: Address and port for the metrics endpoint (default: `:8080`)
- `healthProbeBindAddress`: Address and port for health probe endpoints (default: `:8000`)
- `kubeAPIQPS`: QPS limit for the Kubernetes API client (default: `40.0`)
- `kubeAPIBurst`: Burst limit for the Kubernetes API client (default: `60`)
- `collectInterval`: Interval for metrics collection (default: `30s`)
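The same values can also be set on the command line instead of editing `values.yaml`; for example (the values here are illustrative):

```bash
helm install hami-dra ./charts/hami-dra \
  --set monitor.replicas=2 \
  --set monitor.kubeAPIQPS=80 \
  --set monitor.kubeAPIBurst=120
```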
The monitor service can be configured to use different service types depending on your access requirements.
Use ClusterIP for internal cluster access:
```yaml
monitor:
  enabled: true
  service:
    type: ClusterIP
```

Access metrics via port-forward:

```bash
kubectl port-forward svc/hami-dra-monitor 8080:8080 -n <namespace>
curl http://localhost:8080/metrics
```

Use NodePort to expose metrics outside the cluster:
With specified ports:

```yaml
monitor:
  enabled: true
  service:
    type: NodePort
    nodePort:
      metrics: 30080  # NodePort for metrics endpoint
```

With auto-assigned ports:

```yaml
monitor:
  enabled: true
  service:
    type: NodePort
    nodePort:
      metrics: ""  # Kubernetes will assign a random port
```

Access metrics via NodePort:
```bash
# Get the NodePort
kubectl get svc hami-dra-monitor -n <namespace> -o jsonpath='{.spec.ports[?(@.name=="metrics")].nodePort}'

# Access metrics
curl http://<node-ip>:<nodeport>/metrics
```

Use LoadBalancer for cloud provider load balancer integration:
```yaml
monitor:
  enabled: true
  service:
    type: LoadBalancer
```
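Once the cloud provider has provisioned the load balancer, the external address can be read from the service status (a sketch; depending on your provider the address may appear under `hostname` rather than `ip`):

```bash
# Get the external address assigned by the cloud provider
kubectl get svc hami-dra-monitor -n <namespace> -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Access metrics
curl http://<external-ip>:8080/metrics
```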
The monitor exposes the following Prometheus metrics:

`GPUDeviceMemoryLimit`: Device memory limit for a GPU (in MB).

Labels:

- `nodeid`: Kubernetes node name
- `deviceuuid`: GPU device UUID
- `deviceidx`: Device index on the node
- `devicename`: Device name
- `devicebrand`: Device brand (e.g., Tesla)
- `deviceproductname`: Device product name (e.g., Tesla V100)

Example:

```text
GPUDeviceMemoryLimit{nodeid="node1", deviceuuid="gpu-uuid-123", deviceidx="0", devicename="gpu0", devicebrand="Tesla", deviceproductname="Tesla V100"} 16000
```
`GPUDeviceCoreLimit`: Device core limit for a GPU. Labels: same as `GPUDeviceMemoryLimit`.
`GPUDeviceMemoryAllocated`: Device memory currently allocated for a GPU (in MB). Labels: same as `GPUDeviceMemoryLimit`.
`GPUDeviceCoreAllocated`: Device cores currently allocated for a GPU. Labels: same as `GPUDeviceMemoryLimit`.
`vGPUDeviceMemoryAllocated`: vGPU device memory allocated for a container (in MB).

Labels:

- `nodeid`: Kubernetes node name
- `deviceuuid`: GPU device UUID
- `deviceidx`: Device index on the node
- `devicename`: Device name
- `devicebrand`: Device brand
- `deviceproductname`: Device product name
- `podnamespace`: Pod namespace
- `podname`: Pod name

Example:

```text
vGPUDeviceMemoryAllocated{nodeid="node1", deviceuuid="gpu-uuid-123", deviceidx="0", devicename="gpu0", devicebrand="Tesla", deviceproductname="Tesla V100", podnamespace="default", podname="my-pod"} 8000
```
`vGPUDeviceCoreAllocated`: vGPU device cores allocated for a container. Labels: same as `vGPUDeviceMemoryAllocated`.
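These series combine naturally in PromQL; for example (a sketch, assuming the metric names above are scraped unmodified):

```promql
# Fraction of each GPU's memory currently allocated
GPUDeviceMemoryAllocated / GPUDeviceMemoryLimit

# Total vGPU memory allocated per pod, summed across devices
sum by (podnamespace, podname) (vGPUDeviceMemoryAllocated)
```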
Metrics endpoint:

- Path: `/metrics`
- Port: `8080` (configurable via `metricsBindAddress`)
- Format: Prometheus text format
- Access: `http://<service-address>:8080/metrics`
Health probe endpoints:

- Liveness Probe: `/healthz` on port `8000`
- Readiness Probe: `/readyz` on port `8000`
- Access: `http://<service-address>:8000/healthz` or `/readyz`
The readiness probe returns 200 OK when the cache is synced and ready, and 503 Service Unavailable otherwise.
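You can hit the probes directly to confirm the monitor's state (a sketch; the `deploy/hami-dra-monitor` name is assumed from the chart's naming, since the probe port may not be exposed on the service):

```bash
kubectl port-forward deploy/hami-dra-monitor 8000:8000 -n <namespace>

# Expect 200 once the cache is synced, 503 before that
curl -i http://localhost:8000/readyz
curl -i http://localhost:8000/healthz
```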
To automatically discover and scrape metrics from the monitor, add the following to your Prometheus configuration:
```yaml
scrape_configs:
  - job_name: 'hami-dra-monitor'
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
            - <monitor-namespace>  # Replace with your namespace
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: hami-dra-monitor
      - source_labels: [__meta_kubernetes_service_port_name]
        action: keep
        regex: metrics
```

Alternatively, you can use static configuration:
```yaml
scrape_configs:
  - job_name: 'hami-dra-monitor'
    static_configs:
      - targets:
          - 'hami-dra-monitor.<namespace>.svc.cluster.local:8080'
```
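If you manage the Prometheus configuration file directly, it is worth validating it before reloading (`promtool` ships with Prometheus; the file name here is illustrative):

```bash
promtool check config prometheus.yml
```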
Default resource requests and limits:

```yaml
monitor:
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 128Mi
```

Adjust these values based on your cluster size and monitoring requirements.
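For example, to raise the memory limit on an existing release (a sketch; the release name matches the install command above):

```bash
helm upgrade hami-dra ./charts/hami-dra \
  --set monitor.resources.limits.memory=1Gi
```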
```bash
# Check pod status
kubectl get pods -l app.kubernetes.io/component=monitor -n <namespace>

# Check logs
kubectl logs -l app.kubernetes.io/component=monitor -n <namespace>

# Check service
kubectl get svc hami-dra-monitor -n <namespace>
```

```bash
# Port-forward to access metrics
kubectl port-forward svc/hami-dra-monitor 8080:8080 -n <namespace>

# Check metrics
curl http://localhost:8080/metrics | grep GPUDevice
```

The monitor requires the cache to be synced before it can collect metrics. Check the logs for:
```text
Cache started and synced successfully
Cache is ready
```
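To search for these lines directly (a sketch):

```bash
kubectl logs -l app.kubernetes.io/component=monitor -n <namespace> | grep -i cache
```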
If the cache fails to sync, check:
- RBAC permissions for ResourceSlice and ResourceClaim resources
- Network connectivity to the Kubernetes API server
- That ResourceSlice and ResourceClaim resources exist in the cluster
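A few commands that cover these checks (a sketch; the `hami-dra-monitor` service account name is an assumption based on the chart's naming):

```bash
# Verify the monitor's RBAC for DRA resources
kubectl auth can-i list resourceslices.resource.k8s.io \
  --as=system:serviceaccount:<namespace>:hami-dra-monitor
kubectl auth can-i watch resourceclaims.resource.k8s.io \
  --as=system:serviceaccount:<namespace>:hami-dra-monitor

# Verify the resources exist
kubectl get resourceslices
kubectl get resourceclaims --all-namespaces
```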
The monitor component consists of:
- Cache Layer: Maintains an in-memory cache of ResourceSlice and ResourceClaim resources
- Metrics Collector: Implements Prometheus Collector interface to gather metrics from the cache
- HTTP Servers:
  - Metrics server on port 8080
  - Health probe server on port 8000
The monitor uses Kubernetes informers to watch ResourceSlice and ResourceClaim resources, ensuring the cache stays up-to-date with cluster state.
- Cache Sync: The monitor waits for the ResourceSlice cache to sync before processing ResourceClaim events, to ensure data consistency
- Concurrency: Uses node-level locking to minimize contention when updating device usage
- API Rate Limiting: Configure `kubeAPIQPS` and `kubeAPIBurst` to control API server load
- Metrics Collection: Metrics are collected on demand when Prometheus scrapes the endpoint, not on a fixed interval