Monitor Component Documentation

The monitor component is an optional feature of the HAMi DRA Webhook that collects GPU resource metrics and exposes them in Prometheus format.

Overview

The monitor component watches Kubernetes ResourceSlice and ResourceClaim resources, maintains an in-memory cache of GPU device allocations, and exposes Prometheus metrics for monitoring GPU resource usage across the cluster.

Installation

The monitor component is enabled by default when installing the Helm chart. To disable it:

helm install hami-dra ./charts/hami-dra \
--set monitor.enabled=false

Configuration

Basic Configuration

Configure the monitor in charts/hami-dra/values.yaml:

monitor:
  enabled: true
  replicas: 1
  logLevel: 2
  metricsBindAddress: ":8080"
  healthProbeBindAddress: ":8000"
  kubeAPIQPS: 40.0
  kubeAPIBurst: 60
  collectInterval: "30s"

Configuration Parameters:

  • enabled: Enable or disable the monitor component (default: true)
  • replicas: Number of monitor pod replicas (default: 1)
  • logLevel: Log verbosity level (default: 2)
  • metricsBindAddress: Address and port for metrics endpoint (default: :8080)
  • healthProbeBindAddress: Address and port for health probe endpoints (default: :8000)
  • kubeAPIQPS: QPS limit for Kubernetes API client (default: 40.0)
  • kubeAPIBurst: Burst limit for Kubernetes API client (default: 60)
  • collectInterval: Interval for metrics collection (default: 30s)
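
These values can be overridden at install or upgrade time with --set flags. For example (the values shown are illustrative, not tuned recommendations):

helm upgrade --install hami-dra ./charts/hami-dra \
  --set monitor.collectInterval=15s \
  --set monitor.kubeAPIQPS=80 \
  --set monitor.kubeAPIBurst=120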

Service Configuration

The monitor service can be configured to use different service types depending on your access requirements.

ClusterIP (Default)

Use ClusterIP for internal cluster access:

monitor:
  enabled: true
  service:
    type: ClusterIP

Access metrics via port-forward:

kubectl port-forward svc/hami-dra-monitor 8080:8080 -n <namespace>
curl http://localhost:8080/metrics

NodePort

Use NodePort to expose metrics outside the cluster:

With specified ports:

monitor:
  enabled: true
  service:
    type: NodePort
    nodePort:
      metrics: 30080  # NodePort for metrics endpoint

With auto-assigned ports:

monitor:
  enabled: true
  service:
    type: NodePort
    nodePort:
      metrics: ""  # Kubernetes will assign a random port

Access metrics via NodePort:

# Get the NodePort
kubectl get svc hami-dra-monitor -n <namespace> -o jsonpath='{.spec.ports[?(@.name=="metrics")].nodePort}'

# Access metrics
curl http://<node-ip>:<nodeport>/metrics

LoadBalancer

Use LoadBalancer for cloud provider load balancer integration:

monitor:
  enabled: true
  service:
    type: LoadBalancer
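
Once the cloud provider provisions the load balancer, retrieve its address and access the metrics endpoint. The example below assumes the provider reports an external IP; some providers report a hostname instead:

# Get the external IP assigned by the cloud provider
kubectl get svc hami-dra-monitor -n <namespace> -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Access metrics
curl http://<external-ip>:8080/metrics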

Exposed Metrics

The monitor exposes the following Prometheus metrics:

Node-Level Metrics

GPUDeviceMemoryLimit

Device memory limit for a GPU (in MB).

Labels:

  • nodeid: Kubernetes node name
  • deviceuuid: GPU device UUID
  • deviceidx: Device index on the node
  • devicename: Device name
  • devicebrand: Device brand (e.g., Tesla)
  • deviceproductname: Device product name (e.g., Tesla V100)

Example:

GPUDeviceMemoryLimit{nodeid="node1",deviceuuid="gpu-uuid-123",deviceidx="0",devicename="gpu0",devicebrand="Tesla",deviceproductname="Tesla V100"} 16000

GPUDeviceCoreLimit

Device core limit for a GPU.

Labels: Same as GPUDeviceMemoryLimit

GPUDeviceMemoryAllocated

Device memory currently allocated for a GPU (in MB).

Labels: Same as GPUDeviceMemoryLimit

GPUDeviceCoreAllocated

Device cores currently allocated for a GPU.

Labels: Same as GPUDeviceMemoryLimit
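
Because the allocation and limit metrics share the same label set, per-device utilization can be derived directly in PromQL. A minimal sketch as Prometheus recording rules (the rule names are arbitrary examples, not part of the monitor):

groups:
  - name: hami-gpu-utilization
    rules:
      # Fraction of device memory currently allocated, per GPU
      - record: gpu:memory_allocation:ratio
        expr: GPUDeviceMemoryAllocated / GPUDeviceMemoryLimit
      # Fraction of device cores currently allocated, per GPU
      - record: gpu:core_allocation:ratio
        expr: GPUDeviceCoreAllocated / GPUDeviceCoreLimit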

Pod-Level Metrics

vGPUDeviceMemoryAllocated

vGPU device memory allocated for a container (in MB).

Labels:

  • nodeid: Kubernetes node name
  • deviceuuid: GPU device UUID
  • deviceidx: Device index on the node
  • devicename: Device name
  • devicebrand: Device brand
  • deviceproductname: Device product name
  • podnamespace: Pod namespace
  • podname: Pod name

Example:

vGPUDeviceMemoryAllocated{nodeid="node1",deviceuuid="gpu-uuid-123",deviceidx="0",devicename="gpu0",devicebrand="Tesla",deviceproductname="Tesla V100",podnamespace="default",podname="my-pod"} 8000

vGPUDeviceCoreAllocated

vGPU device cores allocated for a container.

Labels: Same as vGPUDeviceMemoryAllocated
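
The pod-level series can be aggregated in PromQL to answer per-pod or per-namespace questions, for example:

# Total vGPU memory (MB) allocated per pod
sum by (podnamespace, podname) (vGPUDeviceMemoryAllocated)

# Total vGPU memory (MB) allocated per namespace
sum by (podnamespace) (vGPUDeviceMemoryAllocated)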

Endpoints

Metrics Endpoint

  • Path: /metrics
  • Port: 8080 (configurable via metricsBindAddress)
  • Format: Prometheus text format
  • Access: http://<service-address>:8080/metrics

Health Check Endpoints

  • Liveness Probe: /healthz on port 8000
  • Readiness Probe: /readyz on port 8000
  • Access: http://<service-address>:8000/healthz or /readyz

The readiness probe returns 200 OK once the cache is synced and ready, and 503 Service Unavailable otherwise.
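
To inspect the probes manually, port-forward the probe port and query the endpoints. The deployment name below mirrors the service name and is an assumption; the probe port may also not be exposed on the service, hence the deployment-level forward:

# Port-forward the monitor deployment's probe port
kubectl port-forward deploy/hami-dra-monitor 8000:8000 -n <namespace>

# -i prints the status line (200 when ready, 503 otherwise)
curl -i http://localhost:8000/readyz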

Prometheus Integration

Service Discovery Configuration

To automatically discover and scrape metrics from the monitor, add the following to your Prometheus configuration:

scrape_configs:
  - job_name: 'hami-dra-monitor'
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
            - <monitor-namespace>  # Replace with your namespace
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: hami-dra-monitor
      - source_labels: [__meta_kubernetes_service_port_name]
        action: keep
        regex: metrics

Static Configuration

Alternatively, you can use static configuration:

scrape_configs:
  - job_name: 'hami-dra-monitor'
    static_configs:
      - targets:
        - 'hami-dra-monitor.<namespace>.svc.cluster.local:8080'

Resource Requirements

Default resource requests and limits:

monitor:
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 128Mi

Adjust these values based on your cluster size and monitoring requirements.
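
For example, a larger cluster with many ResourceSlices may warrant more memory. One way is a custom values file (the sizes below are illustrative):

# my-values.yaml
monitor:
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi

helm upgrade hami-dra ./charts/hami-dra -f my-values.yaml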

Troubleshooting

Check Monitor Status

# Check pod status
kubectl get pods -l app.kubernetes.io/component=monitor -n <namespace>

# Check logs
kubectl logs -l app.kubernetes.io/component=monitor -n <namespace>

# Check service
kubectl get svc hami-dra-monitor -n <namespace>

Verify Metrics Endpoint

# Port-forward to access metrics
kubectl port-forward svc/hami-dra-monitor 8080:8080 -n <namespace>

# Check metrics
curl http://localhost:8080/metrics | grep GPUDevice

Check Cache Sync Status

The monitor requires the cache to be synced before it can collect metrics. Check the logs for:

Cache started and synced successfully
Cache is ready

If the cache fails to sync, check:

  • RBAC permissions for ResourceSlice and ResourceClaim resources (see the check below)
  • Network connectivity to the Kubernetes API server
  • That ResourceSlice and ResourceClaim resources actually exist in the cluster
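
To verify the RBAC permissions above, kubectl auth can-i can impersonate the monitor's service account. The service account name hami-dra-monitor is an assumption; check what your install actually created:

# Check access to the resource.k8s.io API group
kubectl auth can-i list resourceslices.resource.k8s.io \
  --as=system:serviceaccount:<namespace>:hami-dra-monitor
kubectl auth can-i watch resourceclaims.resource.k8s.io --all-namespaces \
  --as=system:serviceaccount:<namespace>:hami-dra-monitor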

Architecture

The monitor component consists of:

  1. Cache Layer: Maintains an in-memory cache of ResourceSlice and ResourceClaim resources
  2. Metrics Collector: Implements Prometheus Collector interface to gather metrics from the cache
  3. HTTP Servers:
    • Metrics server on port 8080
    • Health probe server on port 8000

The monitor uses Kubernetes informers to watch ResourceSlice and ResourceClaim resources, ensuring the cache stays up-to-date with cluster state.

Performance Considerations

  • Cache Sync: The monitor waits for ResourceSlice cache to sync before processing ResourceClaim events to ensure data consistency
  • Concurrency: Uses node-level locking to minimize contention when updating device usage
  • API Rate Limiting: Configure kubeAPIQPS and kubeAPIBurst to control API server load
  • Metrics Collection: Metrics are collected on-demand when Prometheus scrapes the endpoint, not on a fixed interval
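
Because collection happens at scrape time, the Prometheus scrape_interval effectively sets the collection frequency and its cost. A sketch building on the static configuration above (the 30s value is illustrative):

scrape_configs:
  - job_name: 'hami-dra-monitor'
    scrape_interval: 30s
    static_configs:
      - targets:
        - 'hami-dra-monitor.<namespace>.svc.cluster.local:8080'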