Monitor Component Documentation

The monitor component is an optional feature of the HAMi DRA Webhook that collects GPU resource metrics and exposes them in Prometheus format.

Overview

The monitor component watches Kubernetes ResourceSlice and ResourceClaim resources, maintains an in-memory cache of GPU device allocations, and exposes Prometheus metrics for monitoring GPU resource usage across the cluster.

Installation

The monitor component is enabled by default when installing the Helm chart. To disable it:

helm install hami-dra ./charts/hami-dra \
--set monitor.enabled=false

Configuration

Basic Configuration

Configure the monitor in charts/hami-dra/values.yaml:

monitor:
  enabled: true
  replicas: 1
  logLevel: 2
  metricsBindAddress: ":8080"
  healthProbeBindAddress: ":8000"
  kubeAPIQPS: 40.0
  kubeAPIBurst: 60
  collectInterval: "30s"

Configuration Parameters:

  • enabled: Enable or disable the monitor component (default: true)
  • replicas: Number of monitor pod replicas (default: 1)
  • logLevel: Log verbosity level (default: 2)
  • metricsBindAddress: Address and port for metrics endpoint (default: :8080)
  • healthProbeBindAddress: Address and port for health probe endpoints (default: :8000)
  • kubeAPIQPS: QPS limit for Kubernetes API client (default: 40.0)
  • kubeAPIBurst: Burst limit for Kubernetes API client (default: 60)
  • collectInterval: Interval for metrics collection (default: 30s)
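
These values can be overridden at install or upgrade time with --set flags. For example (the values shown are illustrative, not tuned recommendations):

helm upgrade --install hami-dra ./charts/hami-dra \
  --set monitor.collectInterval=15s \
  --set monitor.kubeAPIQPS=80 \
  --set monitor.kubeAPIBurst=120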

Service Configuration

The monitor service can be configured to use different service types depending on your access requirements.

ClusterIP (Default)

Use ClusterIP for internal cluster access:

monitor:
  enabled: true
  service:
    type: ClusterIP

Access metrics via port-forward:

kubectl port-forward svc/hami-dra-monitor 8080:8080 -n <namespace>
curl http://localhost:8080/metrics

NodePort

Use NodePort to expose metrics outside the cluster:

With specified ports:

monitor:
  enabled: true
  service:
    type: NodePort
    nodePort:
      metrics: 30080  # NodePort for metrics endpoint

With auto-assigned ports:

monitor:
  enabled: true
  service:
    type: NodePort
    nodePort:
      metrics: ""  # Kubernetes will assign a random port

Access metrics via NodePort:

# Get the NodePort
kubectl get svc hami-dra-monitor -n <namespace> -o jsonpath='{.spec.ports[?(@.name=="metrics")].nodePort}'

# Access metrics
curl http://<node-ip>:<nodeport>/metrics

LoadBalancer

Use LoadBalancer for cloud provider load balancer integration:

monitor:
  enabled: true
  service:
    type: LoadBalancer
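
Once the cloud provider provisions the load balancer, retrieve its address and access the metrics endpoint. The example below assumes the provider reports an external IP; some providers report a hostname instead:

# Get the external IP assigned by the cloud provider
kubectl get svc hami-dra-monitor -n <namespace> -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Access metrics
curl http://<external-ip>:8080/metrics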

Exposed Metrics

The monitor exposes the following Prometheus metrics:

Node-Level Metrics

GPUDeviceMemoryLimit

Device memory limit for a GPU (in MB).

Labels:

  • nodeid: Kubernetes node name
  • deviceuuid: GPU device UUID
  • deviceidx: Device index on the node
  • devicename: Device name
  • devicebrand: Device brand (e.g., Tesla)
  • deviceproductname: Device product name (e.g., Tesla V100)

Example:

GPUDeviceMemoryLimit{nodeid="node1",deviceuuid="gpu-uuid-123",deviceidx="0",devicename="gpu0",devicebrand="Tesla",deviceproductname="Tesla V100"} 16000

GPUDeviceCoreLimit

Device core limit for a GPU.

Labels: Same as GPUDeviceMemoryLimit

GPUDeviceMemoryAllocated

Device memory currently allocated for a GPU (in MB).

Labels: Same as GPUDeviceMemoryLimit

GPUDeviceCoreAllocated

Device cores currently allocated for a GPU.

Labels: Same as GPUDeviceMemoryLimit
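
Because the allocation and limit metrics share the same label set, per-device utilization can be derived directly in PromQL. A minimal sketch as Prometheus recording rules (the rule names are arbitrary examples, not part of the monitor):

groups:
  - name: hami-gpu-utilization
    rules:
      # Fraction of device memory currently allocated, per GPU
      - record: gpu:memory_allocation:ratio
        expr: GPUDeviceMemoryAllocated / GPUDeviceMemoryLimit
      # Fraction of device cores currently allocated, per GPU
      - record: gpu:core_allocation:ratio
        expr: GPUDeviceCoreAllocated / GPUDeviceCoreLimit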

Pod-Level Metrics

vGPUDeviceMemoryAllocated

vGPU device memory allocated for a container (in MB).

Labels:

  • nodeid: Kubernetes node name
  • deviceuuid: GPU device UUID
  • deviceidx: Device index on the node
  • devicename: Device name
  • devicebrand: Device brand
  • deviceproductname: Device product name
  • podnamespace: Pod namespace
  • podname: Pod name

Example:

vGPUDeviceMemoryAllocated{nodeid="node1",deviceuuid="gpu-uuid-123",deviceidx="0",devicename="gpu0",devicebrand="Tesla",deviceproductname="Tesla V100",podnamespace="default",podname="my-pod"} 8000

vGPUDeviceCoreAllocated

vGPU device cores allocated for a container.

Labels: Same as vGPUDeviceMemoryAllocated
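
The pod-level series can be aggregated in PromQL to answer per-pod or per-namespace questions, for example:

# Total vGPU memory (MB) allocated per pod
sum by (podnamespace, podname) (vGPUDeviceMemoryAllocated)

# Total vGPU memory (MB) allocated per namespace
sum by (podnamespace) (vGPUDeviceMemoryAllocated)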

Endpoints

Metrics Endpoint

  • Path: /metrics
  • Port: 8080 (configurable via metricsBindAddress)
  • Format: Prometheus text format
  • Access: http://<service-address>:8080/metrics

Health Check Endpoints

  • Liveness Probe: /healthz on port 8000
  • Readiness Probe: /readyz on port 8000
  • Access: http://<service-address>:8000/healthz or /readyz

The readiness probe returns 200 OK once the cache is synced and ready, and 503 Service Unavailable otherwise.
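
To inspect the probes manually, port-forward the probe port and query the endpoints. The deployment name below mirrors the service name and is an assumption; the probe port may also not be exposed on the service, hence the deployment-level forward:

# Port-forward the monitor deployment's probe port
kubectl port-forward deploy/hami-dra-monitor 8000:8000 -n <namespace>

# -i prints the status line (200 when ready, 503 otherwise)
curl -i http://localhost:8000/readyz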

Prometheus Integration

Service Discovery Configuration

To automatically discover and scrape metrics from the monitor, add the following to your Prometheus configuration:

scrape_configs:
  - job_name: 'hami-dra-monitor'
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
            - <monitor-namespace>  # Replace with your namespace
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: hami-dra-monitor
      - source_labels: [__meta_kubernetes_service_port_name]
        action: keep
        regex: metrics

Static Configuration

Alternatively, you can use static configuration:

scrape_configs:
  - job_name: 'hami-dra-monitor'
    static_configs:
      - targets:
        - 'hami-dra-monitor.<namespace>.svc.cluster.local:8080'

Resource Requirements

Default resource requests and limits:

monitor:
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 100m
      memory: 128Mi

Adjust these values based on your cluster size and monitoring requirements.
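
For example, a larger cluster with many ResourceSlices may warrant more memory. One way is a custom values file (the sizes below are illustrative):

# my-values.yaml
monitor:
  resources:
    limits:
      cpu: 500m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 256Mi

helm upgrade hami-dra ./charts/hami-dra -f my-values.yaml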

Troubleshooting

Check Monitor Status

# Check pod status
kubectl get pods -l app.kubernetes.io/component=monitor -n <namespace>

# Check logs
kubectl logs -l app.kubernetes.io/component=monitor -n <namespace>

# Check service
kubectl get svc hami-dra-monitor -n <namespace>

Verify Metrics Endpoint

# Port-forward to access metrics
kubectl port-forward svc/hami-dra-monitor 8080:8080 -n <namespace>

# Check metrics
curl http://localhost:8080/metrics | grep GPUDevice

Check Cache Sync Status

The monitor requires the cache to be synced before it can collect metrics. Check the logs for:

Cache started and synced successfully
Cache is ready

If the cache fails to sync, check:

  • RBAC permissions for ResourceSlice and ResourceClaim resources (see the check below)
  • Network connectivity to the Kubernetes API server
  • That ResourceSlice and ResourceClaim resources actually exist in the cluster
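
To verify the RBAC permissions above, kubectl auth can-i can impersonate the monitor's service account. The service account name hami-dra-monitor is an assumption; check what your install actually created:

# Check access to the resource.k8s.io API group
kubectl auth can-i list resourceslices.resource.k8s.io \
  --as=system:serviceaccount:<namespace>:hami-dra-monitor
kubectl auth can-i watch resourceclaims.resource.k8s.io --all-namespaces \
  --as=system:serviceaccount:<namespace>:hami-dra-monitor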

Architecture

The monitor component consists of:

  1. Cache Layer: Maintains an in-memory cache of ResourceSlice and ResourceClaim resources
  2. Metrics Collector: Implements Prometheus Collector interface to gather metrics from the cache
  3. HTTP Servers:
    • Metrics server on port 8080
    • Health probe server on port 8000

The monitor uses Kubernetes informers to watch ResourceSlice and ResourceClaim resources, ensuring the cache stays up-to-date with cluster state.

Performance Considerations

  • Cache Sync: The monitor waits for ResourceSlice cache to sync before processing ResourceClaim events to ensure data consistency
  • Concurrency: Uses node-level locking to minimize contention when updating device usage
  • API Rate Limiting: Configure kubeAPIQPS and kubeAPIBurst to control API server load
  • Metrics Collection: Metrics are collected on-demand when Prometheus scrapes the endpoint, not on a fixed interval
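
Because collection happens at scrape time, the Prometheus scrape_interval effectively sets the collection frequency and its cost. A sketch building on the static configuration above (the 30s value is illustrative):

scrape_configs:
  - job_name: 'hami-dra-monitor'
    scrape_interval: 30s
    static_configs:
      - targets:
        - 'hami-dra-monitor.<namespace>.svc.cluster.local:8080'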