Running Active Health Checks for GPU Nodes (Preview)

Note

This is a preview feature. Additional tests are being actively developed and added.

Active health checks provide automated, periodic validation of GPU & RDMA functionality on OKE nodes. These checks run as CronJobs that test GPU nodes during idle periods and apply labels indicating the health status of each node.

Overview

Available Health Check Types

Five types of active health checks are available:

NCCL Tests - Multi-node GPU communication tests using NVIDIA NCCL (NVIDIA GPUs)
RCCL Tests - Multi-node GPU communication tests using AMD RCCL (AMD GPUs)
GPU Fryer - Single-node GPU stress testing (NVIDIA GPUs)
RVS - Single-node GPU validation using ROCm Validation Suite (AMD GPUs)
DCGM Diagnostics - Host-level GPU diagnostics using NVIDIA DCGM (NVIDIA GPUs)

How It Works

Each health check runs as a CronJob that:

Identifies idle GPU nodes that have not yet been tested on the current UTC date
Executes the appropriate test workload
Applies labels to nodes with pass/fail results and timestamps
Automatically cleans up completed jobs after 5 minutes

Prerequisites

OKE cluster with GPU nodes
kubectl access with cluster-admin privileges
Kueue installed
MPI Operator installed (for NCCL and RCCL tests)
Monitoring namespace (or permission to create it)

Architecture

Common Features

All health checks share these characteristics:

Low Priority: Use the active-health-checks-low PriorityClass to avoid disrupting production workloads
Idle Node Selection: Only test nodes with zero GPU allocation
Daily Testing: Skip nodes already tested today (based on UTC date)
Automatic Labeling: Apply pass/fail labels and timestamps to tested nodes
Self-Cleaning: Jobs auto-delete after completion (TTL 5 minutes)
Hourly Schedule: Run every hour (configurable via schedule field)

Node Labels

Each health check applies two labels to tested nodes:

| Health Check | Pass/Fail Label | Timestamp Label |
| --- | --- | --- |
| NCCL Tests | `oke.oraclecloud.com/active-health-checks-nccl-tests` | `oke.oraclecloud.com/active-health-checks-nccl-tests-last-run` |
| RCCL Tests | `oke.oraclecloud.com/active-health-checks-rccl-tests` | `oke.oraclecloud.com/active-health-checks-rccl-tests-last-run` |
| GPU Fryer | `oke.oraclecloud.com/active-health-checks-gpu-fryer` | `oke.oraclecloud.com/active-health-checks-gpu-fryer-last-run` |
| RVS | `oke.oraclecloud.com/active-health-checks-rvs` | `oke.oraclecloud.com/active-health-checks-rvs-last-run` |
| DCGM Diagnostics | `oke.oraclecloud.com/active-health-checks-dcgm-diag` | `oke.oraclecloud.com/active-health-checks-dcgm-diag-last-run` |

Label values:

Pass/Fail: pass or fail
Timestamp: UTC time in an ISO 8601-style format with hyphens in place of colons, e.g., 2025-10-01T14-30-00Z
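
The hyphens in the time part are needed because Kubernetes label values cannot contain colons. As a minimal sketch, the timestamp format could be produced and converted back to standard ISO 8601 as follows. The function names `last_run_stamp` and `stamp_to_iso8601` are hypothetical helpers for illustration; they are not part of the health check manifests.

```shell
# Print the current UTC time in the label-safe format, e.g. 2025-10-01T14-30-00Z.
# (Hypothetical helper; the health check jobs may build the value differently.)
last_run_stamp() {
  date -u +%Y-%m-%dT%H-%M-%SZ
}

# Convert a label-safe timestamp back to standard ISO 8601 by restoring
# the colons in the time part. Input that does not match is printed unchanged.
stamp_to_iso8601() {
  printf '%s\n' "$1" |
    sed 's/T\([0-9][0-9]\)-\([0-9][0-9]\)-\([0-9][0-9]\)Z$/T\1:\2:\3Z/'
}
```

For example, `stamp_to_iso8601 2025-10-01T14-30-00Z` prints `2025-10-01T14:30:00Z`, which most date parsers accept.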

RBAC Permissions

All five health checks use the same RBAC configuration:

ServiceAccount: active-health-checks-runner (in monitoring namespace)
ClusterRole: active-health-checks-runner-role

The RBAC permissions allow the health check jobs to:

List and describe nodes
Read pod information to determine GPU allocation
Label nodes with test results
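
One way to confirm that these permissions are in effect is to impersonate the ServiceAccount with `kubectl auth can-i`. The sketch below assumes that labeling is performed through the `patch` verb on nodes, which the source does not state; the function names are hypothetical.

```shell
# Build the user name that Kubernetes assigns to a ServiceAccount.
# $1: namespace, $2: ServiceAccount name
sa_subject() {
  printf 'system:serviceaccount:%s:%s' "$1" "$2"
}

# Ask the API server whether the runner ServiceAccount may perform each action.
# Requires kubectl access with permission to impersonate ServiceAccounts.
check_runner_rbac() {
  subject=$(sa_subject monitoring active-health-checks-runner)
  kubectl auth can-i list nodes --as="$subject"
  # Assumption: node labels are applied with the patch verb.
  kubectl auth can-i patch nodes --as="$subject"
  kubectl auth can-i list pods --all-namespaces --as="$subject"
}
```

Each `kubectl auth can-i` call prints `yes` or `no`, so running `check_runner_rbac` after deployment gives a quick view of what the ServiceAccount is allowed to do.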

Deployment

Step 1: Install Prerequisites

Install Kueue and the MPI Operator (the MPI Operator is required for the NCCL and RCCL tests):

helm install kueue oci://registry.k8s.io/kueue/charts/kueue --version="0.14.2" --create-namespace --namespace=kueue-system

kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml

Step 2: Deploy Active Health Checks

Deploy the health check CronJobs that match the GPU vendor of your cluster:

For NVIDIA GPU clusters:

kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-nccl-tests.yaml
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-gpu-fryer.yaml
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-dcgm-diag.yaml

For AMD GPU clusters:

kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rccl-tests.yaml
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rvs.yaml

Step 3: Verify Deployment

Check that the CronJobs have been created:

kubectl get cronjobs -n monitoring

Example output (NVIDIA GPU clusters):

NAME                                       SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
active-health-checks-dcgm-diag-applier     0 * * * *     False     0        <none>          10s
active-health-checks-gpu-fryer-applier     0 * * * *     False     0        <none>          10s
active-health-checks-nccl-tests-applier    0 * * * *     False     0        <none>          10s

Example output (AMD GPU clusters):

NAME                                       SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
active-health-checks-rccl-tests-applier    0 * * * *     False     0        <none>          10s
active-health-checks-rvs-applier           0 * * * *     False     0        <none>          10s

Node Selection Logic

All health checks follow this selection process:

1. Find GPU Nodes: Query nodes with appropriate GPU label
   - NVIDIA tests: nvidia.com/gpu=true label
   - AMD tests: amd.com/gpu=true label
2. Check Idle Status: Calculate GPU usage from pod requests
   - Only nodes with 0 GPU allocation are considered
3. Check Last Run: Parse *-last-run timestamp label
   - Skip nodes tested today (same UTC date)
4. Select Nodes:
   - NCCL/RCCL: Pick 2+ nodes of same shape
   - GPU Fryer: Pick 1 node
   - RVS: Pick 1 node
   - DCGM: Pick 1 node

This ensures:

Running workloads are not disrupted, because only idle nodes are selected
Each node is tested at most once per day
Tests run on available capacity
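
The idle and last-run checks above can be sketched as two small shell predicates. This is an illustration of the described rules, not the script shipped in the manifests: the function names are hypothetical, and the GPU request count is assumed to have been computed already from the pods scheduled on the node.

```shell
# Succeeds when the last-run label value carries today's UTC date.
# $1: label value such as 2025-10-01T14-30-00Z (empty if the node was never tested)
# $2: optional date override in YYYY-MM-DD form; defaults to the current UTC date
tested_today() {
  stamp=$1
  today=${2:-$(date -u +%Y-%m-%d)}
  # Everything before the first "T" is the date part of the timestamp.
  [ "${stamp%%T*}" = "$today" ]
}

# Succeeds when a node qualifies for testing: no GPUs requested and not tested today.
# $1: number of GPUs requested by pods on the node
# $2: last-run label value
# $3: optional date override, passed through to tested_today
is_candidate() {
  [ "$1" -eq 0 ] && ! tested_today "$2" "${3:-}"
}
```

With the date fixed to 2025-10-01, `is_candidate 0 2025-09-30T23-59-59Z 2025-10-01` succeeds, while a node with 8 requested GPUs or a last-run value of `2025-10-01T00-00-01Z` is skipped.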

Monitoring Health Check Results

View Node Labels

Check the health status labels on a specific node:

kubectl get node <node-name> --show-labels | grep active-health-checks

View all nodes with their health check labels:

For NVIDIA GPU nodes:

kubectl get nodes -o custom-columns=NAME:.metadata.name,NCCL:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-nccl-tests,GPU_FRYER:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-gpu-fryer,DCGM:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-dcgm-diag

For AMD GPU nodes:

kubectl get nodes -o custom-columns=NAME:.metadata.name,RCCL:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-rccl-tests,RVS:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-rvs

Identify Failed Nodes

List the nodes that have failed each health check:

# NVIDIA GPU nodes
kubectl get nodes -l oke.oraclecloud.com/active-health-checks-nccl-tests=fail -o wide
kubectl get nodes -l oke.oraclecloud.com/active-health-checks-gpu-fryer=fail -o wide
kubectl get nodes -l oke.oraclecloud.com/active-health-checks-dcgm-diag=fail -o wide

# AMD GPU nodes
kubectl get nodes -l oke.oraclecloud.com/active-health-checks-rccl-tests=fail -o wide
kubectl get nodes -l oke.oraclecloud.com/active-health-checks-rvs=fail -o wide
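
A label selector with several comma-separated requirements matches only nodes that satisfy all of them, so finding nodes that failed any check takes one query per label. The following sketch wraps those queries in a loop; the function names are hypothetical, and the check names are the label suffixes from the Node Labels table.

```shell
# Print one "fail" label selector per health check name given as an argument.
failed_selectors() {
  for check in "$@"; do
    printf 'oke.oraclecloud.com/active-health-checks-%s=fail\n' "$check"
  done
}

# Print the failing nodes for every given check (requires kubectl access).
list_failed_nodes() {
  failed_selectors "$@" | while read -r selector; do
    echo "== $selector"
    kubectl get nodes -l "$selector" -o name
  done
}
```

On an NVIDIA cluster this might be called as `list_failed_nodes nccl-tests gpu-fryer dcgm-diag`, and on an AMD cluster as `list_failed_nodes rccl-tests rvs`.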

View Health Check Job Logs

Check the logs of recent health check jobs:

# List recent jobs
kubectl get jobs -n monitoring

# View logs from a specific job
kubectl logs -n monitoring job/<job-name>

Manual Test Execution

To manually trigger a health check outside the regular schedule:

# Create a one-off job from the CronJob
# NVIDIA GPU tests
kubectl create job -n monitoring manual-nccl-test --from=cronjob/active-health-checks-nccl-tests-applier
kubectl create job -n monitoring manual-fryer-test --from=cronjob/active-health-checks-gpu-fryer-applier
kubectl create job -n monitoring manual-dcgm-test --from=cronjob/active-health-checks-dcgm-diag-applier

# AMD GPU tests
kubectl create job -n monitoring manual-rccl-test --from=cronjob/active-health-checks-rccl-tests-applier
kubectl create job -n monitoring manual-rvs-test --from=cronjob/active-health-checks-rvs-applier

To make a specific node eligible for testing again on the same day, remove its last-run timestamp label:

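# For NVIDIA nodes
kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-nccl-tests-last-run-
kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-gpu-fryer-last-run-
kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-dcgm-diag-last-run-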

# For AMD nodes
kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-rccl-tests-last-run-
kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-rvs-last-run-

The next CronJob execution will then select this node for testing.

Configuration

Adjusting Test Schedule

By default, health checks run every hour (0 * * * *). To modify the schedule:

Edit the CronJob:

kubectl edit cronjob active-health-checks-nccl-tests-applier -n monitoring

Update the schedule field to your desired cron expression.
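
As a non-interactive alternative, the schedule can also be changed with `kubectl patch`, in the same way the suspend flag is patched elsewhere in this guide. The sketch below is one possible approach; the function names and the six-hour schedule in the usage note are illustrative assumptions.

```shell
# Build a patch document that sets the CronJob schedule.
# $1: cron expression, e.g. "0 */6 * * *"
schedule_patch() {
  printf '{"spec":{"schedule":"%s"}}' "$1"
}

# Apply the schedule to one CronJob in the monitoring namespace (requires kubectl access).
# $1: CronJob name, $2: cron expression
set_schedule() {
  kubectl patch cronjob "$1" -n monitoring -p "$(schedule_patch "$2")"
}
```

For example, `set_schedule active-health-checks-nccl-tests-applier "0 */6 * * *"` would run the NCCL check every six hours instead of every hour. Quote the cron expression so the shell does not expand the asterisks.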

Customizing Test Parameters

Each health check manifest can be customized with different parameters:

NCCL Tests: Number of nodes, GPU count, NCCL parameters
RCCL Tests: Number of nodes, GPU count, RCCL parameters
GPU Fryer: Stress duration, temperature thresholds
RVS: Test recipe, iterations, timeout, validation tests
DCGM Diagnostics: Diagnostic level, specific tests to run

For custom configurations, download the manifests, modify them locally, and then apply the modified files.

Suspending Health Checks

To temporarily disable health checks (e.g., during maintenance):

# Suspend a specific health check
kubectl patch cronjob active-health-checks-nccl-tests-applier -n monitoring -p '{"spec":{"suspend":true}}'

# Resume the health check
kubectl patch cronjob active-health-checks-nccl-tests-applier -n monitoring -p '{"spec":{"suspend":false}}'

Uninstalling

To remove active health checks:

For NVIDIA GPU clusters:

kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-nccl-tests.yaml
kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-gpu-fryer.yaml
kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-dcgm-diag.yaml

For AMD GPU clusters:

kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rccl-tests.yaml
kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rvs.yaml

Note

Node labels applied by health checks will remain after uninstalling. To remove them, manually delete the labels from each node.
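
The label cleanup can be scripted rather than done node by node. The following is a minimal sketch that relies on the `key-` syntax of `kubectl label` to remove a label and on `--all` to target every node; the function names are hypothetical.

```shell
# Print the two label-removal arguments ("key-") for one health check name.
label_removals() {
  printf 'oke.oraclecloud.com/active-health-checks-%s- ' "$1"
  printf 'oke.oraclecloud.com/active-health-checks-%s-last-run-\n' "$1"
}

# Remove the pass/fail and timestamp labels of each given check from all nodes.
# Requires kubectl access with permission to label nodes.
remove_health_labels() {
  for check in "$@"; do
    # Word splitting is intended here: the output holds two separate arguments.
    kubectl label nodes --all $(label_removals "$check")
  done
}
```

On an NVIDIA cluster this might be called as `remove_health_labels nccl-tests gpu-fryer dcgm-diag`, and on an AMD cluster as `remove_health_labels rccl-tests rvs`.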
