OKE NPD is an extension of https://github.com/kubernetes/node-problem-detector that processes GPU health check failures reported by GPU Scanner service and sets conditions on the affected nodes. This feature enables proactive monitoring of GPU node health and early detection of issues.
These actions should only be performed after a successful OCI GPU scanner service installation of control plane and data plane (plugin) components on individual GPU nodes.
❗❗IMPORTANT: NPD will only start processing GPU health check events when OCI GPU Scanner Service data plane plugin is actively running.
Label the target GPU nodes so they can host the NPD DaemonSet.
Only run this command on the GPU nodes that you would like to run the NPD feature on.
kubectl label node <nodeIP e.g 10.0.65.72> oci.oraclecloud.com/oke-node-problem-detector-enabled="true"Install the NPD DaemonSet.
kubectl apply -f https://github.com/oracle-quickstart/oci-gpu-scanner/blob/main/existing_cluster_deploy/oke-node-problem-detector.yamlVerify that NPD DaemonSet has been installed successfully and running.
kubectl get pods -l app=oke-node-problem-detector -o wide -n kube-systemResults should show oke-node-problem-detector in running state for all targeted GPU nodes.
To remove the NPD from an OKE cluster, run the below command
kubectl delete -f https://github.com/oracle-quickstart/oci-gpu-scanner/blob/main/existing_cluster_deploy/oke-node-problem-detector.yaml