Prerequisites
Bug Description
When NVSentinel is deployed with the GPU Operator's DCGM host engine and gpu-health-monitor, the metrics-access NetworkPolicy does not allow ingress on the DCGM host engine port (5555 by default). As a result, gpu-health-monitor pods cannot reach nvidia-dcgm.gpu-operator.svc; they reported connectivity issues until the network policy was patched.
Component
Health Monitor
Steps to Reproduce
When global.dcgm.enabled and gpu-health-monitor.dcgm.dcgmK8sServiceEnabled are set and a DCGM service endpoint is configured, the chart's metrics-access NetworkPolicy should allow ingress on that DCGM port so that gpu-health-monitor can talk to the DCGM host engine.
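
For illustration, the rule being requested might look roughly like the manifest below. This is a minimal sketch, not the chart's actual template: the namespace `nvsentinel` and the empty `podSelector` are assumptions, the policy's existing metrics rules are omitted, and the port is hard-coded to the default 5555 rather than taken from the configured DCGM endpoint.

```yaml
# Sketch only: shows the shape of the extra ingress rule, not the chart's real template.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: metrics-access
  namespace: nvsentinel      # assumption: the namespace NVSentinel is installed in
spec:
  podSelector: {}            # assumption: the real policy may select specific pods
  policyTypes:
    - Ingress
  ingress:
    # Existing metrics port rules would stay here (omitted in this sketch).
    - ports:
        - protocol: TCP
          port: 5555         # DCGM host engine port (default, per this report)
```

In the chart, a rule like this would presumably be rendered only when both global.dcgm.enabled and gpu-health-monitor.dcgm.dcgmK8sServiceEnabled are set, with the port derived from the configured DCGM service endpoint.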
Network Policy Link
Environment
- NVSentinel version:
- Kubernetes version:
- Deployment method:
Logs/Output
No response