
[Bug]: metrics-access NetworkPolicy blocks DCGM host engine port 5555 when using GPU Operator #880

@harinik05

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Bug Description

When NVSentinel is deployed with GPU Operator's DCGM host engine and gpu-health-monitor, the metrics-access NetworkPolicy does not allow ingress on the DCGM host engine port (5555 by default). As a result, gpu-health-monitor pods cannot reach nvidia-dcgm.gpu-operator.svc; they reported connectivity issues until the network policy was patched.
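A quick way to confirm the symptom is a plain TCP probe run from inside a gpu-health-monitor pod. The snippet below is only a sketch: the helper name `can_connect` and the 3-second timeout are illustrative choices and not part of NVSentinel. It only tests whether a TCP connection can be opened, so it cannot tell a NetworkPolicy drop apart from a DNS failure or a host engine that is not running.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration. DNS failures, refused connections,
    and timeouts (for example, packets dropped by a NetworkPolicy) are all
    subclasses of OSError, so each of them yields False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("nvidia-dcgm.gpu-operator.svc", 5555)` from the affected pod would be expected to return `False` while the policy is dropping the traffic, and `True` once the port is allowed, assuming the service name and default port from this report.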

Component

Health Monitor

Steps to Reproduce

Deploy NVSentinel with global.dcgm.enabled and gpu-health-monitor.dcgm.dcgmK8sServiceEnabled set and a DCGM service endpoint configured. Expected behavior: the chart's metrics-access NetworkPolicy allows ingress on that DCGM port so that gpu-health-monitor can talk to the DCGM host engine. Actual behavior: the port is not allowed, and gpu-health-monitor cannot connect.
Network Policy Link

Environment

  • NVSentinel version:
  • Kubernetes version:
  • Deployment method:

Logs/Output

No response

Labels: bug
