Description
Prerequisites
- I searched existing issues
Feature Summary
Hey folks, I don't know whether this is in scope or on the roadmap for NVSentinel, but I would like to start a discussion about InfiniBand health monitoring.
InfiniBand is a very important part of HPC, and its health and performance directly influence the workloads running on the nodes.
Problem/Use Case
We are using gpud together with node-problem-detector and draino for health checking, setting node conditions, and draining, and we are planning to investigate NVSentinel as a single replacement for these components.
One of the great features of gpud is its ability to run InfiniBand checks and report interfaces that are either down or flapping between the active and down states. This allows us to cordon the problematic nodes and evict their workloads until the problem is investigated and fixed.
Proposed Solution
Introduce a new infiniband-health-monitor for NVSentinel (possibly integrating the gpud health-check logic?) that implements checks for IB ports that are down or flapping.
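To make the two checks concrete, here is a rough sketch of what the monitor could do. It is not gpud's or NVSentinel's actual logic. It assumes port state is read from the Linux sysfs files `/sys/class/infiniband/<device>/ports/<port>/state` (whose content looks like `4: ACTIVE`), and the names `FlapDetector`, `window_seconds`, and `max_drops` as well as the threshold values are hypothetical, chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from pathlib import Path

SYSFS_IB = Path("/sys/class/infiniband")  # standard Linux sysfs location


def parse_port_state(raw: str) -> str:
    """Turn sysfs content such as '4: ACTIVE' into 'ACTIVE'."""
    head, sep, tail = raw.strip().partition(":")
    return (tail if sep else head).strip().upper()


def read_port_states(root: Path = SYSFS_IB) -> dict[str, str]:
    """Return {'<device>/<port>': '<STATE>'} for every port found under root."""
    states = {}
    for state_file in sorted(root.glob("*/ports/*/state")):
        device, port = state_file.parts[-4], state_file.parts[-2]
        states[f"{device}/{port}"] = parse_port_state(state_file.read_text())
    return states


@dataclass
class FlapDetector:
    """Hypothetical detector: a port is 'flapping' once it has left ACTIVE
    at least max_drops times within the last window_seconds."""

    window_seconds: float = 300.0  # assumed value, not taken from gpud
    max_drops: int = 3             # assumed value, not taken from gpud
    _last: dict = field(default_factory=dict)
    _drops: dict = field(default_factory=dict)

    def observe(self, port: str, state: str, now: float) -> str:
        """Record one sample and return 'healthy', 'down', or 'flapping'."""
        drops = self._drops.setdefault(port, deque())
        # Count a drop only on the transition from ACTIVE to anything else.
        if self._last.get(port) == "ACTIVE" and state != "ACTIVE":
            drops.append(now)
        self._last[port] = state
        # Forget drops that have aged out of the sliding window.
        while drops and now - drops[0] > self.window_seconds:
            drops.popleft()
        if len(drops) >= self.max_drops:
            return "flapping"
        return "healthy" if state == "ACTIVE" else "down"
```

A monitor loop would call `read_port_states()` on an interval, feed each sample into `observe()` with the current timestamp, and raise a node condition when a port reports `down` or `flapping`. Keeping the flapping verdict until the drops age out of the window stops a port that happens to be up at sample time from being marked healthy too early.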
Component
New Component