[Feature]: Infiniband health monitor #642

@MaxFedotov

Prerequisites

  • I searched existing issues

Feature Summary

Hey folks, I don't know whether this is in scope or on the roadmap for NVSentinel, but I would like to start a discussion about InfiniBand health monitoring.

InfiniBand is also a very important part of HPC, and its health and performance directly influence the workloads running on the nodes.

Problem/Use Case

We use gpud together with node-problem-detector and draino for health checking, setting node conditions, and draining, and we plan to investigate NVSentinel as a single replacement for these components.

One of the great features gpud supports is the ability to run InfiniBand checks and report interfaces that are either down or flapping between the active and down states. This allows us to cordon the problematic nodes and evict their workloads until the problem is investigated and fixed.

Proposed Solution

Introduce a new infiniband-health-monitor for NVSentinel (possibly integrating the gpud health-check logic?) that implements checks for IB ports that are down or flapping.
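
To make the two checks concrete, here is a rough sketch of what "port down" and "port flapping" detection could look like. It is not gpud's or NVSentinel's actual logic. It assumes the standard Linux sysfs layout, where `/sys/class/infiniband/<device>/ports/<port>/state` contains text such as `4: ACTIVE`. The names `read_port_states` and `PortFlapDetector`, and the `window_s` and `max_drops` thresholds, are hypothetical and chosen only for illustration.

```python
from collections import deque
from pathlib import Path

SYSFS_IB = Path("/sys/class/infiniband")  # standard Linux sysfs location


def parse_port_state(raw: str) -> str:
    """Turn sysfs text such as '4: ACTIVE' into 'ACTIVE'."""
    return raw.strip().split(":", 1)[-1].strip().upper()


def read_port_states(root: Path = SYSFS_IB) -> dict[str, str]:
    """Return {'<device>/<port>': '<STATE>'} for every port found under root."""
    states = {}
    for state_file in sorted(root.glob("*/ports/*/state")):
        # Path shape: <root>/<device>/ports/<port>/state
        port = f"{state_file.parents[2].name}/{state_file.parent.name}"
        states[port] = parse_port_state(state_file.read_text())
    return states


class PortFlapDetector:
    """Hypothetical helper: counts ACTIVE -> non-ACTIVE drops in a time window."""

    def __init__(self, window_s: float = 300.0, max_drops: int = 3):
        self.window_s = window_s    # assumed look-back window, in seconds
        self.max_drops = max_drops  # assumed drop count that means "flapping"
        self._last = {}             # port -> last observed state
        self._drops = {}            # port -> timestamps of recent drops

    def observe(self, port: str, state: str, now: float) -> str:
        """Record one sample and return 'healthy', 'down' or 'flapping'."""
        drops = self._drops.setdefault(port, deque())
        if self._last.get(port) == "ACTIVE" and state != "ACTIVE":
            drops.append(now)
        self._last[port] = state
        # Forget drops that fell out of the look-back window.
        while drops and now - drops[0] > self.window_s:
            drops.popleft()
        if len(drops) >= self.max_drops:
            return "flapping"
        return "healthy" if state == "ACTIVE" else "down"
```

A monitor loop would call `read_port_states()` on an interval, pass each sample to `observe()` with a monotonic timestamp, and raise a node condition when a port reports `down` or `flapping`. Keeping the flapping verdict until the drops age out of the window means a port that is momentarily back to ACTIVE is not reported healthy straight away.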

Component

New Component

Metadata

Labels

enhancement: New feature or request
