
fix: Add IbPortDown alert for machines with down IB ports#519

Open

hasayesh wants to merge 1 commit into NVIDIA:main from hasayesh:nvbug-5866723

Conversation


@hasayesh hasayesh commented Mar 11, 2026

Description

When the IB Fabric Monitor detects ports not in Active state, it now sets a PreventAllocations health alert on the affected machine. This prevents Carbide from attempting to allocate instances on machines with degraded IB connectivity, avoiding SRE alerts.

  • Add HealthProbeId::ib_port_down() and HealthProbeAlert::ib_port_down()
  • Detect ports not in Active state during IB fabric monitoring
  • Set/clear IbPortDown health alert via health report overrides
  • Update existing test to expect health alert blocking
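The set/clear behavior described above can be sketched roughly as follows. All type and function names here are hypothetical stand-ins for illustration, not Carbide's actual API:

```rust
// Hypothetical sketch of the alert decision; `PortState` and
// `HealthAlert` are illustrative names, not the actual Carbide types.
#[derive(Debug, PartialEq)]
enum PortState {
    Active,
    Init,
    Down,
}

#[derive(Debug, PartialEq)]
enum HealthAlert {
    IbPortDown { inactive: usize },
}

// Returns Some(alert) when any monitored port is not Active, and None
// when all ports are Active (None would clear a previously set override).
fn ib_port_alert(ports: &[PortState]) -> Option<HealthAlert> {
    let inactive = ports.iter().filter(|p| **p != PortState::Active).count();
    if inactive > 0 {
        Some(HealthAlert::IbPortDown { inactive })
    } else {
        None
    }
}
```

The key point is that the monitor reports `None` once all ports return to Active, which is what lets the health report override be cleared again.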

Type of Change

  • [ ] Add - New feature or capability
  • [ ] Change - Changes in existing functionality
  • [x] Fix - Bug fixes
  • [ ] Remove - Removed features or deprecated functionality
  • [ ] Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • [ ] This PR contains breaking changes

Testing

  • [x] Unit tests added/updated
  • [x] Integration tests added/updated
  • [x] Manual testing performed
  • [ ] No testing required (docs, internal refactor, etc.)

Additional Notes

@hasayesh hasayesh requested a review from a team as a code owner March 11, 2026 04:25
@github-actions

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-03-11 04:27:25 UTC | Commit: eb88f65


github-actions bot commented Mar 11, 2026

🛡️ Vulnerability Scan

🚨 Found 72 vulnerability(ies)
📊 vs main: 72 (no change)

Severity Breakdown:

  • 🔴 Critical/High: 72
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0

🔗 View full details in Security tab

🕐 Last updated: 2026-03-17 01:41:37 UTC | Commit: e4e83f8

@hasayesh hasayesh requested a review from Matthias247 March 11, 2026 16:51
@Matthias247
Contributor

It needs to take the SKU into account. There are hosts which intentionally have multiple ports disconnected. If those would all have PreventAllocations set, none of them would be usable anymore.

@hasayesh hasayesh requested a review from wminckler March 11, 2026 22:22
@hasayesh
Contributor Author

This is what I suggest:

  • Update the L40 machines to have 2 IBs in the SKU.
  • Then go from most to least specific:
    • If the machine has been assigned an instance type, use that.
    • If not, use the SKU.
    • If the SKU does not exist (during early deployment), skip the check.

Please let me know if this is reasonable.
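The most-to-least-specific fallback proposed above might look roughly like this. The function and parameter names are assumptions for illustration, not the actual implementation:

```rust
// Illustrative precedence for deciding how many IB ports to expect:
// the assigned instance type wins over the SKU, and None means no data
// is available, so the check should be skipped entirely.
fn expected_ib_port_count(
    instance_type_ports: Option<u32>, // from the assigned instance type, if any
    sku_ports: Option<u32>,           // from the SKU definition, if present
) -> Option<u32> {
    instance_type_ports.or(sku_ports)
}
```

`Option::or` encodes the precedence directly: the first `Some` in most-to-least-specific order wins.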

@hasayesh
Contributor Author

> This is what I suggest: Update the L40 machines to have 2 IBs in the SKU. Then we go from most to least specific: if the machine has been assigned an instance type, use that; if not, use the SKU; if the SKU does not exist (during early deployment), skip the check.
>
> Please let me know if this is reasonable.

OK, my bad: the JSON shows the inactive ports properly populated. Will update accordingly. Thanks.

///
/// Returns:
/// - `Ok(None)` if no SKU is assigned, SKU not found, or SKU has no infiniband_devices defined
/// (in these cases, IB port down monitoring should be skipped)
Contributor

How come we don't want to monitor if we don't know the SKU of the machine?

What if we just returned an empty set in this case? That would mean "this machine does not expect any ports to be inactive", and we'd alert if the port is down. I feel that would be a more "safe" default if we don't know about the machine SKU, than avoiding alerting altogether.

(Maybe this question is for @Matthias247 who suggested checking the SKU. How do you feel about the default behavior if the SKU isn't found?)
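The "empty set" default discussed here could be sketched as below. The function name and the idea of a set of ports allowed to be inactive are illustrative assumptions, not the actual implementation:

```rust
use std::collections::HashSet;

// Illustrative sketch of the reviewer's proposed default: with no SKU
// data, return an empty "allowed to be inactive" set, so any down port
// raises an alert. Names here are hypothetical.
fn allowed_inactive_ports(sku_ports: Option<&[&str]>) -> HashSet<String> {
    match sku_ports {
        Some(ports) => ports.iter().map(|p| p.to_string()).collect(),
        // Unknown SKU: expect no inactive ports at all, which makes
        // alerting on any down port the "safe" default.
        None => HashSet::new(),
    }
}
```

The trade-off either way: this default alerts on machines that legitimately have disconnected ports but no SKU defined, which is the concern raised in the reply below.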

Contributor Author

@hasayesh hasayesh Mar 12, 2026

I looked at the two flags:

bom_validation.enabled
allow_allocation_on_validation_failure

If the operator explicitly disables bom_validation.enabled, isn't the point to skip the validation? To me, both should be enabled if the strict behavior is required; otherwise, what is the point of these flags?

Contributor

I think we're talking about different things.

I'm talking about what the fallback behavior is if we see an IB port down, but we're in an environment without BOM validation. I don't think we should skip alerting if that's the case.

To me, skipping BOM validation doesn't mean you don't care about whether IB ports are down. Unless I'm missing something?

Contributor

> What if we just returned an empty set in this case? That would mean "this machine does not expect any ports to be inactive", and we'd alert if the port is down.

I know of a few deployments which don't have SKUs/BOMs set up, but where not all ports are connected. For these, the alarms would go off.
If we wanted to avoid that, then only taking SKUs into account when they are actually defined is OK.
But maybe this would also be a good way to "motivate" operators to set up SKUs?

I'd leave the decision up to @ajf and SRE teams.

Contributor

@Matthias247 Matthias247 left a comment

Haven't looked at the implementation in depth. I think @kensimon already did look deeper or can support more.

One question is probably whether the health alert is set in the IB monitor or in the main state handler. E.g., we currently set the "dpu heartbeat timeout" one in the main state handler. Both will work, and I personally don't have a strong preference (but maybe @kensimon or @chet might?).

If you implement the optimization pointed out in the other comment, then the check inside the monitor might be a bit more efficient, since we don't need to load SKU data in the state handler.

Signed-off-by: Hamid Asayesh <hasayesh@nvidia.com>

Apply suggestion from @kensimon

Co-authored-by: Ken Simon <ken@kensimon.io>
Signed-off-by: Hamid Asayesh <162524665+hasayesh@users.noreply.github.com>

Apply suggestion from @kensimon

Signed-off-by: Hamid Asayesh <162524665+hasayesh@users.noreply.github.com>