Feat: Add api to get machines with leaks by srinivasadmurthy · Pull Request #570 · NVIDIA/ncx-infra-controller-core

srinivasadmurthy · 2026-03-16T05:46:26Z

Description

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes

Tested by enabling the debug features cpu2temp_alert and leak_alert in crates/health/Cargo.toml.
Enabling them generates the relevant overrides; I then used grpcurl to test the GetHardwareLeaksReport API.

copy-pr-bot · 2026-03-16T05:46:30Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

kensimon · 2026-03-16T13:50:03Z

crates/api-db/src/machine.rs

+    WHERE m.health_report_overrides->'merges' ? 'hardware-health.tray-leak-detection'
+    AND EXISTS (
+        SELECT 1 FROM jsonb_array_elements(
+            m.health_report_overrides->'merges'->'hardware-health.tray-leak-detection'->'alerts'


Could you make sure there's an appropriate GIN index on the health_report_overrides column so that this doesn't require a sequential scan of all machines? It'll matter a lot for the larger environments.

I'm less worried about the scan. What I'm more worried about is that the implementation won't find any alerts that operators have added manually under a different override name. It doesn't search for the probe ID; it searches for the reporter.

I'm worried about that and the scan. :-D

(But really, sequentially scanning through loads of JSON is measurably slow and we need to avoid it, especially if we're going to be monitoring this API at a high frequency.)

This API is for use by RLA. The health monitor in Carbide scrapes BMC sensors and detects compute tray leaks. Once a leak is detected, it places a health override with the Leak classification. RLA needs to query Carbide periodically for leaking machines and then act on them. The returned data includes the leaking machine IDs and their current power state. For each machine with a leak, RLA will issue two calls: UpdatePowerOptions to set the desired machine state to OFF, and then AdminPowerControl to switch off the machine. Since this is meant to respond to leaks reported by the health monitor, it's not a general-purpose search routine. Since responding to leaks needs to be fast, it's better to have a single API call that gives RLA all the information it needs, rather than getting machine IDs first with a filter and then calling GetPowerOptions.
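To make the flow described above concrete, here is a minimal sketch of one polling pass on the RLA side. Everything in it is an assumption made for illustration: the `Carbide` trait, `LeakingMachine`, `PowerState`, and `FakeCarbide` are hypothetical stand-ins, and only the RPC names (GetHardwareLeaksReport, UpdatePowerOptions, AdminPowerControl) come from this thread. The real request and response types are likely to differ.

```rust
// Hypothetical sketch: these types are not taken from the PR. They only
// model the polling flow described in the comment above.

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PowerState {
    On,
    Off,
}

/// One entry of the leak report: machine ID plus last known power state.
#[derive(Clone, Debug)]
pub struct LeakingMachine {
    pub machine_id: String,
    pub power_state: PowerState,
}

/// Stand-in for a Carbide client. Method names mirror the RPCs named in
/// the thread; the signatures are assumptions.
pub trait Carbide {
    fn get_hardware_leaks_report(&mut self) -> Vec<LeakingMachine>;
    fn update_power_options(&mut self, machine_id: &str, desired: PowerState);
    fn admin_power_control(&mut self, machine_id: &str, target: PowerState);
}

/// Runs one polling pass and returns the IDs of machines it powered off.
pub fn handle_leaks<C: Carbide>(client: &mut C) -> Vec<String> {
    let mut powered_off = Vec::new();
    for machine in client.get_hardware_leaks_report() {
        // Skip machines that are already off, so no redundant calls are made.
        if machine.power_state == PowerState::Off {
            continue;
        }
        // First record the desired state, then actually switch the machine off.
        client.update_power_options(&machine.machine_id, PowerState::Off);
        client.admin_power_control(&machine.machine_id, PowerState::Off);
        powered_off.push(machine.machine_id);
    }
    powered_off
}

/// In-memory fake that records which calls were issued, in order.
#[derive(Default)]
pub struct FakeCarbide {
    pub leaks: Vec<LeakingMachine>,
    pub calls: Vec<String>,
}

impl Carbide for FakeCarbide {
    fn get_hardware_leaks_report(&mut self) -> Vec<LeakingMachine> {
        self.leaks.clone()
    }

    fn update_power_options(&mut self, machine_id: &str, desired: PowerState) {
        self.calls
            .push(format!("UpdatePowerOptions({machine_id}, {desired:?})"));
    }

    fn admin_power_control(&mut self, machine_id: &str, target: PowerState) {
        self.calls
            .push(format!("AdminPowerControl({machine_id}, {target:?})"));
    }
}
```

In this shape the power-state check lives in the caller, which is the reason the report carries the power state at all. If the server only returned machines that are both leaking and powered on, the `continue` branch would disappear and the machine ID alone would be enough.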

> Since responding to leaks needs to be fast, it's better to have a single API call that gives RLA all the information it needs, rather than getting machine IDs first with a filter and then calling GetPowerOptions.

We could make that argument for every single workflow and add more methods that need maintenance. But it's better to optimize once we have determined that it actually makes a difference.

Based on what I understand so far, the cost of a FindMachineIds call and a GetHardwareLeaksReport call would be the same. And in the typical case they return 0 results.
If they did return a result, a FindMachinesByIds call might add 1-100ms of latency. But does that make any difference, given the frequency of the initial call we lower? And is it even required? If you just want to turn the machine off, the MachineId should be enough information.

Added a GIN index on the health_report_overrides column of the machines table.
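One caveat worth checking against the actual migration, sketched here under stated assumptions: with PostgreSQL's default jsonb GIN operator class, an index on the column itself can serve `@>` containment on that column, but a predicate written as `health_report_overrides->'merges' ? '...'` applies `?` to an expression rather than to the indexed column, so it would need either an expression index or a containment pre-filter. The index name and the helper `leak_prefilter` below are hypothetical and not taken from the PR.

```rust
/// Hypothetical migration statement; the index name is illustrative only.
pub const CREATE_GIN_INDEX: &str =
    "CREATE INDEX IF NOT EXISTS machines_health_report_overrides_gin \
     ON machines USING GIN (health_report_overrides)";

/// Override source name used by the queries in this PR.
pub const TRAY_LEAK_DETECTION_SOURCE: &str = "hardware-health.tray-leak-detection";

/// Builds a containment pre-filter that a GIN index on the whole column can
/// serve: it matches rows whose `merges` object has an entry (any object
/// value) under `source`.
///
/// Assumption: `source` contains no quotes or other JSON/SQL special
/// characters. Real code should bind the JSON document as a query parameter
/// instead of formatting it into the SQL text.
pub fn leak_prefilter(source: &str) -> String {
    format!(r#"m.health_report_overrides @> '{{"merges": {{"{source}": {{}}}}}}'::jsonb"#)
}
```

The pre-filter only narrows the candidate rows. The `EXISTS (... jsonb_array_elements ...)` check for the `Leak` classification would still run afterwards, but only on rows that have a tray-leak override at all. Running `EXPLAIN` on the final query is the reliable way to confirm that the sequential scan is gone.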

@Matthias247 Currently, this returns the machine ID and current power state, so that we don't turn off machines that are already off. I will change it so that FindMachineIds gets a filter that returns the IDs of machines that have leaks and are powered on. That will return just the machine IDs that need to be taken care of. Thanks for the suggestion.

kensimon · 2026-03-16T13:54:29Z

crates/api-db/src/machine.rs

+        let query = r#"
+            SELECT m.id, m.health_report_overrides, po.last_fetched_power_state
+            FROM machines m
+            LEFT JOIN power_options po ON m.id = po.host_id
+            WHERE m.id = ANY($1::varchar[])
+            AND m.health_report_overrides->'merges' ? 'hardware-health.tray-leak-detection'
+            AND EXISTS (
+                SELECT 1 FROM jsonb_array_elements(
+                    m.health_report_overrides->'merges'->'hardware-health.tray-leak-detection'->'alerts'
+                ) AS alert
+                WHERE alert->'classifications' ? 'Leak'
+            )
+            "#.to_string();


I think you could still use the base query here and simply concatenate AND m.id = ANY($1::varchar[]) onto it, e.g.:

Suggested change

-        let query = r#"
-            SELECT m.id, m.health_report_overrides, po.last_fetched_power_state
-            FROM machines m
-            LEFT JOIN power_options po ON m.id = po.host_id
-            WHERE m.id = ANY($1::varchar[])
-            AND m.health_report_overrides->'merges' ? 'hardware-health.tray-leak-detection'
-            AND EXISTS (
-                SELECT 1 FROM jsonb_array_elements(
-                    m.health_report_overrides->'merges'->'hardware-health.tray-leak-detection'->'alerts'
-                ) AS alert
-                WHERE alert->'classifications' ? 'Leak'
-            )
-            "#.to_string();
+        lazy_static! {
+            static ref query: String = format!(
+                "{} AND m.id = ANY($1::varchar[])",
+                LEAK_DETECTION_QUERY_BASE,
+            );
+        }
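A dependency-free variant of the same idea, as a sketch only: it uses `std::sync::OnceLock` in place of `lazy_static!` so that it compiles on its own. The body of `LEAK_DETECTION_QUERY_BASE` is an assumption reconstructed from the diff hunks quoted in this thread, and the function name `leak_detection_query_by_ids` is hypothetical.

```rust
use std::sync::OnceLock;

/// Assumed shape of the shared base query, reconstructed from the diff
/// hunks in this thread. It ends inside the WHERE clause so that callers
/// can append further `AND ...` conditions.
pub const LEAK_DETECTION_QUERY_BASE: &str = r#"
    SELECT m.id, m.health_report_overrides, po.last_fetched_power_state
    FROM machines m
    LEFT JOIN power_options po ON m.id = po.host_id
    WHERE m.health_report_overrides->'merges' ? 'hardware-health.tray-leak-detection'
    AND EXISTS (
        SELECT 1 FROM jsonb_array_elements(
            m.health_report_overrides->'merges'->'hardware-health.tray-leak-detection'->'alerts'
        ) AS alert
        WHERE alert->'classifications' ? 'Leak'
    )"#;

/// Returns the base query narrowed to a list of machine IDs bound as `$1`.
/// The string is built once on first use and reused afterwards.
pub fn leak_detection_query_by_ids() -> &'static str {
    static QUERY: OnceLock<String> = OnceLock::new();
    QUERY.get_or_init(|| {
        format!(
            "{} AND m.id = ANY($1::varchar[])",
            LEAK_DETECTION_QUERY_BASE
        )
    })
}
```

Keeping the leak predicate in one constant means the "all machines" and "machines by ID" variants cannot drift apart when the JSON path or classification name changes.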

kensimon · 2026-03-16T14:01:14Z

crates/api/src/handlers/health.rs

+    let leakt_reports = machines_with_leaks
+        .into_iter()
+        .filter_map(|(machine_id, power_state, overrides)| {
+            let report = overrides?.merges.get(TRAY_LEAK_DETECTION_SOURCE).cloned()?;


Nit: You can avoid the clone by removing from the map (which doesn't rehash the map or anything, .remove() is fast because it just leaves a placeholder behind) since the overrides are owned here and not used again:

Suggested change

-            let report = overrides?.merges.get(TRAY_LEAK_DETECTION_SOURCE).cloned()?;
+            let report = overrides?.merges.remove(TRAY_LEAK_DETECTION_SOURCE)?;
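A self-contained sketch of why the `remove` form works without a clone. The struct definitions below are simplified, hypothetical stand-ins for the real override types, and `HealthReport` deliberately does not implement `Clone` to show that ownership is moved out of the map rather than copied.

```rust
use std::collections::HashMap;

pub const TRAY_LEAK_DETECTION_SOURCE: &str = "hardware-health.tray-leak-detection";

/// Simplified stand-in for a health report. Intentionally not `Clone`.
#[derive(Debug, PartialEq)]
pub struct HealthReport {
    pub alerts: Vec<String>,
}

/// Simplified stand-in for the overrides column: reports keyed by source.
pub struct HealthReportOverrides {
    pub merges: HashMap<String, HealthReport>,
}

/// Keeps only machines that have a tray-leak report and moves that report
/// out of the owned map instead of cloning it.
pub fn extract_leak_reports(
    rows: Vec<(String, Option<HealthReportOverrides>)>,
) -> Vec<(String, HealthReport)> {
    rows.into_iter()
        .filter_map(|(machine_id, overrides)| {
            // `overrides?` yields an owned temporary, so calling `remove`
            // (which takes `&mut self`) on its `merges` field is allowed.
            let report = overrides?.merges.remove(TRAY_LEAK_DETECTION_SOURCE)?;
            Some((machine_id, report))
        })
        .collect()
}
```

Rows with no overrides, or with overrides that lack the tray-leak entry, are dropped by the two `?` operators, which matches the behavior of the original `get(...).cloned()?` form.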

kensimon · 2026-03-16T14:02:00Z

crates/api/src/handlers/health.rs

+        .collect();
+
+    Ok(Response::new(rpc::HardwareLeaksReportResponse {
+        leakt_reports,


Nit: typo

Suggested change

-        leakt_reports,
+        leak_reports,

kensimon · 2026-03-16T14:02:39Z

crates/rpc/proto/forge.proto

+
+// Returns a list of leak_reports with leak type of alert in them
+message HardwareLeaksReportResponse {
+  repeated HardwareMachineLeaks leakt_reports = 1;


Typo:

Suggested change

-  repeated HardwareMachineLeaks leakt_reports = 1;
+  repeated HardwareMachineLeaks leak_reports = 1;

kensimon · 2026-03-16T14:10:58Z

crates/health/Cargo.toml

+default = []
 bench-hooks = []
+cpu2temp_alert = []
+leak_alert = []


Please don't use crate features for these, they make the code a lot harder to maintain... we should basically never use crate features except for very particular reasons, see https://github.com/NVIDIA/bare-metal-manager-core/blob/main/STYLE_GUIDE.md#crate-features

What you could do instead is add some mock overrides under #[cfg(test)], something like:

pub struct LeakEventProcessor {
    minimum_alerts_per_report: usize,
    #[cfg(test)]
    pub mock_alerts: Option<Vec<HealthReportAlert>>,
}

impl LeakEventProcessor {
    pub fn new(minimum_alerts_per_report: usize) -> Self {
        Self {
            minimum_alerts_per_report,
            #[cfg(test)]
            mock_alerts: None,
        }
    }
    // ...
}

Then in process_event:

let leak_alerts: Vec<&HealthReportAlert> = report
    .alerts
    .iter()
    .filter(|alert| is_leak_detector_alert(alert))
    .collect();

#[cfg(test)]
let leak_alerts = if let Some(mock_alerts) = self.mock_alerts.as_ref() {
    [leak_alerts, mock_alerts.iter().collect()].concat()
} else {
    leak_alerts
};

Then in any tests which need to assert on alert behavior, you can do leak_event_processor.mock_alerts = Some(vec![...]) to inject mock alerts and test the behavior that way.

These are not for unit tests. I used them in the dev env on my SCN node to generate alerts. I think I will just clean up the code.

Matthias247

I don't know about the exact use-case for this.

But I'd prefer not to add APIs that search for specific alert types, and would rather extend the search filter passed to FindMachineIds to support searching by health probe IDs. That would be more universal and would require no new API.

Feat: Add api to get machines with leaks

41e6276

srinivasadmurthy requested a review from a team as a code owner March 16, 2026 05:46

srinivasadmurthy requested review from FrankSpitulski and yoks March 16, 2026 05:46

kensimon requested changes Mar 16, 2026

Matthias247 reviewed Mar 16, 2026

srinivasadmurthy added 2 commits March 16, 2026 23:47

Feat: Add api to get machines with leaks

8ba7ebe

Merge remote-tracking branch 'remotes/origin/main' into sdmrlav2

d9b5943
