Skip to content

Feat: Add api to get machines with leaks#570

Open
srinivasadmurthy wants to merge 3 commits intoNVIDIA:mainfrom
srinivasadmurthy:sdmrlav2
Open

Feat: Add api to get machines with leaks#570
srinivasadmurthy wants to merge 3 commits intoNVIDIA:mainfrom
srinivasadmurthy:sdmrlav2

Conversation

@srinivasadmurthy
Copy link
Contributor

Description

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Tested by setting debug features cpu2temp_alert and leak_alert in crates/health/Cargo.toml.
Setting these generate relevant overrides and used grpcurl to test GetHardwareLeaksReport API.

@srinivasadmurthy srinivasadmurthy requested a review from a team as a code owner March 16, 2026 05:46
@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

WHERE m.health_report_overrides->'merges' ? 'hardware-health.tray-leak-detection'
AND EXISTS (
SELECT 1 FROM jsonb_array_elements(
m.health_report_overrides->'merges'->'hardware-health.tray-leak-detection'->'alerts'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make sure there's an appropriate GIN index on the health_report_overrides column so that this doesn't require a sequential scan of all machines? It'll matter a lot for the larger environments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm less worried about the scan. What I'm more worried is that the implementation won't find any alerts that have been manually added by operators with a different override name. It doesn't search for the probe ID, it searches for the reporter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about that and the scan. :-D

(But really, sequentially scanning through loads of JSON is measurably slow and we need to avoid it, especially if we're going to be monitoring this API at a high frequency.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is for use by RLA. Health monitor in carbide is scraping BMC sensors and detecting compute tray leaks. Once a leak is detected, it's placing a healthoverride with Leaks classification. RLA needs to query Carbide for leaking machines periodically, and then act on that. The returned data includes the leaking machine IDs, and their current power state. For each machine with a leak, RLA will issue two calls: UpdatePowerOptions to set the desired machine state to OFF, and then call AdminPowerControl to switch off the machine. Since this is supposed to respond to leaks reported by health monitor, it's not a general purpose search routine. Since responding to leaks needs to be fast, it's better to have a single API call that gives RLA all the information it needs, rather than getting Machine IDs first with a filter, and then call GetPowerOptions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since responding to leaks needs to be fast, it's better to have a single API call that gives RLA all the information it needs, rather than getting Machine IDs first with a filter, and then call GetPowerOptions.

We could make that argument for every single workflow and add additional methods that need maintenance. But it's better to optimize when we determined it actually makes a difference.

Based on what I understand so far the cost of a FindMachineIds and GetHardwareLeaksReport call would be the same. And in the typical case they return 0 results.
If they would return a result, then a FindMachinesByIds call might add 1-100ms latency. But does that make any difference, given the frequency of the initial call we lower? And is it even required? If you just wanted to turn the machine off, then the MachineId should be enough information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a GIN index on machines table for health_report_overrides column

Copy link
Contributor Author

@srinivasadmurthy srinivasadmurthy Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Matthias247 Currently, this is returning machine id and current power state, so that we don't turn off machines that are already off. So I will change it so that I add a filter to FindMachineIds that returns machine ids of machines with leaks and which are on. That will return just the machine ids that need to be taken care of. Thanks for the suggestion.

Comment on lines +1069 to +1081
let query = r#"
SELECT m.id, m.health_report_overrides, po.last_fetched_power_state
FROM machines m
LEFT JOIN power_options po ON m.id = po.host_id
WHERE m.id = ANY($1::varchar[])
AND m.health_report_overrides->'merges' ? 'hardware-health.tray-leak-detection'
AND EXISTS (
SELECT 1 FROM jsonb_array_elements(
m.health_report_overrides->'merges'->'hardware-health.tray-leak-detection'->'alerts'
) AS alert
WHERE alert->'classifications' ? 'Leak'
)
"#.to_string();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could still use the base query here with simple concatenation with AND m.id = ANY($1::varchar[]), eg:

Suggested change
let query = r#"
SELECT m.id, m.health_report_overrides, po.last_fetched_power_state
FROM machines m
LEFT JOIN power_options po ON m.id = po.host_id
WHERE m.id = ANY($1::varchar[])
AND m.health_report_overrides->'merges' ? 'hardware-health.tray-leak-detection'
AND EXISTS (
SELECT 1 FROM jsonb_array_elements(
m.health_report_overrides->'merges'->'hardware-health.tray-leak-detection'->'alerts'
) AS alert
WHERE alert->'classifications' ? 'Leak'
)
"#.to_string();
lazy_static! {
static ref query: String = format!(
"{} AND m.id = ANY($1::varchar[])",
LEAK_DETECTION_QUERY_BASE,
);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

let leakt_reports = machines_with_leaks
.into_iter()
.filter_map(|(machine_id, power_state, overrides)| {
let report = overrides?.merges.get(TRAY_LEAK_DETECTION_SOURCE).cloned()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: You can avoid the clone by removing from the map (which doesn't rehash the map or anything, .remove() is fast because it just leaves a placeholder behind) since the overrides are owned here and not used again:

Suggested change
let report = overrides?.merges.get(TRAY_LEAK_DETECTION_SOURCE).cloned()?;
let report = overrides?.merges.remove(TRAY_LEAK_DETECTION_SOURCE)?;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

.collect();

Ok(Response::new(rpc::HardwareLeaksReportResponse {
leakt_reports,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: typo

Suggested change
leakt_reports,
leak_reports,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


// Returns a list of leak_reports with leak type of alert in them
message HardwareLeaksReportResponse {
repeated HardwareMachineLeaks leakt_reports = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo:

Suggested change
repeated HardwareMachineLeaks leakt_reports = 1;
repeated HardwareMachineLeaks leak_reports = 1;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

default = []
bench-hooks = []
cpu2temp_alert = []
leak_alert = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't use crate features for these, they make the code a lot harder to maintain... we should basically never use crate features except for very particular reasons, see https://github.com/NVIDIA/bare-metal-manager-core/blob/main/STYLE_GUIDE.md#crate-features

What you could do instead is add some mock overrides under #[cfg(test)], something like:

pub struct LeakEventProcessor {
    minimum_alerts_per_report: usize,
    #[cfg(test)]
    pub mock_alerts: Option<Vec<HealthReportAlert>>,
}

impl LeakEventProcessor {
    pub fn new(minimum_alerts_per_report: usize) -> Self {
        Self {
            minimum_alerts_per_report,
            #[cfg(test)]
            mock_alerts: None,
        }
    }

    // ...
}

Then in process_event:

        let leak_alerts: Vec<&HealthReportAlert> = report
            .alerts
            .iter()
            .filter(|alert| is_leak_detector_alert(alert))
            .collect();

        #[cfg(test)]
        let leak_alerts = if let Some(mock_alerts) = self.mock_alerts.as_ref() {
            [leak_alerts, mock_alerts.iter().collect()].concat()
        } else {
            leak_alerts
        };

Then in any tests which need to assert on alert behavior, you can do leak_event_processor.mock_alerts = Some(vec![...]) to inject mock alerts and test the behavior that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not for unit tests. I used this in dev env on my SCN node to generate alerts. I think I will just clean up the code.

Copy link
Contributor

@Matthias247 Matthias247 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know about the exact use-case for this.

But I'd prefer not to add APIs for searching for additional alerts for specific alert types, and instead rather extending the search filter passed to FindMachineIds to support searching by health probe IDs. That would be more universal and requires no new API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants