Feat: Add api to get machines with leaks#570
Feat: Add api to get machines with leaks#570srinivasadmurthy wants to merge 3 commits intoNVIDIA:mainfrom
Conversation
| WHERE m.health_report_overrides->'merges' ? 'hardware-health.tray-leak-detection' | ||
| AND EXISTS ( | ||
| SELECT 1 FROM jsonb_array_elements( | ||
| m.health_report_overrides->'merges'->'hardware-health.tray-leak-detection'->'alerts' |
There was a problem hiding this comment.
Could you make sure there's an appropriate GIN index on the health_report_overrides column so that this doesn't require a sequential scan of all machines? It'll matter a lot for the larger environments.
There was a problem hiding this comment.
I'm less worried about the scan. What I'm more worried is that the implementation won't find any alerts that have been manually added by operators with a different override name. It doesn't search for the probe ID, it searches for the reporter.
There was a problem hiding this comment.
I'm worried about that and the scan. :-D
(But really, sequentially scanning through loads of JSON is measurably slow and we need to avoid it, especially if we're going to be monitoring this API at a high frequency.)
There was a problem hiding this comment.
This API is for use by RLA. Health monitor in carbide is scraping BMC sensors and detecting compute tray leaks. Once a leak is detected, it's placing a healthoverride with Leaks classification. RLA needs to query Carbide for leaking machines periodically, and then act on that. The returned data includes the leaking machine IDs, and their current power state. For each machine with a leak, RLA will issue two calls: UpdatePowerOptions to set the desired machine state to OFF, and then call AdminPowerControl to switch off the machine. Since this is supposed to respond to leaks reported by health monitor, it's not a general purpose search routine. Since responding to leaks needs to be fast, it's better to have a single API call that gives RLA all the information it needs, rather than getting Machine IDs first with a filter, and then call GetPowerOptions.
There was a problem hiding this comment.
Since responding to leaks needs to be fast, it's better to have a single API call that gives RLA all the information it needs, rather than getting Machine IDs first with a filter, and then call GetPowerOptions.
We could make that argument for every single workflow and add additional methods that need maintenance. But it's better to optimize when we determined it actually makes a difference.
Based on what I understand so far the cost of a FindMachineIds and GetHardwareLeaksReport call would be the same. And in the typical case they return 0 results.
If they would return a result, then a FindMachinesByIds call might add 1-100ms latency. But does that make any difference, given the frequency of the initial call we lower? And is it even required? If you just wanted to turn the machine off, then the MachineId should be enough information.
There was a problem hiding this comment.
Added a GIN index on machines table for health_report_overrides column
There was a problem hiding this comment.
@Matthias247 Currently, this is returning machine id and current power state, so that we don't turn off machines that are already off. So I will change it so that I add a filter to FindMachineIds that returns machine ids of machines with leaks and which are on. That will return just the machine ids that need to be taken care of. Thanks for the suggestion.
crates/api-db/src/machine.rs
Outdated
| let query = r#" | ||
| SELECT m.id, m.health_report_overrides, po.last_fetched_power_state | ||
| FROM machines m | ||
| LEFT JOIN power_options po ON m.id = po.host_id | ||
| WHERE m.id = ANY($1::varchar[]) | ||
| AND m.health_report_overrides->'merges' ? 'hardware-health.tray-leak-detection' | ||
| AND EXISTS ( | ||
| SELECT 1 FROM jsonb_array_elements( | ||
| m.health_report_overrides->'merges'->'hardware-health.tray-leak-detection'->'alerts' | ||
| ) AS alert | ||
| WHERE alert->'classifications' ? 'Leak' | ||
| ) | ||
| "#.to_string(); |
There was a problem hiding this comment.
I think you could still use the base query here with simple concatenation with AND m.id = ANY($1::varchar[]), eg:
| let query = r#" | |
| SELECT m.id, m.health_report_overrides, po.last_fetched_power_state | |
| FROM machines m | |
| LEFT JOIN power_options po ON m.id = po.host_id | |
| WHERE m.id = ANY($1::varchar[]) | |
| AND m.health_report_overrides->'merges' ? 'hardware-health.tray-leak-detection' | |
| AND EXISTS ( | |
| SELECT 1 FROM jsonb_array_elements( | |
| m.health_report_overrides->'merges'->'hardware-health.tray-leak-detection'->'alerts' | |
| ) AS alert | |
| WHERE alert->'classifications' ? 'Leak' | |
| ) | |
| "#.to_string(); | |
| lazy_static! { | |
| static ref query: String = format!( | |
| "{} AND m.id = ANY($1::varchar[])", | |
| LEAK_DETECTION_QUERY_BASE, | |
| ); | |
| } |
crates/api/src/handlers/health.rs
Outdated
| let leakt_reports = machines_with_leaks | ||
| .into_iter() | ||
| .filter_map(|(machine_id, power_state, overrides)| { | ||
| let report = overrides?.merges.get(TRAY_LEAK_DETECTION_SOURCE).cloned()?; |
There was a problem hiding this comment.
Nit: You can avoid the clone by removing from the map (which doesn't rehash the map or anything, .remove() is fast because it just leaves a placeholder behind) since the overrides are owned here and not used again:
| let report = overrides?.merges.get(TRAY_LEAK_DETECTION_SOURCE).cloned()?; | |
| let report = overrides?.merges.remove(TRAY_LEAK_DETECTION_SOURCE)?; |
crates/api/src/handlers/health.rs
Outdated
| .collect(); | ||
|
|
||
| Ok(Response::new(rpc::HardwareLeaksReportResponse { | ||
| leakt_reports, |
There was a problem hiding this comment.
Nit: typo
| leakt_reports, | |
| leak_reports, |
crates/rpc/proto/forge.proto
Outdated
|
|
||
| // Returns a list of leak_reports with leak type of alert in them | ||
| message HardwareLeaksReportResponse { | ||
| repeated HardwareMachineLeaks leakt_reports = 1; |
There was a problem hiding this comment.
Typo:
| repeated HardwareMachineLeaks leakt_reports = 1; | |
| repeated HardwareMachineLeaks leak_reports = 1; |
crates/health/Cargo.toml
Outdated
| default = [] | ||
| bench-hooks = [] | ||
| cpu2temp_alert = [] | ||
| leak_alert = [] |
There was a problem hiding this comment.
Please don't use crate features for these, they make the code a lot harder to maintain... we should basically never use crate features except for very particular reasons, see https://github.com/NVIDIA/bare-metal-manager-core/blob/main/STYLE_GUIDE.md#crate-features
What you could do instead is add some mock overrides under #[cfg(test)], something like:
pub struct LeakEventProcessor {
minimum_alerts_per_report: usize,
#[cfg(test)]
pub mock_alerts: Option<Vec<HealthReportAlert>>,
}
impl LeakEventProcessor {
pub fn new(minimum_alerts_per_report: usize) -> Self {
Self {
minimum_alerts_per_report,
#[cfg(test)]
mock_alerts: None,
}
}
// ...
}
Then in process_event:
let leak_alerts: Vec<&HealthReportAlert> = report
.alerts
.iter()
.filter(|alert| is_leak_detector_alert(alert))
.collect();
#[cfg(test)]
let leak_alerts = if let Some(mock_alerts) = self.mock_alerts.as_ref() {
[leak_alerts, mock_alerts.iter().collect()].concat()
} else {
leak_alerts
};
Then in any tests which need to assert on alert behavior, you can do leak_event_processor.mock_alerts = Some(vec![...]) to inject mock alerts and test the behavior that way.
There was a problem hiding this comment.
These are not for unit tests. I used this in dev env on my SCN node to generate alerts. I think I will just clean up the code.
Matthias247
left a comment
There was a problem hiding this comment.
I don't know about the exact use-case for this.
But I'd prefer not to add APIs for searching for additional alerts for specific alert types, and instead rather extending the search filter passed to FindMachineIds to support searching by health probe IDs. That would be more universal and requires no new API.
Description
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes
Tested by setting debug features cpu2temp_alert and leak_alert in crates/health/Cargo.toml.
Setting these generate relevant overrides and used grpcurl to test GetHardwareLeaksReport API.