Conversation
🔐 TruffleHog Secret Scan: ✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 (Last updated: 2026-03-11 21:51:53 UTC | Commit: 8e7ef0a)
🛡️ Vulnerability Scan: 🚨 Found 72 vulnerability(ies). Severity breakdown and full details in the Security tab. (Last updated: 2026-03-16 20:50:28 UTC | Commit: 384568b)
Matthias247 left a comment:
super quick pass-through.
The ExpectedRack definition, DB model and associated APIs look fine to me.
RackTypeDefinition is something we could think about further. But we can improve it in follow-up PRs too.
    /// Human-readable description of this rack type.
    pub description: String,
I think we might need more info than just the count. E.g. also a model type, in case different components require different handling?
Maybe we model it more like machine_capabilities inside the SKU?
Yup, love this idea. I'll do this now instead of taking an iterative approach on it. Brb!
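To make the idea concrete, here is a minimal sketch of what per-component specs inside the rack type definition might look like, instead of bare counts. All names (`ComponentSpec`, the field names, `expected_total`) are hypothetical, not the PR's actual types:

```rust
/// Hypothetical sketch: each component category carries a model identifier
/// alongside its count, so components with different models can be handled
/// differently (similar in spirit to machine_capabilities inside a SKU).
#[derive(Debug, Clone, PartialEq)]
pub struct ComponentSpec {
    /// Model identifier, so different components can be handled differently.
    pub model: String,
    /// How many of this component the rack type expects.
    pub count: u32,
}

#[derive(Debug, Clone, PartialEq)]
pub struct RackTypeDefinition {
    pub name: String,
    pub description: String,
    pub compute_trays: Vec<ComponentSpec>,
    pub nvlink_switches: Vec<ComponentSpec>,
    pub power_shelves: Vec<ComponentSpec>,
}

impl RackTypeDefinition {
    /// Total expected component count across all categories.
    pub fn expected_total(&self) -> u32 {
        self.compute_trays
            .iter()
            .chain(&self.nvlink_switches)
            .chain(&self.power_shelves)
            .map(|c| c.count)
            .sum()
    }
}
```

The count-only check the rack state controller does today would still work (sum the counts), while handlers gain access to per-model information.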
crates/api-model/src/rack.rs (outdated)
    /// rack_type_definition is the full rack type definition embedded at
    /// the time the expected rack is created or updated. This avoids
    /// runtime lookups against the config file and ensures the rack
    /// retains its definition even if the config changes later.
I guess it's always debatable whether we want consistency (updating the type definition gives you the latest everywhere) or whether we want to keep old values valid as well.
I'd probably lean towards consistency (load the latest RackTypeDefinition) here, mostly because that's the same thing we do for SKUs. The nice thing is that it allows us to add new properties to a rack type definition later.
    let expected_rack = request.into_inner();
    let rack_id = expected_rack
        .rack_id
        .ok_or_else(|| Status::invalid_argument("rack_id is required"))?;
Use CarbideError::InvalidArgument/NotFound/etc. and convert to tonic::Status for all error paths (for consistent error messages and codes).
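A minimal, std-only sketch of that pattern: raise `CarbideError` everywhere and convert once at the API boundary. The `CarbideError` variants and the `Status` type here are stand-ins so the example is self-contained (the real code would convert to `tonic::Status`, whose `invalid_argument`/`not_found` constructors exist in tonic):

```rust
use std::fmt;

/// Assumed error variants; the real CarbideError likely has more.
#[derive(Debug)]
pub enum CarbideError {
    InvalidArgument(String),
    NotFound(String),
}

impl fmt::Display for CarbideError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CarbideError::InvalidArgument(msg) => write!(f, "invalid argument: {msg}"),
            CarbideError::NotFound(what) => write!(f, "not found: {what}"),
        }
    }
}

/// Stand-in for tonic::Status so the sketch compiles without tonic.
#[derive(Debug, PartialEq)]
pub struct Status {
    pub code: &'static str,
    pub message: String,
}

/// One central conversion keeps codes and messages consistent across handlers.
impl From<CarbideError> for Status {
    fn from(err: CarbideError) -> Self {
        let code = match &err {
            CarbideError::InvalidArgument(_) => "InvalidArgument",
            CarbideError::NotFound(_) => "NotFound",
        };
        Status { code, message: err.to_string() }
    }
}

/// Handler code then stays uniform: raise CarbideError, convert at the boundary.
fn require_rack_id(rack_id: Option<u64>) -> Result<u64, Status> {
    rack_id
        .ok_or(CarbideError::InvalidArgument("rack_id is required".into()))
        .map_err(Status::from)
}
```

With tonic, the same shape works by implementing `From<CarbideError> for tonic::Status` and letting `?` do the conversion in each handler.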
I see this come up a fair amount. Is this worth adding to the style guide for expectations on error handling?
Yes, we should add it there.
    let metadata = expected_rack.metadata.unwrap_or_default();
    let metadata = model::metadata::Metadata::try_from(metadata)
        .map_err(|e| Status::invalid_argument(format!("Invalid metadata: {}", e)))?;
We probably should have a metadata.validate() here and in all other handlers. Or an ExpectedRack::validate(), which internally verifies the metadata.
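A sketch of what that might look like, with validation centralized on `Metadata` and delegated to from `ExpectedRack`. Field names and validation rules here are assumptions for illustration, not the PR's actual checks:

```rust
use std::collections::BTreeMap;

/// Simplified Metadata; fields mirror the bindings in the handler above.
#[derive(Debug, Default)]
pub struct Metadata {
    pub name: String,
    pub description: String,
    pub labels: BTreeMap<String, String>,
}

impl Metadata {
    /// Centralized validation, callable from every handler.
    pub fn validate(&self) -> Result<(), String> {
        if self.name.is_empty() {
            return Err("metadata.name must not be empty".into());
        }
        if self.labels.keys().any(|k| k.is_empty()) {
            return Err("label keys must not be empty".into());
        }
        Ok(())
    }
}

/// ExpectedRack::validate() just delegates, so handlers need one call.
pub struct ExpectedRack {
    pub rack_id: u64,
    pub metadata: Metadata,
}

impl ExpectedRack {
    pub fn validate(&self) -> Result<(), String> {
        self.metadata.validate()
    }
}
```

A handler would then do `expected_rack.validate().map_err(Status::invalid_argument)?` (or the CarbideError equivalent) right after conversion.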
    // todo: now once all are ready, push inventory to rack manager
    // Check if each expected switch has reached SwitchControllerState::Ready.
    if !config.expected_switches.is_empty() {
We should move a lot of this into helper functions. Those could e.g. take expected_rack, rack_type_def, and the result of FindPowerShelfOrSwitchByRackId as input.
But this can all be done in a follow-up PR
    rpc DeleteExpectedRack(ExpectedRackRequest) returns (google.protobuf.Empty);
    // Update an expected rack
    rpc UpdateExpectedRack(ExpectedRack) returns (google.protobuf.Empty);
    // Get a specific expected rack
In the future we should probably look into whether to align these "GetExpected" APIs with the paginated API (like FindMachineIds(SearchFilter)).
But since all "expected" things are like this at the moment, this implementation is totally fine.
f98d375 to 5cdcf45 (force-push)
    /// erase deletes all expected racks.
    pub async fn erase(api_client: &ApiClient) -> CarbideCliResult<()> {
        api_client.0.delete_all_expected_racks().await?;
I wish we could have a confirmation here before erasing all the data. This is true for other Expected_Objects as well.
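One way to do that, sketched as a small testable helper (the function name and `--yes` escape hatch are assumptions, not existing CLI code):

```rust
/// Hypothetical confirmation helper: the destructive call proceeds only
/// when the user types "yes" (or an assume-yes flag, e.g. --yes, is set).
pub fn confirmed(answer: &str, assume_yes: bool) -> bool {
    assume_yes || answer.trim().eq_ignore_ascii_case("yes")
}

// Sketch of how erase() might gate the delete call:
//
//     if !confirmed(&read_line_from_stdin(), cli.assume_yes) {
//         return Ok(()); // abort without deleting anything
//     }
//     api_client.0.delete_all_expected_racks().await?;
```

The same guard could be shared by the erase paths of the other Expected_* objects.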
    let _: () = sqlx::query_as(query)
        .bind(&rack_type)
        .bind(&metadata.name)
        .bind(&metadata.description)
        .bind(sqlx::types::Json(&metadata.labels))
        .bind(expected_rack.rack_id)
        .fetch_one(txn)
        .await
        .map_err(|err: sqlx::Error| match err {
            sqlx::Error::RowNotFound => DatabaseError::NotFoundError {
                kind: "expected_rack",
                id: expected_rack.rack_id.to_string(),
            },
            _ => DatabaseError::query(query, err),
        })?;
Nit: you may find this more readable; up to you. It tells sqlx to fetch an optional and returns the NotFoundError if it's None, which avoids needing to inspect the sqlx error type:
Suggested change:

    sqlx::query_scalar::<_, Option<RackId>>(query)
        .bind(&rack_type)
        .bind(&metadata.name)
        .bind(&metadata.description)
        .bind(sqlx::types::Json(&metadata.labels))
        .bind(expected_rack.rack_id)
        .fetch_optional(txn)
        .await
        .map_err(|err| DatabaseError::query(query, err))?
        .ok_or_else(|| DatabaseError::NotFoundError {
            kind: "expected_rack",
            id: expected_rack.rack_id.to_string(),
        })?;
This introduces an `ExpectedRacks` management flow, where an expected rack contains:

- The `rack_id` we expect.
- The `rack_type`, which is how we know the number of compute trays, NVLink switches, and power shelves.

I'm introducing a basic/introductory `RackTypeConfig` here, but it can (and probably will) iterate over time.

From there, nodes continue to register for a given `rack_id` as they have been doing, but I moved that logic into `db_rack`, and we now have:

- `register_expected_machine`
- `register_expected_power_shelf`
- `register_expected_switch`

The behavior is the same as it was, but to me it reads a little bit better to see that a node is registering "into" a rack.

This then wires up the `Rack` state controller to look at its `rack_type` to check whether:

- The expected number of nodes of each type have registered into the rack.
- The nodes are linked.

This does NOT:

- Take into account node/tray position in the rack, which is a TODO.
- Introduce any revamping of the rack state controller states, which is also a TODO.

However, I did want to start getting this basic plumbing in place for us to iterate against, hopefully in a way that isn't super disruptive and is something we feel comfortable dropping in and continuing to build on.

Some choices I *DID* have to make here, though, which we could change now, are:

1. A node will still "initialize" an empty `Rack` if the `rack_id` doesn't exist yet (just like it does today), but with `rack_type: None`. This allows nodes to start populating a `RackConfig` with nodes even if the `ExpectedRack` doesn't exist yet. Once the `ExpectedRack` comes in, we'll find the matching `rack_id`, see that it has `rack_type: None`, and basically adopt it by giving it a `rack_type`. If it already has a `rack_type`, that means it's effectively been adopted; we'll error and require the user to "update" the rack.
2. For the state controller, if `rack_type: None` (and we haven't adopted it through an `ExpectedRack` entry yet), it will stay in `Expected`. We don't want it moving ahead if we don't know whether the rack is complete.
3. The `RackTypeConfig` gets embedded in the `RackConfig`. This will allow us both to navigate site config hiccups AND to let callers (e.g. DCIM) give us full `RackTypeConfigs` without relying on us having a matching config available in the Carbide config.

Another thing I was thinking is that we could make it so nodes only register with a rack if the `Rack` entry exists, which would require it to come in via `ExpectedRacks`; a node with a `rack_id` associated would then basically hang out in its own state controller waiting for a `Rack` to show up to adopt and link to. This is in contrast to the current approach, where a node will create a new rack and register itself with it.

Tests included!

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
5cdcf45 to 384568b (force-push)
Description
This introduces an `ExpectedRacks` management flow, where an expected rack contains:

- The `rack_id` we expect.
- The `rack_type`, which is how we know the number of compute trays, NVLink switches, and power shelves. This maps to a `RackTypeDefinition`.

Which leads me to: I'm introducing a basic/introductory `RackTypeDefinition` here, but it can (and probably will) iterate over time. This will exist in the Carbide config to start, but I've done it in a way that makes it so a DCIM can supply us with a full config as well (the full `RackTypeDefinition` is stored alongside the `Rack` in the database).

From there, nodes continue to register for a given `rack_id` as they have been doing, but I moved that logic into `db_rack`, and we now have:

- `register_expected_machine`
- `register_expected_power_shelf`
- `register_expected_switch`

The behavior is the same as it was, but to me it reads a little bit better to see that a node is registering "into" a rack.

This then wires up the `Rack` state controller to look at its `RackTypeDefinition` to check whether:

- The expected number of nodes of each type have registered into the rack.
- The nodes are linked.

This does NOT:

- Take into account node/tray position in the rack, which is a TODO.
- Introduce any revamping of the rack state controller states, which is also a TODO.

However, I did want to start getting this basic plumbing in place for us to iterate against, hopefully in a way that isn't super disruptive and is something we feel comfortable dropping in and continuing to build on.

Some choices I DID have to make here, though, which we could change now, are:

1. A node will still "initialize" an empty `Rack` if the `rack_id` doesn't exist yet (just like it does today), but with `rack_type: None`. This allows nodes to start populating a `RackConfig` with nodes even if the `ExpectedRack` doesn't exist yet. Once the `ExpectedRack` comes in, we'll find the matching `rack_id`, see that it has `rack_type: None`, and basically adopt it by giving it a `rack_type`. If it already has a `rack_type`, that means it's effectively been adopted; we'll error and require the user to "update" the rack.
2. For the state controller, if `rack_type: None` (and we haven't adopted it through an `ExpectedRack` entry yet), it will stay in `Expected`. We don't want it moving ahead if we don't know whether the rack is complete.
3. The `RackTypeDefinition` gets embedded in the `RackConfig`. This will allow us both to navigate site config hiccups AND to let callers (e.g. DCIM) give us full `RackTypeDefinitions` without relying on us having a matching config available in the Carbide config.

Another thing I was thinking is that we could make it so nodes only register with a rack if the `Rack` entry exists, which would require it to come in via `ExpectedRacks`; a node with a `rack_id` associated would then basically hang out in its own state controller waiting for a `Rack` to show up to adopt and link to. This is in contrast to the current approach, where a node will create a new rack and register itself with it.

Tests included.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
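The adoption rule in choice 1 above can be sketched as a small pure function. The names (`adopt_rack`, `AdoptError`) and the string rack type are illustrative stand-ins, not the PR's actual API:

```rust
/// Hypothetical sketch of the adoption rule from the description: a rack
/// auto-created by a registering node has rack_type = None; an incoming
/// ExpectedRack adopts it by setting the type, but a rack that already has
/// a type must be changed via an explicit update instead.
#[derive(Debug, PartialEq)]
pub enum AdoptError {
    AlreadyAdopted,
}

pub fn adopt_rack(existing: &mut Option<String>, incoming: String) -> Result<(), AdoptError> {
    match existing {
        None => {
            // Unadopted rack: take on the incoming rack type.
            *existing = Some(incoming);
            Ok(())
        }
        // Already adopted: error and require the user to "update" the rack.
        Some(_) => Err(AdoptError::AlreadyAdopted),
    }
}
```

The state-controller side of choice 2 follows the same predicate: while `existing` is still `None`, the rack stays in `Expected`.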