Skip to content

feat: refine expected racks implementation#529

Open
chet wants to merge 1 commit intoNVIDIA:mainfrom
chet:rms_expected_racks
Open

feat: refine expected racks implementation#529
chet wants to merge 1 commit intoNVIDIA:mainfrom
chet:rms_expected_racks

Conversation

@chet
Copy link
Contributor

@chet chet commented Mar 11, 2026

Description

This introduces an ExpectedRacks management flow, where an expected rack contains:

  • The rack_id we expect.
  • The rack_type, which is how we know the number of compute trays, NVLink switches, and power shelves. This maps to a RackTypeDefinition.

Which leads me to: I'm introducing a basic/introductory RackTypeDefinition here, but it can (and probably will) iterate over time. This will exist in the Carbide config to start, but I've done it in a way that makes it so a DCIM can supply us with a full config as well (the full RackTypeDefinition is stored alongside the Rack in the database).

From there, nodes continue to register for a given rack_id as they have been doing, but I moved that logic into db_rack, and we now have:

  • register_expected_machine
  • register_expected_power_shelf
  • register_expected_switch.

The behavior is the same as it was, but to me it reads a little bit better to see that a node is registering "into" a rack.

This then wires up the Rack state controller to look at it's RackTypeDefinition to check if:

  • The number of expected nodes of each type have registered into the rack.
  • The nodes are linked.

This does NOT:

  • Take into account node/tray position in the rack, which is a TODO.
  • Introduce any revamping of the rack state controller states, which is also a TODO.

However, I did want to start getting this basic plumbing in place for us to iterate against, hopefully in a way that isn't super disruptive and is something we feel comfortable dropping in + continuing to build on.

Some choices I DID have to make here, though, which we could change now, are:

  1. A node will still "initialize" an empty Rack if the rack_id doesn't exist yet (just like it does today), but with rack_type: None. This allow nodes to start populating a RackConfig with nodes even if the ExpectedRack doesn't exist yet. Once the ExpectedRack comes in, we'll find the matching rack_id, see that it has rack_type: None, and basically adopt it by giving it a rack_type. If it already has a rack_type, that means it's effectively been adopted and we'll error, and require the user to "update" the rack.
  2. For the state controller, if the rack_type=None (and we haven't adopted it through an ExpectedRack entry yet), it will stay in Expected. We don't want it moving ahead if we don't know if the rack is complete.
  3. The RackTypeDefinition gets embedded in the RackConfig. This will allow us to both navigate site config hiccups, AND allow callers (e.g. DCIM) to give us full RackTypeDefinitions without relying on us having a matching config available in the Carbide config.

Another thing I was thinking is that we could make it so nodes only register with a rack if the Rack entry exists, which would require it to come in via ExpectedRacks, so if a node ha s a rack_id associated, it will basically hang out in its own state controller waiting for a Rack to show up to adopt + link to. This is in contrast to the current approach, where a node will create a new rack and register itself with it.

Tests included.

Signed-off-by: Chet Nichols III chetn@nvidia.com

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@chet chet requested a review from a team as a code owner March 11, 2026 21:50
@github-actions
Copy link

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-03-11 21:51:53 UTC | Commit: 8e7ef0a

@github-actions
Copy link

github-actions bot commented Mar 11, 2026

🛡️ Vulnerability Scan

🚨 Found 72 vulnerability(ies)
📊 vs main: 72 (no change)

Severity Breakdown:

  • 🔴 Critical/High: 72
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0

🔗 View full details in Security tab

🕐 Last updated: 2026-03-16 20:50:28 UTC | Commit: 384568b

Copy link
Contributor

@Matthias247 Matthias247 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super quick pass-through.

The ExpectedRack definition, DB model and associated APIs look fine to me.
RackTypeDefinition is something we could think about further. But we can improve it in follow-up PRs too.


/// Human-readable description of this rack type.
pub description: String,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thing we might need more info than just the count. E.g. also a model type, in case different components require different handling?

Maybe we model it more like machine_capabilities inside the SKU?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, love this idea. I'll do this now instead of taking an iterative approach on it. Brb!

/// rack_type_definition is the full rack type definition embedded at
/// the time the expected rack is created or updated. This avoids
/// runtime lookups against the config file and ensures the rack
/// retains its definition even if the config changes later.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess its always debatable whether we want consistency (updating the type definition gives you the latest everywhere) or whether we want to keep old values also valid.

I'd probably lean towards consistency (load latest RackTypeDefinition) here - mostly because thats the same that we do for SKUs. Nice thing is that it allows us to add new properties to a rack type definition later.

let expected_rack = request.into_inner();
let rack_id = expected_rack
.rack_id
.ok_or_else(|| Status::invalid_argument("rack_id is required"))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use CarbideError::InvalidArgument/NotFound/etc and convert to tonic::Status for all error path (for consistent error messages and codes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this come up a fair amount. Is this worth adding to the style guide for expectations on error handling?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we should add it there

let metadata = expected_rack.metadata.unwrap_or_default();
let metadata = model::metadata::Metadata::try_from(metadata)
.map_err(|e| Status::invalid_argument(format!("Invalid metadata: {}", e)))?;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably should have a metadata.validate() here and for all other handlers. Or ExpectedRack::validate(), which internally verifies metadata


// todo: now once all are ready, push inventory to rack manager
// Check if each expected switch has reached SwitchControllerState::Ready.
if !config.expected_switches.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should move a lot of this into helper functions. Those could e.g. take expected_rack, rack_type_def, and the result of FindPowerShelfOrSwitchByRackId as input.

But this can all be done in a follow-up PR

rpc DeleteExpectedRack(ExpectedRackRequest) returns (google.protobuf.Empty);
// Update an expected rack
rpc UpdateExpectedRack(ExpectedRack) returns (google.protobuf.Empty);
// Get a specific expected rack
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future we should probably look into whether to align these "GetExpected" APIs with the paginated API (like FindMachineIds(SearchFilter)).

But since all "expected" things are like this at the moment, this implementation is totally fine.

@chet chet force-pushed the rms_expected_racks branch 2 times, most recently from f98d375 to 5cdcf45 Compare March 11, 2026 23:41

/// erase deletes all expected racks.
pub async fn erase(api_client: &ApiClient) -> CarbideCliResult<()> {
api_client.0.delete_all_expected_racks().await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we could have a confirmation here before erasing all the data. This is true for other Expected_Objects as well.

Comment on lines +118 to +132
let _: () = sqlx::query_as(query)
.bind(&rack_type)
.bind(&metadata.name)
.bind(&metadata.description)
.bind(sqlx::types::Json(&metadata.labels))
.bind(expected_rack.rack_id)
.fetch_one(txn)
.await
.map_err(|err: sqlx::Error| match err {
sqlx::Error::RowNotFound => DatabaseError::NotFoundError {
kind: "expected_rack",
id: expected_rack.rack_id.to_string(),
},
_ => DatabaseError::query(query, err),
})?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you may find this more readable, up to you (tells sqlx to get an optional and does the NotFoundError if it's None, which avoids needing to inspect the sqlx error type):

Suggested change
let _: () = sqlx::query_as(query)
.bind(&rack_type)
.bind(&metadata.name)
.bind(&metadata.description)
.bind(sqlx::types::Json(&metadata.labels))
.bind(expected_rack.rack_id)
.fetch_one(txn)
.await
.map_err(|err: sqlx::Error| match err {
sqlx::Error::RowNotFound => DatabaseError::NotFoundError {
kind: "expected_rack",
id: expected_rack.rack_id.to_string(),
},
_ => DatabaseError::query(query, err),
})?;
sqlx::query_scalar::<_, Option<RackId>>(query)
.bind(&rack_type)
.bind(&metadata.name)
.bind(&metadata.description)
.bind(sqlx::types::Json(&metadata.labels))
.bind(expected_rack.rack_id)
.fetch_optional(txn)
.await
.map_err(|err| DatabaseError::query(query, err))?
.ok_or_else(|| DatabaseError::NotFoundError {
kind: "expected_rack",
id: expected_rack.rack_id.to_string(),
})?;

This introduces an `ExpectedRacks` management flow, where an expected rack contains:
- The `rack_id` we expect.
- The `rack_type`, which is how we know the number of compute trays, NVLink switches, and power shelves.

I'm introducing a basic/introductory `RackTypeConfig` here, but it can (and probably will) iterate over time.

From there, nodes continue to register for a given `rack_id` as they have been doing, but I moved that logic into `db_rack`, and we now have:
- `register_expected_machine`
- `register_expected_power_shelf`
- `register_expected_switch`.

The behavior is the same as it was, but to me it reads a little bit better to see that a node is registering "into" a rack.

This then wires up the `Rack` state controller to look at it's `rack_type` to check if:
- The number of expected nodes of each type have registered into the rack.
- The nodes are linked.

This does NOT:
- Take into account node/tray position in the rack, which is a TODO.
- Introduce any revamping of the rack state controller states, which is also a TODO.

However, I did want to start getting this basic plumbing in place for us to iterate against, hopefully in a way that isn't super disruptive and is something we feel comfortable dropping in + continuing to build on.

Some choices I *DID* have to make here, though, which we could change now, are:
1. A node will still "initialize" an empty `Rack` if the `rack_id` doesn't exist yet (just like it does today), but with `rack_type: None`. This allow nodes to start populating a `RackConfig` with nodes even if the `ExpectedRack` doesn't exist yet. Once the `ExpectedRack` comes in, we'll find the matching `rack_id`, see that it has `rack_type: None`, and basically adopt it by giving it a `rack_type`. If it already has a `rack_type`, that means it's effectively been adopted and we'll error, and require the user to "update" the rack.
2. For the state controller, if the `rack_type=None` (and we haven't adopted it through an `ExpectedRack` entry yet), it will stay in `Expected`. We don't want it moving ahead if we don't know if the rack is complete.
3. The `RackTypeConfig` gets embedded in the `RackConfig`. This will allow us to both navigate site config hiccups, AND allow callers (e.g. DCIM) to give us full `RackTypeConfigs` without relying on us having a matching config available in the Carbide config.

Another thing I was thinking is that we could make it so nodes only register with a rack if the `Rack` entry exists, which would require it to come in via `ExpectedRacks`, so if a node has a `rack_id` associated, it will basically hang out in its own state controller waiting for a `Rack` to show up to adopt + link to. This is in contrast to the current approach, where a node will create a new rack and register itself with it.

Tests included!

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
@chet chet force-pushed the rms_expected_racks branch from 5cdcf45 to 384568b Compare March 16, 2026 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants