feat: refine expected racks implementation by chet · Pull Request #529 · NVIDIA/ncx-infra-controller-core

chet · 2026-03-11T21:50:04Z

Description

This introduces an ExpectedRacks management flow, where an expected rack contains:

The rack_id we expect.
The rack_type, which is how we know the number of compute trays, NVLink switches, and power shelves. This maps to a RackTypeDefinition.

Which leads me to: I'm introducing a basic/introductory RackTypeDefinition here, but it can (and probably will) iterate over time. This will exist in the Carbide config to start, but I've done it in a way that makes it so a DCIM can supply us with a full config as well (the full RackTypeDefinition is stored alongside the Rack in the database).

From there, nodes continue to register for a given rack_id as they have been doing, but I moved that logic into db_rack, and we now have:

register_expected_machine
register_expected_power_shelf
register_expected_switch.

The behavior is the same as it was, but to me it reads a little bit better to see that a node is registering "into" a rack.

This then wires up the Rack state controller to look at its RackTypeDefinition to check if:

The number of expected nodes of each type have registered into the rack.
The nodes are linked.
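
A rough sketch of the two checks above, for illustration only. The struct shapes, field names, and the `check_rack` function are hypothetical stand-ins, not the types from this PR. Treating "more nodes than expected" as complete is also an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for the PR's RackTypeDefinition.
#[derive(Debug, Clone, PartialEq)]
pub struct RackTypeDefinition {
    pub compute_trays: usize,
    pub nvlink_switches: usize,
    pub power_shelves: usize,
}

/// Hypothetical view of what has registered into a rack so far.
#[derive(Debug, Default)]
pub struct RegisteredNodes {
    pub machines: Vec<String>,
    pub switches: Vec<String>,
    pub power_shelves: Vec<String>,
    /// IDs of nodes that have been linked to the rack.
    pub linked: HashSet<String>,
}

#[derive(Debug, PartialEq)]
pub enum Completeness {
    /// No rack type yet, so "complete" is undefined.
    UnknownType,
    /// How many nodes of each kind are still missing.
    Missing { machines: usize, switches: usize, power_shelves: usize },
    /// Everything registered, but these node IDs are not linked yet.
    Unlinked(Vec<String>),
    Complete,
}

pub fn check_rack(def: Option<&RackTypeDefinition>, nodes: &RegisteredNodes) -> Completeness {
    let Some(def) = def else {
        return Completeness::UnknownType;
    };

    // Check 1: has the expected number of nodes of each type registered?
    let machines = def.compute_trays.saturating_sub(nodes.machines.len());
    let switches = def.nvlink_switches.saturating_sub(nodes.switches.len());
    let power_shelves = def.power_shelves.saturating_sub(nodes.power_shelves.len());
    if machines + switches + power_shelves > 0 {
        return Completeness::Missing { machines, switches, power_shelves };
    }

    // Check 2: are all registered nodes linked?
    let unlinked: Vec<String> = nodes
        .machines
        .iter()
        .chain(&nodes.switches)
        .chain(&nodes.power_shelves)
        .filter(|id| !nodes.linked.contains(*id))
        .cloned()
        .collect();

    if unlinked.is_empty() {
        Completeness::Complete
    } else {
        Completeness::Unlinked(unlinked)
    }
}
```

Returning the missing counts and unlinked IDs, rather than a bare bool, would let the state controller log why a rack has not advanced.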

This does NOT:

Take into account node/tray position in the rack, which is a TODO.
Introduce any revamping of the rack state controller states, which is also a TODO.

However, I did want to start getting this basic plumbing in place for us to iterate against, hopefully in a way that isn't super disruptive and is something we feel comfortable dropping in + continuing to build on.

Some choices I DID have to make here, though, which we could change now, are:

A node will still "initialize" an empty Rack if the rack_id doesn't exist yet (just like it does today), but with rack_type: None. This allows nodes to start populating a RackConfig even if the ExpectedRack doesn't exist yet. Once the ExpectedRack comes in, we'll find the matching rack_id, see that it has rack_type: None, and basically adopt it by giving it a rack_type. If it already has a rack_type, that means it's effectively been adopted, so we'll error and require the user to "update" the rack.
For the state controller, if the rack_type=None (and we haven't adopted it through an ExpectedRack entry yet), it will stay in Expected. We don't want it moving ahead if we don't know if the rack is complete.
The RackTypeDefinition gets embedded in the RackConfig. This will allow us to both navigate site config hiccups, AND allow callers (e.g. DCIM) to give us full RackTypeDefinitions without relying on us having a matching config available in the Carbide config.
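
To make the adoption rules above concrete, here is a minimal in-memory sketch. `RackStore`, `AdoptError`, `add_expected_rack`, and `has_known_type` are hypothetical names invented for this example. The real logic lives in the database layer and is not reproduced here.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down rack record.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Rack {
    /// None until an ExpectedRack adopts this rack.
    pub rack_type: Option<String>,
    pub machines: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum AdoptError {
    /// The rack already has a rack_type; the caller should "update" instead.
    AlreadyAdopted { rack_id: String, rack_type: String },
}

/// Hypothetical in-memory stand-in for the racks table.
#[derive(Debug, Default)]
pub struct RackStore {
    pub racks: HashMap<String, Rack>,
}

impl RackStore {
    /// A node registering "into" a rack creates an empty rack
    /// (rack_type: None) if the rack_id does not exist yet.
    pub fn register_expected_machine(&mut self, rack_id: &str, machine_id: &str) {
        self.racks
            .entry(rack_id.to_string())
            .or_default()
            .machines
            .push(machine_id.to_string());
    }

    /// Adding an ExpectedRack adopts an existing untyped rack, or creates
    /// the rack if no node has registered yet. It errors if already adopted.
    pub fn add_expected_rack(&mut self, rack_id: &str, rack_type: &str) -> Result<(), AdoptError> {
        let rack = self.racks.entry(rack_id.to_string()).or_default();
        if let Some(existing) = &rack.rack_type {
            return Err(AdoptError::AlreadyAdopted {
                rack_id: rack_id.to_string(),
                rack_type: existing.clone(),
            });
        }
        rack.rack_type = Some(rack_type.to_string());
        Ok(())
    }

    /// The state controller keeps a rack in Expected while this is false.
    pub fn has_known_type(&self, rack_id: &str) -> bool {
        self.racks
            .get(rack_id)
            .and_then(|rack| rack.rack_type.as_ref())
            .is_some()
    }
}
```

In this sketch, nodes that registered before the ExpectedRack arrived stay attached to the rack after adoption, which is the point of the first choice.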

Another thing I was thinking is that we could make it so nodes only register with a rack if the Rack entry exists, which would require it to come in via ExpectedRacks. If a node has a rack_id associated, it would basically hang out in its own state controller waiting for a Rack to show up to adopt + link to. This is in contrast to the current approach, where a node will create a new rack and register itself with it.

Tests included.

Signed-off-by: Chet Nichols III chetn@nvidia.com

Type of Change

Add - New feature or capability
Change - Changes in existing functionality
Fix - Bug fixes
Remove - Removed features or deprecated functionality
Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

This PR contains breaking changes

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed
No testing required (docs, internal refactor, etc.)

Additional Notes

github-actions · 2026-03-11T21:51:54Z

🔐 TruffleHog Secret Scan

✅ No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🕐 Last updated: 2026-03-11 21:51:53 UTC | Commit: 8e7ef0a

github-actions · 2026-03-11T21:52:27Z

🛡️ Vulnerability Scan

🚨 Found 72 vulnerability(ies)
📊 vs main: 72 (no change)

Severity Breakdown:

🔴 Critical/High: 72
🟡 Medium: 0
🔵 Low/Info: 0

🕐 Last updated: 2026-03-16 20:50:28 UTC | Commit: 384568b

Matthias247

super quick pass-through.

The ExpectedRack definition, DB model and associated APIs look fine to me.
RackTypeDefinition is something we could think about further. But we can improve it in follow-up PRs too.

Matthias247 · 2026-03-11T22:16:34Z

crates/api-model/src/rack_type.rs

+
+    /// Human-readable description of this rack type.
+    pub description: String,
+


I think we might need more info than just the count. E.g. also a model type, in case different components require different handling?

Maybe we model it more like machine_capabilities inside the SKU?

Yup, love this idea. I'll do this now instead of taking an iterative approach on it. Brb!
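
For illustration, one way a capability-style definition could look. This is a hypothetical sketch of the idea in this thread, not the shape the PR ended up with. `ComponentKind`, `RackComponent`, and the `model` field are assumed names.

```rust
/// Hypothetical component kinds a rack type can describe.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComponentKind {
    ComputeTray,
    NvlinkSwitch,
    PowerShelf,
}

/// One capability-style entry: what kind, which model, and how many.
#[derive(Debug, Clone, PartialEq)]
pub struct RackComponent {
    pub kind: ComponentKind,
    /// Hypothetical model identifier, so different models of the same
    /// kind can get different handling.
    pub model: String,
    pub count: u32,
}

#[derive(Debug, Clone, PartialEq)]
pub struct RackTypeDefinition {
    /// Human-readable description of this rack type.
    pub description: String,
    pub components: Vec<RackComponent>,
}

impl RackTypeDefinition {
    /// Total expected count for a kind, summed across all models.
    pub fn expected_count(&self, kind: ComponentKind) -> u32 {
        self.components
            .iter()
            .filter(|component| component.kind == kind)
            .map(|component| component.count)
            .sum()
    }
}
```

A list of entries keeps the plain per-kind count available through `expected_count`, while leaving room for model-specific handling later.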

Matthias247 · 2026-03-11T22:20:00Z

crates/api-model/src/rack.rs

+    /// rack_type_definition is the full rack type definition embedded at
+    /// the time the expected rack is created or updated. This avoids
+    /// runtime lookups against the config file and ensures the rack
+    /// retains its definition even if the config changes later.


I guess it's always debatable whether we want consistency (updating the type definition gives you the latest everywhere) or whether we want to keep old values valid as well.

I'd probably lean towards consistency (load latest RackTypeDefinition) here, mostly because that's the same thing we do for SKUs. The nice thing is that it allows us to add new properties to a rack type definition later.
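
The two options can be sketched as a lookup policy. Everything here is hypothetical: `Policy`, `resolve`, and the fallback to the embedded copy under `Latest` are illustrative choices, not something this thread settled on.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down definition for this example.
#[derive(Debug, Clone, PartialEq)]
pub struct RackTypeDefinition {
    pub description: String,
    pub compute_trays: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Policy {
    /// Always use the definition embedded with the rack.
    Snapshot,
    /// Prefer the current config entry for the rack_type. Falling back
    /// to the embedded copy is an assumption made in this sketch.
    Latest,
}

pub fn resolve<'a>(
    policy: Policy,
    rack_type: &str,
    config: &'a HashMap<String, RackTypeDefinition>,
    embedded: Option<&'a RackTypeDefinition>,
) -> Option<&'a RackTypeDefinition> {
    match policy {
        Policy::Snapshot => embedded,
        Policy::Latest => config.get(rack_type).or(embedded),
    }
}
```

Under `Latest`, a definition supplied directly by a DCIM with no matching config entry would still resolve through the embedded copy.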

Matthias247 · 2026-03-11T22:22:35Z

crates/api/src/handlers/expected_rack.rs

+    let expected_rack = request.into_inner();
+    let rack_id = expected_rack
+        .rack_id
+        .ok_or_else(|| Status::invalid_argument("rack_id is required"))?;


Use CarbideError::InvalidArgument/NotFound/etc. and convert to tonic::Status for all error paths (for consistent error messages and codes).

I see this come up a fair amount. Is this worth adding to the style guide for expectations on error handling?

yes we should add it there
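
A self-contained sketch of the pattern being asked for. `Status` and `Code` below are local stand-ins for `tonic::Status` and `tonic::Code`, so the example compiles without dependencies. The `CarbideError` variants shown are assumed shapes, not the real enum, and `get_expected_rack` is a made-up handler.

```rust
use std::fmt;

/// Local stand-in for tonic::Code.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Code {
    InvalidArgument,
    NotFound,
}

/// Local stand-in for tonic::Status.
#[derive(Debug, PartialEq)]
pub struct Status {
    pub code: Code,
    pub message: String,
}

/// Assumed shape of the domain error; the real CarbideError may differ.
#[derive(Debug, PartialEq)]
pub enum CarbideError {
    InvalidArgument(String),
    NotFound { kind: &'static str, id: String },
}

impl fmt::Display for CarbideError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CarbideError::InvalidArgument(msg) => write!(f, "invalid argument: {msg}"),
            CarbideError::NotFound { kind, id } => write!(f, "{kind} not found: {id}"),
        }
    }
}

/// One conversion point, so every handler gets the same codes and wording.
impl From<CarbideError> for Status {
    fn from(err: CarbideError) -> Self {
        let code = match &err {
            CarbideError::InvalidArgument(_) => Code::InvalidArgument,
            CarbideError::NotFound { .. } => Code::NotFound,
        };
        Status { code, message: err.to_string() }
    }
}

/// Hypothetical handler: `?` and `.into()` go through the From impl above.
pub fn get_expected_rack(rack_id: Option<&str>, known: &[&str]) -> Result<String, Status> {
    let rack_id = rack_id
        .ok_or_else(|| CarbideError::InvalidArgument("rack_id is required".to_string()))?;
    if !known.contains(&rack_id) {
        return Err(CarbideError::NotFound { kind: "expected_rack", id: rack_id.to_string() }.into());
    }
    Ok(rack_id.to_string())
}
```

The handler never builds a status by hand, which is what keeps messages and codes consistent across error paths.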

Matthias247 · 2026-03-11T22:23:48Z

crates/api/src/handlers/expected_rack.rs

+    let metadata = expected_rack.metadata.unwrap_or_default();
+    let metadata = model::metadata::Metadata::try_from(metadata)
+        .map_err(|e| Status::invalid_argument(format!("Invalid metadata: {}", e)))?;
+


we probably should have a metadata.validate() here and for all other handlers. Or ExpectedRack::validate(), which internally verifies metadata
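
A possible shape for this, as a sketch only. The validation rules (a name length limit, no blank label keys) are invented for illustration. The real constraints on metadata are not stated in this thread.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum ValidationError {
    EmptyRackId,
    NameTooLong { max: usize },
    EmptyLabelKey,
}

/// Hypothetical, simplified metadata model.
#[derive(Debug, Default)]
pub struct Metadata {
    pub name: String,
    pub description: String,
    pub labels: HashMap<String, String>,
}

impl Metadata {
    /// Made-up limit, purely for this example.
    pub const MAX_NAME_LEN: usize = 256;

    pub fn validate(&self) -> Result<(), ValidationError> {
        if self.name.chars().count() > Self::MAX_NAME_LEN {
            return Err(ValidationError::NameTooLong { max: Self::MAX_NAME_LEN });
        }
        if self.labels.keys().any(|key| key.trim().is_empty()) {
            return Err(ValidationError::EmptyLabelKey);
        }
        Ok(())
    }
}

/// Hypothetical, simplified expected rack model.
#[derive(Debug, Default)]
pub struct ExpectedRack {
    pub rack_id: String,
    pub rack_type: String,
    pub metadata: Metadata,
}

impl ExpectedRack {
    /// Validates its own fields, then delegates to the metadata check,
    /// so handlers only need one call.
    pub fn validate(&self) -> Result<(), ValidationError> {
        if self.rack_id.trim().is_empty() {
            return Err(ValidationError::EmptyRackId);
        }
        self.metadata.validate()
    }
}
```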

Matthias247 · 2026-03-11T22:26:54Z

crates/api/src/state_controller/rack/handler.rs


-                // todo: now once all are ready, push inventory to rack manager
+                // Check if each expected switch has reached SwitchControllerState::Ready.
+                if !config.expected_switches.is_empty() {


We should move a lot of this into helper functions. Those could e.g. take expected_rack, rack_type_def, and the result of FindPowerShelfOrSwitchByRackId as input.

But this can all be done in a follow-up PR
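
As a sketch of what one such helper might look like: the `SwitchControllerState` variants other than `Ready`, the function name, and the plain `HashMap` input are assumptions standing in for the real lookup result.

```rust
use std::collections::HashMap;

/// Stand-in for the real state enum; only Ready is named in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SwitchControllerState {
    Initializing,
    Ready,
}

/// Returns the expected switch IDs that are not Ready yet. A switch with
/// no known state counts as not ready. An empty `expected` list yields an
/// empty result, so callers need no separate is_empty() check.
pub fn switches_not_ready<'a>(
    expected: &'a [String],
    states: &HashMap<String, SwitchControllerState>,
) -> Vec<&'a str> {
    expected
        .iter()
        .filter(|id| states.get(*id) != Some(&SwitchControllerState::Ready))
        .map(String::as_str)
        .collect()
}
```

The same shape would presumably work for power shelves and machines, each with its own state type.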

Matthias247 · 2026-03-11T22:28:52Z

crates/rpc/proto/forge.proto

+  rpc DeleteExpectedRack(ExpectedRackRequest) returns (google.protobuf.Empty);
+  // Update an expected rack
+  rpc UpdateExpectedRack(ExpectedRack) returns (google.protobuf.Empty);
+  // Get a specific expected rack


In the future we should probably look into whether to align these "GetExpected" APIs with the paginated API (like FindMachineIds(SearchFilter)).

But since all "expected" things are like this at the moment, this implementation is totally fine.

abvarshney-nv · 2026-03-13T13:42:49Z

crates/admin-cli/src/expected_rack/erase/cmd.rs

+
+/// erase deletes all expected racks.
+pub async fn erase(api_client: &ApiClient) -> CarbideCliResult<()> {
+    api_client.0.delete_all_expected_racks().await?;


I wish we could have a confirmation here before erasing all the data. This is true for other Expected_Objects as well.
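
A minimal sketch of such a prompt using only the standard library. The `confirm` helper is hypothetical, and how it would be wired into the admin CLI is left open.

```rust
use std::io::{self, BufRead, Write};

/// Writes a [y/N] prompt and returns true only on an explicit "y" or
/// "yes" (case-insensitive). Anything else, including an empty line or
/// end of input, counts as "no".
pub fn confirm<R: BufRead, W: Write>(mut input: R, mut output: W, prompt: &str) -> io::Result<bool> {
    write!(output, "{prompt} [y/N]: ")?;
    output.flush()?;

    let mut line = String::new();
    input.read_line(&mut line)?;

    let answer = line.trim().to_ascii_lowercase();
    Ok(answer == "y" || answer == "yes")
}
```

The erase command could call this with `io::stdin().lock()` and `io::stdout()` before deleting anything. Taking generic readers and writers keeps the helper testable without a terminal.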

kensimon · 2026-03-16T15:22:31Z

crates/api-db/src/expected_rack.rs

+    let _: () = sqlx::query_as(query)
+        .bind(&rack_type)
+        .bind(&metadata.name)
+        .bind(&metadata.description)
+        .bind(sqlx::types::Json(&metadata.labels))
+        .bind(expected_rack.rack_id)
+        .fetch_one(txn)
+        .await
+        .map_err(|err: sqlx::Error| match err {
+            sqlx::Error::RowNotFound => DatabaseError::NotFoundError {
+                kind: "expected_rack",
+                id: expected_rack.rack_id.to_string(),
+            },
+            _ => DatabaseError::query(query, err),
+        })?;


Nit: you may find this more readable, up to you (tells sqlx to get an optional and does the NotFoundError if it's None, which avoids needing to inspect the sqlx error type):

Suggested change

-    let _: () = sqlx::query_as(query)
-        .bind(&rack_type)
-        .bind(&metadata.name)
-        .bind(&metadata.description)
-        .bind(sqlx::types::Json(&metadata.labels))
-        .bind(expected_rack.rack_id)
-        .fetch_one(txn)
-        .await
-        .map_err(|err: sqlx::Error| match err {
-            sqlx::Error::RowNotFound => DatabaseError::NotFoundError {
-                kind: "expected_rack",
-                id: expected_rack.rack_id.to_string(),
-            },
-            _ => DatabaseError::query(query, err),
-        })?;
+    sqlx::query_scalar::<_, Option<RackId>>(query)
+        .bind(&rack_type)
+        .bind(&metadata.name)
+        .bind(&metadata.description)
+        .bind(sqlx::types::Json(&metadata.labels))
+        .bind(expected_rack.rack_id)
+        .fetch_optional(txn)
+        .await
+        .map_err(|err| DatabaseError::query(query, err))?
+        .ok_or_else(|| DatabaseError::NotFoundError {
+            kind: "expected_rack",
+            id: expected_rack.rack_id.to_string(),
+        })?;

chet requested a review from a team as a code owner March 11, 2026 21:50

Matthias247 reviewed Mar 11, 2026

chet force-pushed the rms_expected_racks branch 2 times, most recently from f98d375 to 5cdcf45 Compare March 11, 2026 23:41

abvarshney-nv reviewed Mar 13, 2026

kensimon approved these changes Mar 16, 2026

chet force-pushed the rms_expected_racks branch from 5cdcf45 to 384568b Compare March 16, 2026 20:48

