Commit a93c69f

Merge branch 'main' into feat/platform-and-nvl-records
2 parents 59273a7 + d287b3a commit a93c69f

183 files changed

Lines changed: 5301 additions & 3054 deletions

AGENTS.md

Lines changed: 4 additions & 4 deletions
@@ -32,7 +32,7 @@ ncx-infra-controller-core/
 │ # [workspace] members list in `Cargo.toml` — each
 │ # crate's own `Cargo.toml` has a `description` field.
 │ # Note: the directory name does NOT always equal the
-│ # crate name (e.g. crates/api/ → crate carbide-api).
+│ # crate name (e.g. crates/api/ → crate nico-api).
 │ # Use `grep '^name =' crates/<dir>/Cargo.toml | head -1`
 │ # to get the actual crate name before running
 │ # `cargo test -p <name>` or similar.
@@ -42,7 +42,7 @@ ncx-infra-controller-core/
 ├── helm/ # Helm chart for Kubernetes deployment
 ├── bluefield/ # BlueField DPU-specific components
 ├── pxe/ # PXE boot artifact generation
-├── lints/ # Custom Clippy lints (carbide-lints crate)
+├── lints/ # Custom Clippy lints (nico-lints crate)
 ├── include/ # Shared Makefile fragments
 ├── .github/ # GitHub Actions workflows and templates
 ├── Cargo.toml # Workspace dependency management
@@ -104,7 +104,7 @@ cargo make pre-commit-verify-workspace
 
 # Individual checks:
 cargo make clippy # Clippy linter (warnings = errors)
-cargo make carbide-lints # Custom carbide lints (requires nightly setup)
+cargo make nico-lints # Custom nico lints (requires nightly setup)
 cargo make check-format-flow # Check rustfmt formatting
 cargo make check-format-nightly # Check import grouping/sorting (requires nightly)
 cargo make check-workspace-deps # Validate dependency declarations in Cargo.toml
@@ -117,7 +117,7 @@ cargo make format-nightly # Also sort imports
 ```
 
 > **Note:** The nightly toolchain is used only for `check-format-nightly` and
-> `carbide-lints`. The stable toolchain pinned in `rust-toolchain.toml` is used
+> `nico-lints`. The stable toolchain pinned in `rust-toolchain.toml` is used
 > for everything else.
 
 ### Top-level Makefile (rest-api entrypoint)

Cargo.lock

Lines changed: 1 addition & 2 deletions
Generated file; diff not rendered by default.

README.md

Lines changed: 9 additions & 9 deletions
@@ -20,8 +20,8 @@ of the bare-metal lifecycle to fast-track building next generation AI Cloud offe
 | Layer | What it installs | Helm release |
 |-------|-----------------|--------------|
 | **Common services** | MetalLB, cert-manager, Vault, external-secrets, PostgreSQL | via `helmfile` in `helm-prereqs/` |
-| **Carbide Core** | NVIDIA Infra Controller (this repo's `helm/` chart) | `carbide` in `forge-system` |
-| **Carbide REST** | NVIDIA Infra Controller's REST API, Temporal, Keycloak, site-agent | `carbide-rest` + `carbide-rest-site-agent` in `carbide-rest` |
+| **NICo Core** | NVIDIA Infra Controller (this repo's `helm/` chart) | `nico` in `nico-system` |
+| **NICo REST** | NVIDIA Infra Controller's REST API, Temporal, Keycloak, site-agent | `nico-rest` + `nico-rest-site-agent` in `nico-rest` |
 
 ### Prerequisites
 
@@ -33,20 +33,20 @@ of the bare-metal lifecycle to fast-track building next generation AI Cloud offe
 
 ```bash
 # 1. Build and push images to your registry
-# Carbide Core image: <your-registry>/nvmetal-carbide:<tag> (this repo)
-# Carbide REST images: <your-registry>/carbide-rest-api:<tag>, etc. (infra-controller-rest)
+# NICo Core image: <your-registry>/nvmetal-nico:<tag> (this repo)
+# NICo REST images: <your-registry>/nico-rest-api:<tag>, etc. (infra-controller-rest)
 
 # 2. Set environment variables
 export KUBECONFIG=/path/to/kubeconfig
 export REGISTRY_PULL_SECRET=<your-registry-pull-secret-or-ngc-api-key>
 export NCX_IMAGE_REGISTRY=<your-registry> # e.g. my-registry.example.com/infra-controller
-export NCX_CORE_IMAGE_TAG=<carbide-core-tag> # e.g. v2025.12.30
-export NCX_REST_IMAGE_TAG=<carbide-rest-tag> # e.g. v1.0.4
+export NCX_CORE_IMAGE_TAG=<nico-core-tag> # e.g. v2025.12.30
+export NCX_REST_IMAGE_TAG=<nico-rest-tag> # e.g. v1.0.4
 
 # 3. Customize site-specific values
 # Edit helm-prereqs/values/ncx-core.yaml:
-# carbide-api.hostname — your site's external API hostname
-# carbide-api.siteConfig — network pools, VLAN ranges, IB config, MetalLB VIPs
+# nico-api.hostname — your site's external API hostname
+# nico-api.siteConfig — network pools, VLAN ranges, IB config, MetalLB VIPs
 # Edit helm-prereqs/values/metallb-config.yaml:
 # IPAddressPool, BGPPeer — your site's VIP ranges and TOR switch config
 # Edit helm-prereqs/values.yaml:
@@ -55,7 +55,7 @@ export NCX_REST_IMAGE_TAG=<carbide-rest-tag> # e.g. v1.0.4
 # 4. Point NCX_REPO at infra-controller-rest (auto-detected if a sibling directory)
 export NCX_REPO=/path/to/infra-controller-rest # optional
 
-# 5. Run setup — installs common services, Carbide Core, and Carbide REST in order
+# 5. Run setup — installs common services, NICo Core, and NICo REST in order
 cd helm-prereqs
 ./setup.sh # interactive
 ./setup.sh -y # non-interactive (CI/CD)

STYLE_GUIDE.md

Lines changed: 15 additions & 15 deletions
@@ -129,18 +129,18 @@ using interpolation if it makes sense.
 
 ### Core API handler Errors
 
-Inside API handlers, the `CarbideError` data type should be used to construct errors. It should then be converted into
-`tonic::Status` using `.into()`. All errors being derived from `CarbideError` assures that the errors will look uniform
+Inside API handlers, the `NicoError` data type should be used to construct errors. It should then be converted into
+`tonic::Status` using `.into()`. All errors being derived from `NicoError` assures that the errors will look uniform
 to tenants.
 
-The `CarbideError` variant that is used should be selected based on whether the error gets returned due to the user
+The `NicoError` variant that is used should be selected based on whether the error gets returned due to the user
 passing invalid arguments or due to the system not being able to handle the request correctly. Error variants that
 should be used if the user passing invalid arguments can be `InvalidArgument`, `InvalidConfiguration`, `NotFoundError`
 or `ConcurrentModificationError` - these will map to "4xx-like" gRPC error codes. An example of a system-side error
-would be `CarbideError::Internal`.
+would be `NicoError::Internal`.
 
 ```rust
-// Avoid — constructing Status directly, bypassing `CarbideError` error mapping
+// Avoid — constructing Status directly, bypassing `NicoError` error mapping
 pub async fn create_resource(
     api: &Api,
     request: Request<rpc::Resource>,
@@ -151,15 +151,15 @@ pub async fn create_resource(
         .ok_or_else(|| Status::invalid_argument("id is required"))?;
 }
 
-// Prefer — uses `CarbideError::InvalidArgument`
+// Prefer — uses `NicoError::InvalidArgument`
 pub async fn create_resource(
     api: &Api,
     request: Request<rpc::Resource>,
 ) -> Result<Response<()>, Status> {
     let resource = request.into_inner();
     let id = resource
         .id
-        .ok_or(CarbideError::InvalidArgument("id is required".into()))?;
+        .ok_or(NicoError::InvalidArgument("id is required".into()))?;
 }
 ```
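
Reviewer sketch (not part of the commit): the hunk above relies on `?` converting a `NicoError` into a status. The std-only example below shows how that conversion could work. The variant names come from the style guide text; the `Code` and `Status` types, the `String` payloads, and the specific variant-to-code mapping are assumptions standing in for `tonic::Status`, which this diff does not show.

```rust
// Hypothetical stand-in for gRPC status codes (the real code targets `tonic::Status`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Code {
    InvalidArgument,
    NotFound,
    Aborted,
    Internal,
}

// Hypothetical stand-in for `tonic::Status`.
#[derive(Debug, PartialEq, Eq)]
struct Status {
    code: Code,
    message: String,
}

// Variant names follow the style guide; the String payloads are assumed.
#[derive(Debug)]
enum NicoError {
    InvalidArgument(String),
    InvalidConfiguration(String),
    NotFoundError(String),
    ConcurrentModificationError(String),
    Internal(String),
}

impl From<NicoError> for Status {
    fn from(err: NicoError) -> Self {
        // Assumed mapping: caller-side errors get "4xx-like" codes,
        // system-side errors map to Internal.
        let (code, message) = match err {
            NicoError::InvalidArgument(m) => (Code::InvalidArgument, m),
            NicoError::InvalidConfiguration(m) => (Code::InvalidArgument, m),
            NicoError::NotFoundError(m) => (Code::NotFound, m),
            NicoError::ConcurrentModificationError(m) => (Code::Aborted, m),
            NicoError::Internal(m) => (Code::Internal, m),
        };
        Status { code, message }
    }
}

// The `?` operator calls `From<NicoError> for Status`, so the handler never
// builds a Status by hand.
fn create_resource(id: Option<u32>) -> Result<u32, Status> {
    let id = id.ok_or(NicoError::InvalidArgument("id is required".into()))?;
    Ok(id)
}

fn main() {
    println!("{:?}", create_resource(Some(7)));
    println!("{:?}", create_resource(None));
}
```

Keeping the mapping in one `From` impl is what makes errors look uniform to callers: handlers only choose a variant.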

@@ -172,14 +172,14 @@ checks for each meaningful combination of feature flags we support, which scales
 
 Cases where features *are* warranted:
 
-- For shared crates when only a subset of dependents need certain code: For example, the `carbide_uuid` is used by
-several dependents, but only the `carbide_api` crate needs the sqlx conversions. We don't want e.g.
-`carbide_admin_cli` to take a dependency on `sqlx`, so the sqlx conversions are behind a `sqlx` crate feature. But
+- For shared crates when only a subset of dependents need certain code: For example, the `nico_uuid` is used by
+several dependents, but only the `nico_api` crate needs the sqlx conversions. We don't want e.g.
+`nico_admin_cli` to take a dependency on `sqlx`, so the sqlx conversions are behind a `sqlx` crate feature. But
 this is covered by CI tests, since CI builds both the admin-cli and the api crate, both sets of features are
 exercised.
 
-- For supporting non-linux builds: The `carbide_api` crate needs to use types from the `tss-esapi` crate to support
-validating secure-boot keys, but `tss-esapi` only builds on Linux. To support developers running `carbide_api` on
+- For supporting non-linux builds: The `nico_api` crate needs to use types from the `tss-esapi` crate to support
+validating secure-boot keys, but `tss-esapi` only builds on Linux. To support developers running `nico_api` on
 their Mac for testing, the parts which require `tss-esapi` are carefully carved out into a `linux-build` feature
 (which is enabled by default). We do not run CI tests with this feature disabled, so supporting a build without
 `linux-build` enabled is best-effort.
@@ -217,10 +217,10 @@ Avoid spawning background tasks without joining them. Any panics that happen in
 the rest of the process unless you join them via `JoinHandle::join()` or add them to a `JoinSet` which is later awaited
 with `JoinSet::join_all()`.
 
-For carbide-api, we use a single `JoinSet` to spawn all background tasks, and call `join_all()` to block "forever" until
+For nico-api, we use a single `JoinSet` to spawn all background tasks, and call `join_all()` to block "forever" until
 the process is shut down. This makes it so any panics in the JoinSet will propagate to the main task, and crash the
 process (which is what we want.) If you want to spawn background work, prefer accepting a `&mut JoinSet` and spawn your
-background task into it. Your task can be constructed it inside `carbide::setup::initialize_and_spawn_controllers`,
+background task into it. Your task can be constructed it inside `nico::setup::initialize_and_spawn_controllers`,
 which has a JoinSet it can pass to your `start()` function.
 
 Avoid using `oneshot::Sender<()>` as a cancellation signal, and prefer tokio_util's `CancellationToken`, which can
@@ -286,7 +286,7 @@ impl ClientlessBackgroundJob {
 ```
 
 Avoid mixing the approaches and returning an RAII handle for "client-less" background tasks, if it only exists to stop
-the task when dropped. In carbide-api, there are many such client-less background jobs, and storing each of their
+the task when dropped. In nico-api, there are many such client-less background jobs, and storing each of their
 handles for the correct lifetime is awkward and error-prone. Propagating a single top-level CancellationToken to each of
 them is the preferred approach.
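
Reviewer sketch (not part of the commit): the two rules in these hunks are to join every background task so panics surface, and to share one top-level cancellation signal. The example below illustrates both with the standard library only. `std::thread` and `Arc<AtomicBool>` are used as stand-ins for tokio's `JoinSet` and tokio_util's `CancellationToken`; `spawn_worker` and `join_all` are hypothetical names, and this is an analogy rather than the project's code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Spawns a worker that loops until the shared `cancel` flag is set and
/// returns how many iterations it ran.
fn spawn_worker(cancel: Arc<AtomicBool>) -> thread::JoinHandle<u64> {
    thread::spawn(move || {
        let mut ticks = 0;
        while !cancel.load(Ordering::Relaxed) {
            ticks += 1;
            thread::sleep(Duration::from_millis(1));
        }
        ticks
    })
}

/// Joins every handle. A panic inside a worker only becomes visible here:
/// without the join it would be silently lost.
fn join_all<T>(handles: Vec<thread::JoinHandle<T>>) -> Result<Vec<T>, String> {
    let mut results = Vec::new();
    for handle in handles {
        match handle.join() {
            Ok(value) => results.push(value),
            Err(_) => return Err("a background task panicked".to_string()),
        }
    }
    Ok(results)
}

fn main() {
    // One top-level cancellation flag is cloned into every worker.
    let cancel = Arc::new(AtomicBool::new(false));
    let handles = vec![spawn_worker(cancel.clone()), spawn_worker(cancel.clone())];

    thread::sleep(Duration::from_millis(5));
    cancel.store(true, Ordering::Relaxed);

    let ticks = join_all(handles).expect("workers exit cleanly");
    println!("workers stopped after {:?} ticks", ticks);

    // A panicking worker is reported by the join instead of being ignored.
    let failing = vec![thread::spawn(|| -> u64 { panic!("boom") })];
    assert!(join_all(failing).is_err());
}
```

Because every worker shares the same flag, no caller has to keep a per-task handle alive just to stop it.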

bluefield/otel/puntstatsreceiver/README.md

Lines changed: 4 additions & 4 deletions
@@ -17,8 +17,8 @@ The punt stats receiver generates metrics like the following:
 ### Punt Stats
 
 ```
-punt_stats_bytes_total{component="punt_stats",dropped="false",host_name="10-217-170-242.local.forge",protocol="dhcp"} 862206
-punt_stats_bytes_total{component="punt_stats",dropped="true",host_name="10-217-170-242.local.forge",protocol="dhcp"} 0
-punt_stats_packets_total{component="punt_stats",dropped="false",host_name="10-217-170-242.local.forge",protocol="dhcp"} 2686
-punt_stats_packets_total{component="punt_stats",dropped="true",host_name="10-217-170-242.local.forge",protocol="dhcp"} 0
+punt_stats_bytes_total{component="punt_stats",dropped="false",host_name="10-217-170-242.local.nico",protocol="dhcp"} 862206
+punt_stats_bytes_total{component="punt_stats",dropped="true",host_name="10-217-170-242.local.nico",protocol="dhcp"} 0
+punt_stats_packets_total{component="punt_stats",dropped="false",host_name="10-217-170-242.local.nico",protocol="dhcp"} 2686
+punt_stats_packets_total{component="punt_stats",dropped="true",host_name="10-217-170-242.local.nico",protocol="dhcp"} 0
 ```

crates/admin-cli/DEVELOPMENT.md

Lines changed: 12 additions & 12 deletions
@@ -1,7 +1,7 @@
 # Admin CLI Development Guide
 
 This guide covers how to develop new subcommands, and work on existing
-subcommands, in the `admin-cli` crate (for `carbide-admin-cli`).
+subcommands, in the `admin-cli` crate (for `nico-admin-cli`).
 
 ## Table of Contents
 
@@ -78,14 +78,14 @@ handler in `cmd.rs`.
 pub mod args;
 pub mod cmd;
 
-use crate::errors::CarbideCliResult;
+use crate::errors::NicoCliResult;
 pub use args::Args;
 
 use crate::cfg::run::Run;
 use crate::cfg::runtime::RuntimeContext;
 
 impl Run for Args {
-    async fn run(self, ctx: &mut RuntimeContext) -> CarbideCliResult<()> {
+    async fn run(self, ctx: &mut RuntimeContext) -> NicoCliResult<()> {
         cmd::show(&ctx.api_client, self, ctx.config.format).await
     }
 }
@@ -114,7 +114,7 @@ Contains the actual command handler. Receives parsed arguments and
 only the specific dependencies it needs (not the full RuntimeContext):
 
 ```rust
-use ::rpc::admin_cli::{CarbideCliResult, OutputFormat};
+use ::rpc::admin_cli::{NicoCliResult, OutputFormat};
 
 use super::args::Args;
 use crate::rpc::ApiClient;
@@ -123,7 +123,7 @@ pub async fn show(
     api_client: &ApiClient,
     args: Args,
     format: OutputFormat,
-) -> CarbideCliResult<()> {
+) -> NicoCliResult<()> {
     // Implementation here.
     Ok(())
 }
@@ -146,7 +146,7 @@ pub(crate) trait Dispatch {
     fn dispatch(
         self,
         ctx: RuntimeContext,
-    ) -> impl std::future::Future<Output = CarbideCliResult<()>>;
+    ) -> impl std::future::Future<Output = NicoCliResult<()>>;
 }
 ```
 
@@ -161,7 +161,7 @@ pub(crate) trait Run {
     fn run(
         self,
         ctx: &mut RuntimeContext,
-    ) -> impl std::future::Future<Output = CarbideCliResult<()>>;
+    ) -> impl std::future::Future<Output = NicoCliResult<()>>;
 }
 ```
 
@@ -236,7 +236,7 @@ those specific values to `cmd.rs`:
 
 ```rust
 impl Run for Args {
-    async fn run(self, ctx: &mut RuntimeContext) -> CarbideCliResult<()> {
+    async fn run(self, ctx: &mut RuntimeContext) -> NicoCliResult<()> {
         // Simple handler -- just needs API client.
         cmd::show(&ctx.api_client, self).await
 
@@ -289,12 +289,12 @@ Create `src/my_command/show/cmd.rs`:
 * ..etc etc.
 */
 
-use crate::errors::CarbideCliResult;
+use crate::errors::NicoCliResult;
 
 use super::args::Args;
 use crate::rpc::ApiClient;
 
-pub async fn show(api_client: &ApiClient, args: Args) -> CarbideCliResult<()> {
+pub async fn show(api_client: &ApiClient, args: Args) -> NicoCliResult<()> {
     // Your implementation here.
     Ok(())
 }
@@ -312,14 +312,14 @@ Create `src/my_command/show/mod.rs`:
 pub mod args;
 pub mod cmd;
 
-use crate::errors::CarbideCliResult;
+use crate::errors::NicoCliResult;
 pub use args::Args;
 
 use crate::cfg::run::Run;
 use crate::cfg::runtime::RuntimeContext;
 
 impl Run for Args {
-    async fn run(self, ctx: &mut RuntimeContext) -> CarbideCliResult<()> {
+    async fn run(self, ctx: &mut RuntimeContext) -> NicoCliResult<()> {
         cmd::show(&ctx.api_client, self).await
     }
 }
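
Reviewer sketch (not part of the commit): the `Run` pattern renamed throughout this file declares the trait method as returning `impl Future` and implements it with `async fn`. The std-only example below shows that shape compiling and running. `CliResult`, the `calls` field on `RuntimeContext`, `ShowArgs`, and the tiny `block_on` executor are all assumptions made so the example is self-contained; the real crate uses its own types and an async runtime.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical stand-ins for the crate's result alias and runtime context.
type CliResult<T> = Result<T, String>;

struct RuntimeContext {
    calls: Vec<String>, // records what the handler did, in place of an API client
}

// Same shape as the trait in the guide: the method returns `impl Future`.
trait Run {
    fn run(self, ctx: &mut RuntimeContext) -> impl Future<Output = CliResult<()>>;
}

struct ShowArgs {
    id: String,
}

// Handler receives only what it needs, not the full context.
async fn show(calls: &mut Vec<String>, args: ShowArgs) -> CliResult<()> {
    if args.id.is_empty() {
        return Err("id is required".to_string());
    }
    calls.push(format!("show {}", args.id));
    Ok(())
}

// An `async fn` satisfies a trait method declared as `-> impl Future`.
impl Run for ShowArgs {
    async fn run(self, ctx: &mut RuntimeContext) -> CliResult<()> {
        show(&mut ctx.calls, self).await
    }
}

// Minimal executor for this example only: polls in a loop with a no-op waker.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

fn main() {
    let mut ctx = RuntimeContext { calls: Vec::new() };
    let result = block_on(ShowArgs { id: "m1".to_string() }.run(&mut ctx));
    println!("{:?} {:?}", result, ctx.calls);
}
```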

crates/admin-cli/src/cfg/cli_options.rs

Lines changed: 6 additions & 6 deletions
@@ -48,23 +48,23 @@ pub struct CliOptions {
     )]
     pub cloud_unsafe_op: Option<String>,
 
-    #[clap(short, long, env = "CARBIDE_API_URL")]
+    #[clap(short, long, env = "API_URL", visible_alias = "carbide-url")]
     #[clap(
-        help = "Default to CARBIDE_API_URL environment variable or $HOME/.config/carbide_api_cli.json file or https://carbide-api.forge-system.svc.cluster.local:1079."
+        help = "Default to API_URL environment variable or $HOME/.config/carbide_api_cli.json file or https://carbide-api.forge-system.svc.cluster.local:1079."
     )]
-    pub carbide_api: Option<String>,
+    pub api_url: Option<String>,
 
     #[clap(short, long, value_enum, default_value = "ascii-table")]
     pub format: OutputFormat,
 
     #[clap(short, long)]
     pub output: Option<String>,
 
-    #[clap(long, env = "FORGE_ROOT_CA_PATH")]
+    #[clap(long, env = "ROOT_CA_PATH", visible_alias = "forge-root-ca-path")]
     #[clap(
-        help = "Default to FORGE_ROOT_CA_PATH environment variable or $HOME/.config/carbide_api_cli.json file."
+        help = "Default to ROOT_CA_PATH environment variable or $HOME/.config/carbide_api_cli.json file."
     )]
-    pub root_ca_path: Option<String>,
+    pub root_ca_path: Option<String>,
 
     #[clap(long, env = "CLIENT_CERT_PATH")]
     #[clap(
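
Reviewer sketch (not part of the commit): the help text in this hunk describes a fallback chain for the API URL: command-line value, then the `API_URL` environment variable, then the config file, then a built-in default. The std-only example below models that chain. `resolve_api_url` is a hypothetical helper, and the precedence of environment over config file is inferred from the order in the help string, not confirmed by the hunk. Note that `visible_alias` adds an alternate flag name only; whether the old `CARBIDE_API_URL` variable is still read elsewhere is not visible here.

```rust
// Default taken verbatim from the help text in the hunk.
const DEFAULT_API_URL: &str = "https://carbide-api.forge-system.svc.cluster.local:1079";

/// Picks the first value that is set, in the assumed order:
/// CLI flag, environment variable, config file, built-in default.
fn resolve_api_url(
    flag: Option<String>,
    env: Option<String>,
    config_file: Option<String>,
) -> String {
    flag.or(env)
        .or(config_file)
        .unwrap_or_else(|| DEFAULT_API_URL.to_string())
}

fn main() {
    // Reads the environment variable named in the new help text.
    let from_env = std::env::var("API_URL").ok();
    let url = resolve_api_url(None, from_env, None);
    println!("{url}");
}
```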

crates/admin-cli/src/component_manager/common.rs

Lines changed: 39 additions & 0 deletions
@@ -170,6 +170,45 @@ pub enum DeviceTargetArgs {
     Rack(RackTargetArgs),
 }
 
+/// Component-power-control target subset: no rack target since
+/// `ComponentPowerControlRequest` only supports machines, switches and
+/// power shelves.
+#[derive(Subcommand, Debug)]
+pub enum PowerControlTargetArgs {
+    #[clap(about = "Target NVLink switches")]
+    Switch(SwitchTargetArgs),
+
+    #[clap(about = "Target power shelves")]
+    PowerShelf(PowerShelfTargetArgs),
+
+    #[clap(about = "Target compute trays")]
+    ComputeTray(MachineTargetArgs),
+}
+
+#[derive(Copy, Clone, Debug, ValueEnum)]
+#[clap(rename_all = "kebab_case")]
+pub enum PowerActionArg {
+    On,
+    GracefulShutdown,
+    ForceOff,
+    GracefulRestart,
+    ForceRestart,
+    ACPowercycle,
+}
+
+impl From<PowerActionArg> for ::rpc::common::SystemPowerControl {
+    fn from(action: PowerActionArg) -> Self {
+        match action {
+            PowerActionArg::On => Self::On,
+            PowerActionArg::GracefulShutdown => Self::GracefulShutdown,
+            PowerActionArg::ForceOff => Self::ForceOff,
+            PowerActionArg::GracefulRestart => Self::GracefulRestart,
+            PowerActionArg::ForceRestart => Self::ForceRestart,
+            PowerActionArg::ACPowercycle => Self::AcPowercycle,
+        }
+    }
+}
+
 pub fn component_result_status_name(status: i32) -> &'static str {
     match rpc::forge::ComponentManagerStatusCode::try_from(status) {
         Ok(rpc::forge::ComponentManagerStatusCode::Success) => "success",
