Skip to content

feat: Implement Carbide provider for Switch and Powershelf component managers#256

Open
spydaNVIDIA wants to merge 1 commit intomainfrom
feat/core-provider
Open

feat: Implement Carbide provider for Switch and Powershelf component managers#256
spydaNVIDIA wants to merge 1 commit intomainfrom
feat/core-provider

Conversation

@spydaNVIDIA
Copy link
Contributor

Description

feat: implement carbide provider for switch and powershelf component managers

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in Github workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • RLA - RLA service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@spydaNVIDIA spydaNVIDIA requested a review from kunzhao-nv March 13, 2026 19:45
@github-actions
Copy link

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-03-13 19:46:02 UTC | Commit: 1e71ba7

@coderabbitai
Copy link

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

Walkthrough

This pull request extends the Carbide component management API with new protobuf RPC definitions for power control, inventory, and firmware operations across machines, switches, and power shelves. It implements a new Power Shelf component manager using Carbide, refactors the existing NVL Switch manager to use the new RPCs, and updates the gRPC client layer to support these operations.

Changes

Cohort / File(s) Summary
Protocol Buffer Definitions
rla/internal/carbideapi/carbideproto/common.proto, rla/internal/carbideapi/carbideproto/forge.proto
Adds SystemPowerControl enum with six control modes and introduces complete component management RPC surface (ComponentPowerControl, GetComponentInventory, UpdateComponentFirmware, GetComponentFirmwareStatus, ListComponentFirmwareVersions) with supporting message types for results, inventory entries, firmware status tracking, and per-component error handling.
gRPC Client Layer
rla/internal/carbideapi/grpc.go, rla/internal/carbideapi/mod.go, rla/internal/carbideapi/mock.go, rla/internal/carbideapi/model.go
Extends carbideapi Client interface with six new methods for component operations and adds AddExpectedPowerShelfRequest model. Implements transport credentials fallback to insecure mode when TLS certs are absent, and provides mock implementations for testing.
Component Manager Registration
rla/cmd/serve.go, rla/internal/task/componentmanager/config.go
Registers powershelfcarbide provider in component manager and updates production config to map PowerShelf components from PSM to Carbide provider.
NVL Switch Carbide Implementation
rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go, rla/internal/task/componentmanager/nvlswitch/carbide/carbide_test.go
Refactors PowerControl and firmware operations to use new ComponentPowerControl, GetComponentInventory, and GetComponentFirmwareStatus RPCs; introduces helper methods for protocol buffer conversions and firmware state mapping; includes unit tests for power control and firmware operations.
Power Shelf Carbide Implementation
rla/internal/task/componentmanager/powershelf/carbide/carbide.go, rla/internal/task/componentmanager/powershelf/carbide/carbide_test.go, rla/internal/task/componentmanager/providers/carbide/power.go
Introduces complete Manager implementation for Power Shelf component type with lifecycle management (InjectExpectation), power control, status queries, and firmware operations; adds utility functions for protocol buffer conversions and power state extraction; includes comprehensive test coverage.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Carbide's reach now extends so wide,
Power shelves join the component tide,
New RPCs make firmware dance,
Control flows through each granted chance,
Management unified, systems allied!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description check ✅ Passed The description is directly related to the changeset, identifying it as a feature implementation for carbide providers with appropriate service classification, testing approach, and change type documentation.
Title check ✅ Passed The title accurately describes the main change—implementing a Carbide provider for Switch and Powershelf component managers, which aligns with the substantial protobuf additions, new Manager implementations, and RPC integrations across multiple files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/core-provider
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can use TruffleHog to scan for secrets in your code with verification capabilities.

Add a TruffleHog config file (e.g. trufflehog-config.yml, trufflehog.yml) to your project to customize detectors and scanning behavior. The tool runs only when a config file is present.

@github-actions
Copy link

🛡️ Vulnerability Scan

🚨 Found 64 vulnerability(ies)
📊 vs main: 64 (no change)

Severity Breakdown:

  • 🔴 Critical/High: 64
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0

🔗 View full details in Security tab

🕐 Last updated: 2026-03-13 19:46:17 UTC | Commit: 1e71ba7

@github-actions
Copy link

github-actions bot commented Mar 13, 2026

Test Results

8 134 tests  +10   8 134 ✅ +10   8m 8s ⏱️ +7s
  140 suites + 1       0 💤 ± 0 
   14 files   ± 0       0 ❌ ± 0 

Results for commit 1e71ba7. ± Comparison against base commit 55c3a85.

♻️ This comment has been updated with latest results.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (5)
rla/internal/task/componentmanager/powershelf/carbide/carbide_test.go (1)

123-134: Strengthen firmware status validation in this test.

assert.NotNil(statuses) is too weak and can pass on empty output. Assert expected component IDs (and state after setup) to make this test meaningful.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/task/componentmanager/powershelf/carbide/carbide_test.go` around
lines 123 - 134, TestGetFirmwareUpdateStatus currently only checks statuses is
non-nil; strengthen it by asserting the returned slice contains the expected
component ID "ps-1" and the expected state set up by the mock. Update the test
after calling m.GetFirmwareUpdateStatus to assert len(statuses) > 0 (or exact
length if known), find the entry for component ID "ps-1" and assert its ID
equals "ps-1" and its State (or Status field name returned by
GetFirmwareUpdateStatus) equals the expected value from the mock; reference the
test function TestGetFirmwareUpdateStatus, the manager method
GetFirmwareUpdateStatus, the constructor New(carbideapi.NewMockClient()), and
the Target.ComponentIDs/ComponentTypePowerShelf to locate and implement these
assertions.
rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go (1)

277-284: Return deterministic status entries for all requested component IDs.

This map currently only includes IDs returned by Carbide. Prefill requested IDs with Unknown (as done in GetPowerStatus) to avoid silent missing entries.

♻️ Proposed refactor
-	result := make(map[string]operations.FirmwareUpdateStatus, len(resp.GetStatuses()))
+	result := make(map[string]operations.FirmwareUpdateStatus, len(target.ComponentIDs))
+	for _, id := range target.ComponentIDs {
+		result[id] = operations.FirmwareUpdateStatus{
+			ComponentID: id,
+			State:       operations.FirmwareUpdateStateUnknown,
+		}
+	}
 	for _, s := range resp.GetStatuses() {
 		compID := s.GetResult().GetComponentId()
+		if _, ok := result[compID]; !ok || compID == "" {
+			continue
+		}
 		result[compID] = operations.FirmwareUpdateStatus{
 			ComponentID: compID,
 			State:       mapFirmwareState(s.GetState()),
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go` around lines
277 - 284, The map currently only contains entries for component IDs returned by
Carbide (resp.GetStatuses()), causing missing entries for requested IDs; change
the logic that builds result (the block starting with result := make...) to
first prefill result for every requested component ID with
operations.FirmwareUpdateStatus{ComponentID: <id>, State:
mapFirmwareStateUnknown/Unknown} (matching the approach used in GetPowerStatus),
then iterate resp.GetStatuses() and overwrite those entries with the real
statuses using s.GetResult().GetComponentId() and
mapFirmwareState(s.GetState()); ensure you reference and populate the same map
(result) and the same struct type operations.FirmwareUpdateStatus so callers
always see a deterministic entry for every requested component.
rla/internal/task/componentmanager/nvlswitch/carbide/carbide_test.go (1)

122-133: Strengthen firmware status assertions to catch mapping/coverage regressions.

assert.NotNil(statuses) can pass even when the map is empty. Consider asserting expected component IDs (and state after setup) so this test guards actual behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide_test.go` around
lines 122 - 133, The test TestGetFirmwareUpdateStatus currently only asserts
statuses is not nil which can hide regressions; update it to assert the map
contains the expected component ID(s) and the expected state(s) provided by the
mock. After calling New(carbideapi.NewMockClient()) and
m.GetFirmwareUpdateStatus(...), assert that len(statuses) ==
len(target.ComponentIDs), that statuses contains the key target.ComponentIDs[0]
(e.g., "switch-1"), and that statuses["switch-1"].State (or the appropriate
field on the status struct returned by GetFirmwareUpdateStatus) equals the
expected value defined by the carbideapi.NewMockClient() setup so the test fails
on mapping/coverage regressions.
rla/internal/carbideapi/carbideproto/common.proto (1)

106-113: Avoid using _ON as zero enum value; consider _UNSPECIFIED instead.

Line 107 sets SYSTEM_POWER_CONTROL_ON = 0. In proto3, unset enum fields default to zero, so an omitted action field would unintentionally default to power-on. While current code paths explicitly set the action value, this design creates a safety risk for future refactoring or external API consumers.

🐛 Proposed fix
 enum SystemPowerControl {
-  SYSTEM_POWER_CONTROL_ON = 0;
-  SYSTEM_POWER_CONTROL_GRACEFUL_SHUTDOWN = 1;
-  SYSTEM_POWER_CONTROL_FORCE_OFF = 2;
-  SYSTEM_POWER_CONTROL_GRACEFUL_RESTART = 3;
-  SYSTEM_POWER_CONTROL_FORCE_RESTART = 4;
-  SYSTEM_POWER_CONTROL_AC_POWERCYCLE = 5;
+  SYSTEM_POWER_CONTROL_UNSPECIFIED = 0;
+  SYSTEM_POWER_CONTROL_ON = 1;
+  SYSTEM_POWER_CONTROL_GRACEFUL_SHUTDOWN = 2;
+  SYSTEM_POWER_CONTROL_FORCE_OFF = 3;
+  SYSTEM_POWER_CONTROL_GRACEFUL_RESTART = 4;
+  SYSTEM_POWER_CONTROL_FORCE_RESTART = 5;
+  SYSTEM_POWER_CONTROL_AC_POWERCYCLE = 6;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/carbideapi/carbideproto/common.proto` around lines 106 - 113,
The enum SystemPowerControl currently has SYSTEM_POWER_CONTROL_ON = 0 which
makes omitted fields default to “ON”; change the zero value to a clearly
unspecified sentinel (e.g., SYSTEM_POWER_CONTROL_UNSPECIFIED = 0), then renumber
the other values so SYSTEM_POWER_CONTROL_ON is a non-zero value (and keep the
existing symbolic names: SYSTEM_POWER_CONTROL_GRACEFUL_SHUTDOWN,
SYSTEM_POWER_CONTROL_FORCE_OFF, etc.); update any call sites or switch/case
logic that rely on the numeric ordering of SystemPowerControl to handle the new
UNSPECIFIED value explicitly.
rla/internal/task/componentmanager/powershelf/carbide/carbide.go (1)

90-92: Apply a consistent nil-client guard across all Carbide-backed methods.

Only InjectExpectation validates m.carbideClient; other methods assume it is set. A shared guard/helper would prevent inconsistent failure modes.

Also applies to: 141-142, 171-172, 220-221, 256-257

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/task/componentmanager/powershelf/carbide/carbide.go` around
lines 90 - 92, Add a single nil-guard helper on the Carbide manager and call it
from every method that uses m.carbideClient: implement a method like
ensureCarbideClient() error on the receiver that returns fmt.Errorf("carbide
client is not configured") when m.carbideClient == nil, replace the inline nil
check in InjectExpectation with a call to ensureCarbideClient(), and add that
call at the top of every other Carbide-backed method in this file (the methods
that currently access m.carbideClient—e.g., the ones around the original checks
at the other comment locations) so all paths consistently validate the client
before use.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rla/internal/carbideapi/carbideproto/forge.proto`:
- Around line 6565-6568: The response message
ListComponentFirmwareVersionsResponse is ambiguous because it only contains a
single ComponentResult while the request can target multiple component IDs;
change the schema to represent per-component results (either replace the
singular ComponentResult field with repeated ComponentResult results or use a
map<string, ComponentResult> results keyed by component_id) and move the
repeated string versions into that per-component result so each component has
its own success/failure and version list (refer to
ListComponentFirmwareVersionsResponse and ComponentResult and versions for where
to apply the change).

In `@rla/internal/carbideapi/grpc.go`:
- Around line 63-66: The carbide-api client currently falls back to plaintext
when certs.ErrNotPresent; change this to fail-closed unless an explicit opt-in
env var is set. In the grpc client initialization (the block handling
certs.ErrNotPresent and dialOpt /
grpc.WithTransportCredentials(insecure.NewCredentials())), replace the
unconditional insecure fallback with a check for an explicit env var (e.g.,
CARBIDE_API_ALLOW_INSECURE); if the var is not set, return/propagate an error
instead of setting insecure credentials and log a clear error; if the var is
set, log an explicit warning and then set
grpc.WithTransportCredentials(insecure.NewCredentials()) as before so behavior
matches PSM/NSM clients. Ensure you reference certs.ErrNotPresent,
grpc.WithTransportCredentials, insecure.NewCredentials(), and the grpc dial path
where dialOpt is set.

In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go`:
- Around line 293-295: The mapping for
pb.FirmwareUpdateState_FW_STATE_IN_PROGRESS incorrectly returns
operations.FirmwareUpdateStateQueued; change it to return the proper in-progress
internal state (e.g., operations.FirmwareUpdateStateInProgress) so
pb.FirmwareUpdateState_FW_STATE_IN_PROGRESS is represented as in-progress rather
than queued in the function that maps firmware states.
- Around line 203-209: The FirmwareControl implementation in Manager currently
returns an error and breaks NVLSwitch firmware workflows; replace that error
with a delegation to the existing Carbide firmware updater by calling the
StartFirmwareUpdate logic from Manager (i.e., have Manager.FirmwareControl
invoke m.StartFirmwareUpdate with the same context/target/info or adapt
parameters to match StartFirmwareUpdate's signature) so NVLSwitch firmware
control requests are handled by the active Carbide flow; alternatively, if you
prefer an API-level fix, remove FirmwareControl from the ComponentManager
interface so callers are forced to use StartFirmwareUpdate, but the immediate
fix is to delegate Manager.FirmwareControl -> m.StartFirmwareUpdate.

In `@rla/internal/task/componentmanager/powershelf/carbide/carbide.go`:
- Around line 146-150: The loop over resp.GetResults() only checks statuses and
thus treats missing results as success; instead validate that every requested
target has a corresponding result and that no requested target is missing or
failed. Modify the code around resp.GetResults() to (1) capture the slice of
requested component IDs used for this request (the variable passed into the
Carbide RPC call), (2) build a map from r.GetComponentId() -> result for
resp.GetResults(), (3) verify the map contains an entry for every requested ID
and that each entry has r.GetStatus() ==
pb.ComponentManagerStatusCode_COMPONENT_MANAGER_STATUS_CODE_SUCCESS (returning
an error if any are missing or non-success), and apply the same change to the
other loop referenced (the block around lines 225-229) to ensure per-target
coverage rather than just non-error statuses.
- Around line 261-268: The current code only populates result from
resp.GetStatuses(), dropping any requested component IDs that Carbide omitted;
pre-seed the result map with an entry for each requested component ID (e.g.
iterate req.GetComponentIds() or the variable holding the requested IDs) setting
operations.FirmwareUpdateStatus{ComponentID: id, State:
operations.FirmwareStateUnknown} (or the equivalent "Unknown" state) and then
loop resp.GetStatuses() as now to overwrite those entries using
s.GetResult().GetComponentId() and mapFirmwareState(s.GetState()) so omitted
components remain in result with Unknown state.

---

Nitpick comments:
In `@rla/internal/carbideapi/carbideproto/common.proto`:
- Around line 106-113: The enum SystemPowerControl currently has
SYSTEM_POWER_CONTROL_ON = 0 which makes omitted fields default to “ON”; change
the zero value to a clearly unspecified sentinel (e.g.,
SYSTEM_POWER_CONTROL_UNSPECIFIED = 0), then renumber the other values so
SYSTEM_POWER_CONTROL_ON is a non-zero value (and keep the existing symbolic
names: SYSTEM_POWER_CONTROL_GRACEFUL_SHUTDOWN, SYSTEM_POWER_CONTROL_FORCE_OFF,
etc.); update any call sites or switch/case logic that rely on the numeric
ordering of SystemPowerControl to handle the new UNSPECIFIED value explicitly.

In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide_test.go`:
- Around line 122-133: The test TestGetFirmwareUpdateStatus currently only
asserts statuses is not nil which can hide regressions; update it to assert the
map contains the expected component ID(s) and the expected state(s) provided by
the mock. After calling New(carbideapi.NewMockClient()) and
m.GetFirmwareUpdateStatus(...), assert that len(statuses) ==
len(target.ComponentIDs), that statuses contains the key target.ComponentIDs[0]
(e.g., "switch-1"), and that statuses["switch-1"].State (or the appropriate
field on the status struct returned by GetFirmwareUpdateStatus) equals the
expected value defined by the carbideapi.NewMockClient() setup so the test fails
on mapping/coverage regressions.

In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go`:
- Around line 277-284: The map currently only contains entries for component IDs
returned by Carbide (resp.GetStatuses()), causing missing entries for requested
IDs; change the logic that builds result (the block starting with result :=
make...) to first prefill result for every requested component ID with
operations.FirmwareUpdateStatus{ComponentID: <id>, State:
mapFirmwareStateUnknown/Unknown} (matching the approach used in GetPowerStatus),
then iterate resp.GetStatuses() and overwrite those entries with the real
statuses using s.GetResult().GetComponentId() and
mapFirmwareState(s.GetState()); ensure you reference and populate the same map
(result) and the same struct type operations.FirmwareUpdateStatus so callers
always see a deterministic entry for every requested component.

In `@rla/internal/task/componentmanager/powershelf/carbide/carbide_test.go`:
- Around line 123-134: TestGetFirmwareUpdateStatus currently only checks
statuses is non-nil; strengthen it by asserting the returned slice contains the
expected component ID "ps-1" and the expected state set up by the mock. Update
the test after calling m.GetFirmwareUpdateStatus to assert len(statuses) > 0 (or
exact length if known), find the entry for component ID "ps-1" and assert its ID
equals "ps-1" and its State (or Status field name returned by
GetFirmwareUpdateStatus) equals the expected value from the mock; reference the
test function TestGetFirmwareUpdateStatus, the manager method
GetFirmwareUpdateStatus, the constructor New(carbideapi.NewMockClient()), and
the Target.ComponentIDs/ComponentTypePowerShelf to locate and implement these
assertions.

In `@rla/internal/task/componentmanager/powershelf/carbide/carbide.go`:
- Around line 90-92: Add a single nil-guard helper on the Carbide manager and
call it from every method that uses m.carbideClient: implement a method like
ensureCarbideClient() error on the receiver that returns fmt.Errorf("carbide
client is not configured") when m.carbideClient == nil, replace the inline nil
check in InjectExpectation with a call to ensureCarbideClient(), and add that
call at the top of every other Carbide-backed method in this file (the methods
that currently access m.carbideClient—e.g., the ones around the original checks
at the other comment locations) so all paths consistently validate the client
before use.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 892a309a-1ba6-4128-bdb1-c11f55494c39

📥 Commits

Reviewing files that changed from the base of the PR and between 55c3a85 and 1e71ba7.

⛔ Files ignored due to path filters (3)
  • rla/internal/carbideapi/gen/common.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • rla/internal/carbideapi/gen/forge.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • rla/internal/carbideapi/gen/forge_grpc.pb.go is excluded by !**/*.pb.go, !**/gen/**
📒 Files selected for processing (13)
  • rla/cmd/serve.go
  • rla/internal/carbideapi/carbideproto/common.proto
  • rla/internal/carbideapi/carbideproto/forge.proto
  • rla/internal/carbideapi/grpc.go
  • rla/internal/carbideapi/mock.go
  • rla/internal/carbideapi/mod.go
  • rla/internal/carbideapi/model.go
  • rla/internal/task/componentmanager/config.go
  • rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go
  • rla/internal/task/componentmanager/nvlswitch/carbide/carbide_test.go
  • rla/internal/task/componentmanager/powershelf/carbide/carbide.go
  • rla/internal/task/componentmanager/powershelf/carbide/carbide_test.go
  • rla/internal/task/componentmanager/providers/carbide/power.go

Comment on lines +6565 to +6568
message ListComponentFirmwareVersionsResponse {
ComponentResult result = 1;
repeated string versions = 2;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

ListComponentFirmwareVersionsResponse is ambiguous for multi-target requests.

The request supports multiple component IDs, but the response has only one ComponentResult. This cannot represent per-component success/failure or per-component version sets.

📐 Suggested schema shape
+message ComponentFirmwareVersionsEntry {
+  ComponentResult result = 1;
+  repeated string versions = 2;
+}
+
 message ListComponentFirmwareVersionsResponse {
-  ComponentResult result = 1;
-  repeated string versions = 2;
+  repeated ComponentFirmwareVersionsEntry entries = 1;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/carbideapi/carbideproto/forge.proto` around lines 6565 - 6568,
The response message ListComponentFirmwareVersionsResponse is ambiguous because
it only contains a single ComponentResult while the request can target multiple
component IDs; change the schema to represent per-component results (either
replace the singular ComponentResult field with repeated ComponentResult results
or use a map<string, ComponentResult> results keyed by component_id) and move
the repeated string versions into that per-component result so each component
has its own success/failure and version list (refer to
ListComponentFirmwareVersionsResponse and ComponentResult and versions for where
to apply the change).

Comment on lines 63 to +66
if err == certs.ErrNotPresent {
return nil, errors.New("Certificates not present, unable to authenticate with carbide-api")
log.Warn().Msg("Certs not present, using plaintext (insecure) connection to carbide-api")
dialOpt = grpc.WithTransportCredentials(insecure.NewCredentials())
} else {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n 'ErrNotPresent|insecure.NewCredentials|CARBIDE_API_ALLOW_INSECURE' rla/internal/carbideapi/grpc.go
rg -n 'ALLOW.*INSECURE|insecure.NewCredentials' --type=go

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 1252


🏁 Script executed:

cat -n rla/internal/carbideapi/grpc.go | sed -n '50,75p'

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 1064


🏁 Script executed:

rg -n 'NewClient' rla/internal/carbideapi/ --type=go -A 2 -B 2

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 1382


🏁 Script executed:

rg -n 'insecure|plaintext|INSECURE' rla/internal/carbideapi/ --type=go -B 2 -A 2

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 872


🏁 Script executed:

rg -l 'ErrNotPresent' --type=go | head -5

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 240


🏁 Script executed:

rg -n 'ErrNotPresent' --type=go -B 3 -A 3

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 7732


🏁 Script executed:

rg -n 'carbide-api|carbide_api' rla/internal/carbideapi/ --type=go -B 2 -A 2 | head -40

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 2804


Align carbide-api client with fail-closed security posture used by PSM and NSM clients.

Carbide-api currently falls back to plaintext when certificates are absent (line 65), whereas similar clients (PSM, NSM) fail with an error instead. Since carbide-api handles management operations (power states, firmware updates, network interfaces), plaintext transport weakens security. Require explicit environment variable opt-in before allowing insecure fallback, mirroring how TLS failures are handled in other components.

Suggested change
  tlsConfig, _, err := certs.TLSConfig()
  if err != nil {
    if err == certs.ErrNotPresent {
-     log.Warn().Msg("Certs not present, using plaintext (insecure) connection to carbide-api")
-     dialOpt = grpc.WithTransportCredentials(insecure.NewCredentials())
+     if os.Getenv("CARBIDE_API_ALLOW_INSECURE") != "true" {
+       return nil, errors.New("TLS certs not present; refusing insecure carbide-api connection")
+     }
+     log.Warn().Msg("CARBIDE_API_ALLOW_INSECURE=true, using plaintext connection to carbide-api")
+     dialOpt = grpc.WithTransportCredentials(insecure.NewCredentials())
    } else {
      return nil, err
    }
  } else {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if err == certs.ErrNotPresent {
return nil, errors.New("Certificates not present, unable to authenticate with carbide-api")
log.Warn().Msg("Certs not present, using plaintext (insecure) connection to carbide-api")
dialOpt = grpc.WithTransportCredentials(insecure.NewCredentials())
} else {
if err == certs.ErrNotPresent {
if os.Getenv("CARBIDE_API_ALLOW_INSECURE") != "true" {
return nil, errors.New("TLS certs not present; refusing insecure carbide-api connection")
}
log.Warn().Msg("CARBIDE_API_ALLOW_INSECURE=true, using plaintext connection to carbide-api")
dialOpt = grpc.WithTransportCredentials(insecure.NewCredentials())
} else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/carbideapi/grpc.go` around lines 63 - 66, The carbide-api client
currently falls back to plaintext when certs.ErrNotPresent; change this to
fail-closed unless an explicit opt-in env var is set. In the grpc client
initialization (the block handling certs.ErrNotPresent and dialOpt /
grpc.WithTransportCredentials(insecure.NewCredentials())), replace the
unconditional insecure fallback with a check for an explicit env var (e.g.,
CARBIDE_API_ALLOW_INSECURE); if the var is not set, return/propagate an error
instead of setting insecure credentials and log a clear error; if the var is
set, log an explicit warning and then set
grpc.WithTransportCredentials(insecure.NewCredentials()) as before so behavior
matches PSM/NSM clients. Ensure you reference certs.ErrNotPresent,
grpc.WithTransportCredentials, insecure.NewCredentials(), and the grpc dial path
where dialOpt is set.

Comment on lines 203 to 209
func (m *Manager) FirmwareControl(
ctx context.Context,
target common.Target,
info operations.FirmwareControlTaskInfo,
) error {
// TODO: Implement firmware control
switch info.Operation {
case operations.FirmwareOperationUpgrade:
// Implement firmware upgrade
return fmt.Errorf("firmware upgrade not yet implemented for NVLSwitch")
case operations.FirmwareOperationDowngrade:
// Implement firmware downgrade
return fmt.Errorf("firmware downgrade not yet implemented for NVLSwitch")
case operations.FirmwareOperationRollback:
// Implement firmware rollback
return fmt.Errorf("firmware rollback not yet implemented for NVLSwitch")
case operations.FirmwareOperationVersion:
// Implement firmware version
return fmt.Errorf("firmware version not yet implemented for NVLSwitch")
default:
return fmt.Errorf("unknown firmware operation: %v", info.Operation)
}
return fmt.Errorf("FirmwareControl not yet implemented for NVLSwitch via Carbide; use StartFirmwareUpdate")
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find firmware control call paths and task routing.
rg -n --type go -C2 '\.FirmwareControl\s*\('
rg -n --type go -C2 'TaskTypeFirmwareControl|StartFirmwareUpdate|GetFirmwareUpdateStatus'

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 50386


🏁 Script executed:

# Check context around the FirmwareControl call in manager.go
sed -n '515,535p' rla/internal/task/manager/manager.go

# Check context around the FirmwareControl call in activity.go
sed -n '80,95p' rla/internal/task/executor/temporalworkflow/activity/activity.go

# Check the executor interface/implementation
rg -n --type go 'type.*Executor.*interface|func.*FirmwareControl' rla/internal/task/executor/ -A 3 -B 1 | head -100

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 8630


🏁 Script executed:

# Find the validAndGetComponentManager function
rg -n --type go 'func validAndGetComponentManager' -A 20

# Also check what component managers exist and how targets are routed
rg -n --type go 'componentmanager|ComponentManager' rla/internal/task/executor/temporalworkflow/activity/activity.go -B 5 -A 5 | head -80

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 4305


🏁 Script executed:

# Find the GetComponentManager function
rg -n --type go 'func GetComponentManager' -A 30

# Also check the target types
rg -n --type go 'type.*TargetType|TargetTypeNVLSwitch' -B 2 -A 2 | head -50

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 790


🏁 Script executed:

# Find registry initialization
rg -n --type go 'cmRegistry\s*=' -B 3 -A 3

# Find where NVLSwitch component manager is registered
rg -n --type go -i 'nvlswitch|carbide' rla/internal/task/executor/temporalworkflow/activity/ -B 2 -A 2

# Find the registry definition
rg -n --type go 'type.*registry|ComponentManagerRegistry' -A 15 | head -80

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 9125


🏁 Script executed:

# Find where nvlswitchcarbide is defined and check its Register function
fd -t f 'carbide.go|register.go' rla/internal/task/componentmanager/nvlswitch/carbide/ -x cat {}

# Check what the Register function does
rg -n --type go 'func Register' rla/internal/task/componentmanager/ -A 10 | grep -A 10 'nvlswitch'

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 14080


🏁 Script executed:

# Find the Registry implementation to understand how it selects managers
rg -n --type go 'type Registry struct' -A 30 rla/internal/task/componentmanager/

# Also check the GetManager method
rg -n --type go 'func.*GetManager' -A 15 rla/internal/task/componentmanager/

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 4358


🏁 Script executed:

# Check how the active manager is selected during initialization
rg -n --type go 'SetActiveManager|active\[' -B 3 -A 3 rla/internal/task/componentmanager/componentmanager.go

# Check the Initialize method in the Registry
rg -n --type go 'func.*Initialize' -A 30 rla/internal/task/componentmanager/componentmanager.go

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 1533


🏁 Script executed:

# Check the config to see which implementations are selected as default
rg -n --type go 'ComponentManagers' -B 5 -A 5 rla/cmd/serve.go | head -60

# Also check for any configuration files or defaults
fd -t f 'config|Config' rla/cmd/ -x grep -l 'ComponentManagers\|nvlswitch\|carbide' {}

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 56


🏁 Script executed:

# Check the serve.go file for where cmConfig is defined and used
sed -n '200,250p' rla/cmd/serve.go

# Look for where ComponentManagers is populated in the config
rg -n --type go 'ComponentManagers\s*:' -B 5 -A 5

Repository: NVIDIA/bare-metal-manager-rest

Length of output: 4630


FirmwareControl will break active NVLSwitch firmware workflows.

This method is called by the firmware control activity (rla/internal/task/executor/temporalworkflow/activity/activity.go:90) for any NVLSwitch target. In production, the Carbide implementation is the active manager for NVLSwitch (rla/internal/task/componentmanager/config.go:242), so existing firmware control workflows will now fail with the error returned from this method. Use StartFirmwareUpdate instead, or remove this method from the ComponentManager interface to force callers to route through the correct method.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go` around lines
203 - 209, The FirmwareControl implementation in Manager currently returns an
error and breaks NVLSwitch firmware workflows; replace that error with a
delegation to the existing Carbide firmware updater by calling the
StartFirmwareUpdate logic from Manager (i.e., have Manager.FirmwareControl
invoke m.StartFirmwareUpdate with the same context/target/info or adapt
parameters to match StartFirmwareUpdate's signature) so NVLSwitch firmware
control requests are handled by the active Carbide flow; alternatively, if you
prefer an API-level fix, remove FirmwareControl from the ComponentManager
interface so callers are forced to use StartFirmwareUpdate, but the immediate
fix is to delegate Manager.FirmwareControl -> m.StartFirmwareUpdate.

Comment on lines +293 to +295
case pb.FirmwareUpdateState_FW_STATE_IN_PROGRESS:
return operations.FirmwareUpdateStateQueued
case pb.FirmwareUpdateState_FW_STATE_VERIFYING:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Map FW_STATE_IN_PROGRESS to an in-progress internal state (not queued).

Line 294 currently downgrades an active update to queued, which misreports progress and can break workflow decisions.

🐛 Proposed fix
 	case pb.FirmwareUpdateState_FW_STATE_IN_PROGRESS:
-		return operations.FirmwareUpdateStateQueued
+		return operations.FirmwareUpdateStateInProgress
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go` around lines
293 - 295, The mapping for pb.FirmwareUpdateState_FW_STATE_IN_PROGRESS
incorrectly returns operations.FirmwareUpdateStateQueued; change it to return
the proper in-progress internal state (e.g.,
operations.FirmwareUpdateStateInProgress) so
pb.FirmwareUpdateState_FW_STATE_IN_PROGRESS is represented as in-progress rather
than queued in the function that maps firmware states.

Comment on lines +146 to +150
for _, r := range resp.GetResults() {
if r.GetStatus() != pb.ComponentManagerStatusCode_COMPONENT_MANAGER_STATUS_CODE_SUCCESS {
return fmt.Errorf("power control failed for %s: %s", r.GetComponentId(), r.GetError())
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate per-target result coverage, not just non-error statuses.

Both loops accept empty/partial result sets as success. If Carbide omits one requested component, the operation is reported successful incorrectly.

✅ Suggested validation helper
+func ensureAllTargetsSucceeded(targetIDs []string, results []*pb.ComponentResult, op string) error {
+	expected := make(map[string]struct{}, len(targetIDs))
+	for _, id := range targetIDs {
+		expected[id] = struct{}{}
+	}
+	seen := make(map[string]struct{}, len(results))
+	for _, r := range results {
+		id := r.GetComponentId()
+		if _, ok := expected[id]; !ok {
+			return fmt.Errorf("%s returned unexpected component result: %s", op, id)
+		}
+		seen[id] = struct{}{}
+		if r.GetStatus() != pb.ComponentManagerStatusCode_COMPONENT_MANAGER_STATUS_CODE_SUCCESS {
+			return fmt.Errorf("%s failed for %s: %s", op, id, r.GetError())
+		}
+	}
+	for id := range expected {
+		if _, ok := seen[id]; !ok {
+			return fmt.Errorf("%s missing result for %s", op, id)
+		}
+	}
+	return nil
+}
@@
-	for _, r := range resp.GetResults() {
-		if r.GetStatus() != pb.ComponentManagerStatusCode_COMPONENT_MANAGER_STATUS_CODE_SUCCESS {
-			return fmt.Errorf("power control failed for %s: %s", r.GetComponentId(), r.GetError())
-		}
-	}
+	if err := ensureAllTargetsSucceeded(target.ComponentIDs, resp.GetResults(), "power control"); err != nil {
+		return err
+	}
@@
-	for _, r := range resp.GetResults() {
-		if r.GetStatus() != pb.ComponentManagerStatusCode_COMPONENT_MANAGER_STATUS_CODE_SUCCESS {
-			return fmt.Errorf("firmware update failed for %s: %s", r.GetComponentId(), r.GetError())
-		}
-	}
+	if err := ensureAllTargetsSucceeded(target.ComponentIDs, resp.GetResults(), "firmware update"); err != nil {
+		return err
+	}

Also applies to: 225-229

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/task/componentmanager/powershelf/carbide/carbide.go` around
lines 146 - 150, The loop over resp.GetResults() only checks statuses and thus
treats missing results as success; instead validate that every requested target
has a corresponding result and that no requested target is missing or failed.
Modify the code around resp.GetResults() to (1) capture the slice of requested
component IDs used for this request (the variable passed into the Carbide RPC
call), (2) build a map from r.GetComponentId() -> result for resp.GetResults(),
(3) verify the map contains an entry for every requested ID and that each entry
has r.GetStatus() ==
pb.ComponentManagerStatusCode_COMPONENT_MANAGER_STATUS_CODE_SUCCESS (returning
an error if any are missing or non-success), and apply the same change to the
other loop referenced (the block around lines 225-229) to ensure per-target
coverage rather than just non-error statuses.

Comment on lines +261 to +268
result := make(map[string]operations.FirmwareUpdateStatus, len(resp.GetStatuses()))
for _, s := range resp.GetStatuses() {
compID := s.GetResult().GetComponentId()
result[compID] = operations.FirmwareUpdateStatus{
ComponentID: compID,
State: mapFirmwareState(s.GetState()),
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Return status entries for all requested component IDs.

This map is built only from response rows. If Carbide omits a requested component, that component disappears from the result entirely rather than showing Unknown.

🧭 Suggested fix
-	result := make(map[string]operations.FirmwareUpdateStatus, len(resp.GetStatuses()))
+	result := make(map[string]operations.FirmwareUpdateStatus, len(target.ComponentIDs))
+	for _, id := range target.ComponentIDs {
+		result[id] = operations.FirmwareUpdateStatus{
+			ComponentID: id,
+			State:       operations.FirmwareUpdateStateUnknown,
+		}
+	}
 	for _, s := range resp.GetStatuses() {
 		compID := s.GetResult().GetComponentId()
-		result[compID] = operations.FirmwareUpdateStatus{
+		if _, ok := result[compID]; !ok {
+			continue // or return an error for unexpected component IDs
+		}
+		result[compID] = operations.FirmwareUpdateStatus{
 			ComponentID: compID,
 			State:       mapFirmwareState(s.GetState()),
 		}
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
result := make(map[string]operations.FirmwareUpdateStatus, len(resp.GetStatuses()))
for _, s := range resp.GetStatuses() {
compID := s.GetResult().GetComponentId()
result[compID] = operations.FirmwareUpdateStatus{
ComponentID: compID,
State: mapFirmwareState(s.GetState()),
}
}
result := make(map[string]operations.FirmwareUpdateStatus, len(target.ComponentIDs))
for _, id := range target.ComponentIDs {
result[id] = operations.FirmwareUpdateStatus{
ComponentID: id,
State: operations.FirmwareUpdateStateUnknown,
}
}
for _, s := range resp.GetStatuses() {
compID := s.GetResult().GetComponentId()
if _, ok := result[compID]; !ok {
continue // or return an error for unexpected component IDs
}
result[compID] = operations.FirmwareUpdateStatus{
ComponentID: compID,
State: mapFirmwareState(s.GetState()),
}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rla/internal/task/componentmanager/powershelf/carbide/carbide.go` around
lines 261 - 268, The current code only populates result from resp.GetStatuses(),
dropping any requested component IDs that Carbide omitted; pre-seed the result
map with an entry for each requested component ID (e.g. iterate
req.GetComponentIds() or the variable holding the requested IDs) setting
operations.FirmwareUpdateStatus{ComponentID: id, State:
operations.FirmwareStateUnknown} (or the equivalent "Unknown" state) and then
loop resp.GetStatuses() as now to overwrite those entries using
s.GetResult().GetComponentId() and mapFirmwareState(s.GetState()) so omitted
components remain in result with Unknown state.

@thossain-nv thossain-nv changed the title feat: implement carbide provider for switch and powershelf component managers feat: Implement Carbide provider for Switch and Powershelf component managers Mar 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant