feat: Implement Carbide provider for Switch and Powershelf component managers#256
feat: Implement Carbide provider for Switch and Powershelf component managers#256spydaNVIDIA wants to merge 1 commit intomainfrom
Conversation
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-03-13 19:46:02 UTC | Commit: 1e71ba7 |
📝 WalkthroughWalkthroughThis pull request extends the Carbide component management API with new protobuf RPC definitions for power control, inventory, and firmware operations across machines, switches, and power shelves. It implements a new Power Shelf component manager using Carbide, refactors the existing NVL Switch manager to use the new RPCs, and updates the gRPC client layer to support these operations. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Comment Tip CodeRabbit can use TruffleHog to scan for secrets in your code with verification capabilities.Add a TruffleHog config file (e.g. trufflehog-config.yml, trufflehog.yml) to your project to customize detectors and scanning behavior. The tool runs only when a config file is present. |
🛡️ Vulnerability Scan🚨 Found 64 vulnerability(ies) Severity Breakdown:
🔗 View full details in Security tab 🕐 Last updated: 2026-03-13 19:46:17 UTC | Commit: 1e71ba7 |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (5)
rla/internal/task/componentmanager/powershelf/carbide/carbide_test.go (1)
123-134: Strengthen firmware status validation in this test.
assert.NotNil(statuses)is too weak and can pass on empty output. Assert expected component IDs (and state after setup) to make this test meaningful.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rla/internal/task/componentmanager/powershelf/carbide/carbide_test.go` around lines 123 - 134, TestGetFirmwareUpdateStatus currently only checks statuses is non-nil; strengthen it by asserting the returned slice contains the expected component ID "ps-1" and the expected state set up by the mock. Update the test after calling m.GetFirmwareUpdateStatus to assert len(statuses) > 0 (or exact length if known), find the entry for component ID "ps-1" and assert its ID equals "ps-1" and its State (or Status field name returned by GetFirmwareUpdateStatus) equals the expected value from the mock; reference the test function TestGetFirmwareUpdateStatus, the manager method GetFirmwareUpdateStatus, the constructor New(carbideapi.NewMockClient()), and the Target.ComponentIDs/ComponentTypePowerShelf to locate and implement these assertions.rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go (1)
277-284: Return deterministic status entries for all requested component IDs.This map currently only includes IDs returned by Carbide. Prefill requested IDs with
Unknown(as done inGetPowerStatus) to avoid silent missing entries.♻️ Proposed refactor
- result := make(map[string]operations.FirmwareUpdateStatus, len(resp.GetStatuses())) + result := make(map[string]operations.FirmwareUpdateStatus, len(target.ComponentIDs)) + for _, id := range target.ComponentIDs { + result[id] = operations.FirmwareUpdateStatus{ + ComponentID: id, + State: operations.FirmwareUpdateStateUnknown, + } + } for _, s := range resp.GetStatuses() { compID := s.GetResult().GetComponentId() + if _, ok := result[compID]; !ok || compID == "" { + continue + } result[compID] = operations.FirmwareUpdateStatus{ ComponentID: compID, State: mapFirmwareState(s.GetState()), } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go` around lines 277 - 284, The map currently only contains entries for component IDs returned by Carbide (resp.GetStatuses()), causing missing entries for requested IDs; change the logic that builds result (the block starting with result := make...) to first prefill result for every requested component ID with operations.FirmwareUpdateStatus{ComponentID: <id>, State: mapFirmwareStateUnknown/Unknown} (matching the approach used in GetPowerStatus), then iterate resp.GetStatuses() and overwrite those entries with the real statuses using s.GetResult().GetComponentId() and mapFirmwareState(s.GetState()); ensure you reference and populate the same map (result) and the same struct type operations.FirmwareUpdateStatus so callers always see a deterministic entry for every requested component.rla/internal/task/componentmanager/nvlswitch/carbide/carbide_test.go (1)
122-133: Strengthen firmware status assertions to catch mapping/coverage regressions.
assert.NotNil(statuses)can pass even when the map is empty. Consider asserting expected component IDs (and state after setup) so this test guards actual behavior.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide_test.go` around lines 122 - 133, The test TestGetFirmwareUpdateStatus currently only asserts statuses is not nil which can hide regressions; update it to assert the map contains the expected component ID(s) and the expected state(s) provided by the mock. After calling New(carbideapi.NewMockClient()) and m.GetFirmwareUpdateStatus(...), assert that len(statuses) == len(target.ComponentIDs), that statuses contains the key target.ComponentIDs[0] (e.g., "switch-1"), and that statuses["switch-1"].State (or the appropriate field on the status struct returned by GetFirmwareUpdateStatus) equals the expected value defined by the carbideapi.NewMockClient() setup so the test fails on mapping/coverage regressions.rla/internal/carbideapi/carbideproto/common.proto (1)
106-113: Avoid using_ONas zero enum value; consider_UNSPECIFIEDinstead.Line 107 sets
SYSTEM_POWER_CONTROL_ON = 0. In proto3, unset enum fields default to zero, so an omittedactionfield would unintentionally default to power-on. While current code paths explicitly set the action value, this design creates a safety risk for future refactoring or external API consumers.🐛 Proposed fix
enum SystemPowerControl { - SYSTEM_POWER_CONTROL_ON = 0; - SYSTEM_POWER_CONTROL_GRACEFUL_SHUTDOWN = 1; - SYSTEM_POWER_CONTROL_FORCE_OFF = 2; - SYSTEM_POWER_CONTROL_GRACEFUL_RESTART = 3; - SYSTEM_POWER_CONTROL_FORCE_RESTART = 4; - SYSTEM_POWER_CONTROL_AC_POWERCYCLE = 5; + SYSTEM_POWER_CONTROL_UNSPECIFIED = 0; + SYSTEM_POWER_CONTROL_ON = 1; + SYSTEM_POWER_CONTROL_GRACEFUL_SHUTDOWN = 2; + SYSTEM_POWER_CONTROL_FORCE_OFF = 3; + SYSTEM_POWER_CONTROL_GRACEFUL_RESTART = 4; + SYSTEM_POWER_CONTROL_FORCE_RESTART = 5; + SYSTEM_POWER_CONTROL_AC_POWERCYCLE = 6; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rla/internal/carbideapi/carbideproto/common.proto` around lines 106 - 113, The enum SystemPowerControl currently has SYSTEM_POWER_CONTROL_ON = 0 which makes omitted fields default to “ON”; change the zero value to a clearly unspecified sentinel (e.g., SYSTEM_POWER_CONTROL_UNSPECIFIED = 0), then renumber the other values so SYSTEM_POWER_CONTROL_ON is a non-zero value (and keep the existing symbolic names: SYSTEM_POWER_CONTROL_GRACEFUL_SHUTDOWN, SYSTEM_POWER_CONTROL_FORCE_OFF, etc.); update any call sites or switch/case logic that rely on the numeric ordering of SystemPowerControl to handle the new UNSPECIFIED value explicitly.rla/internal/task/componentmanager/powershelf/carbide/carbide.go (1)
90-92: Apply a consistent nil-client guard across all Carbide-backed methods.Only
InjectExpectationvalidatesm.carbideClient; other methods assume it is set. A shared guard/helper would prevent inconsistent failure modes.Also applies to: 141-142, 171-172, 220-221, 256-257
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rla/internal/task/componentmanager/powershelf/carbide/carbide.go` around lines 90 - 92, Add a single nil-guard helper on the Carbide manager and call it from every method that uses m.carbideClient: implement a method like ensureCarbideClient() error on the receiver that returns fmt.Errorf("carbide client is not configured") when m.carbideClient == nil, replace the inline nil check in InjectExpectation with a call to ensureCarbideClient(), and add that call at the top of every other Carbide-backed method in this file (the methods that currently access m.carbideClient—e.g., the ones around the original checks at the other comment locations) so all paths consistently validate the client before use.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rla/internal/carbideapi/carbideproto/forge.proto`:
- Around line 6565-6568: The response message
ListComponentFirmwareVersionsResponse is ambiguous because it only contains a
single ComponentResult while the request can target multiple component IDs;
change the schema to represent per-component results (either replace the
singular ComponentResult field with repeated ComponentResult results or use a
map<string, ComponentResult> results keyed by component_id) and move the
repeated string versions into that per-component result so each component has
its own success/failure and version list (refer to
ListComponentFirmwareVersionsResponse and ComponentResult and versions for where
to apply the change).
In `@rla/internal/carbideapi/grpc.go`:
- Around line 63-66: The carbide-api client currently falls back to plaintext
when certs.ErrNotPresent; change this to fail-closed unless an explicit opt-in
env var is set. In the grpc client initialization (the block handling
certs.ErrNotPresent and dialOpt /
grpc.WithTransportCredentials(insecure.NewCredentials())), replace the
unconditional insecure fallback with a check for an explicit env var (e.g.,
CARBIDE_API_ALLOW_INSECURE); if the var is not set, return/propagate an error
instead of setting insecure credentials and log a clear error; if the var is
set, log an explicit warning and then set
grpc.WithTransportCredentials(insecure.NewCredentials()) as before so behavior
matches PSM/NSM clients. Ensure you reference certs.ErrNotPresent,
grpc.WithTransportCredentials, insecure.NewCredentials(), and the grpc dial path
where dialOpt is set.
In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go`:
- Around line 293-295: The mapping for
pb.FirmwareUpdateState_FW_STATE_IN_PROGRESS incorrectly returns
operations.FirmwareUpdateStateQueued; change it to return the proper in-progress
internal state (e.g., operations.FirmwareUpdateStateInProgress) so
pb.FirmwareUpdateState_FW_STATE_IN_PROGRESS is represented as in-progress rather
than queued in the function that maps firmware states.
- Around line 203-209: The FirmwareControl implementation in Manager currently
returns an error and breaks NVLSwitch firmware workflows; replace that error
with a delegation to the existing Carbide firmware updater by calling the
StartFirmwareUpdate logic from Manager (i.e., have Manager.FirmwareControl
invoke m.StartFirmwareUpdate with the same context/target/info or adapt
parameters to match StartFirmwareUpdate's signature) so NVLSwitch firmware
control requests are handled by the active Carbide flow; alternatively, if you
prefer an API-level fix, remove FirmwareControl from the ComponentManager
interface so callers are forced to use StartFirmwareUpdate, but the immediate
fix is to delegate Manager.FirmwareControl -> m.StartFirmwareUpdate.
In `@rla/internal/task/componentmanager/powershelf/carbide/carbide.go`:
- Around line 146-150: The loop over resp.GetResults() only checks statuses and
thus treats missing results as success; instead validate that every requested
target has a corresponding result and that no requested target is missing or
failed. Modify the code around resp.GetResults() to (1) capture the slice of
requested component IDs used for this request (the variable passed into the
Carbide RPC call), (2) build a map from r.GetComponentId() -> result for
resp.GetResults(), (3) verify the map contains an entry for every requested ID
and that each entry has r.GetStatus() ==
pb.ComponentManagerStatusCode_COMPONENT_MANAGER_STATUS_CODE_SUCCESS (returning
an error if any are missing or non-success), and apply the same change to the
other loop referenced (the block around lines 225-229) to ensure per-target
coverage rather than just non-error statuses.
- Around line 261-268: The current code only populates result from
resp.GetStatuses(), dropping any requested component IDs that Carbide omitted;
pre-seed the result map with an entry for each requested component ID (e.g.
iterate req.GetComponentIds() or the variable holding the requested IDs) setting
operations.FirmwareUpdateStatus{ComponentID: id, State:
operations.FirmwareStateUnknown} (or the equivalent "Unknown" state) and then
loop resp.GetStatuses() as now to overwrite those entries using
s.GetResult().GetComponentId() and mapFirmwareState(s.GetState()) so omitted
components remain in result with Unknown state.
---
Nitpick comments:
In `@rla/internal/carbideapi/carbideproto/common.proto`:
- Around line 106-113: The enum SystemPowerControl currently has
SYSTEM_POWER_CONTROL_ON = 0 which makes omitted fields default to “ON”; change
the zero value to a clearly unspecified sentinel (e.g.,
SYSTEM_POWER_CONTROL_UNSPECIFIED = 0), then renumber the other values so
SYSTEM_POWER_CONTROL_ON is a non-zero value (and keep the existing symbolic
names: SYSTEM_POWER_CONTROL_GRACEFUL_SHUTDOWN, SYSTEM_POWER_CONTROL_FORCE_OFF,
etc.); update any call sites or switch/case logic that rely on the numeric
ordering of SystemPowerControl to handle the new UNSPECIFIED value explicitly.
In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide_test.go`:
- Around line 122-133: The test TestGetFirmwareUpdateStatus currently only
asserts statuses is not nil which can hide regressions; update it to assert the
map contains the expected component ID(s) and the expected state(s) provided by
the mock. After calling New(carbideapi.NewMockClient()) and
m.GetFirmwareUpdateStatus(...), assert that len(statuses) ==
len(target.ComponentIDs), that statuses contains the key target.ComponentIDs[0]
(e.g., "switch-1"), and that statuses["switch-1"].State (or the appropriate
field on the status struct returned by GetFirmwareUpdateStatus) equals the
expected value defined by the carbideapi.NewMockClient() setup so the test fails
on mapping/coverage regressions.
In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go`:
- Around line 277-284: The map currently only contains entries for component IDs
returned by Carbide (resp.GetStatuses()), causing missing entries for requested
IDs; change the logic that builds result (the block starting with result :=
make...) to first prefill result for every requested component ID with
operations.FirmwareUpdateStatus{ComponentID: <id>, State:
mapFirmwareStateUnknown/Unknown} (matching the approach used in GetPowerStatus),
then iterate resp.GetStatuses() and overwrite those entries with the real
statuses using s.GetResult().GetComponentId() and
mapFirmwareState(s.GetState()); ensure you reference and populate the same map
(result) and the same struct type operations.FirmwareUpdateStatus so callers
always see a deterministic entry for every requested component.
In `@rla/internal/task/componentmanager/powershelf/carbide/carbide_test.go`:
- Around line 123-134: TestGetFirmwareUpdateStatus currently only checks
statuses is non-nil; strengthen it by asserting the returned slice contains the
expected component ID "ps-1" and the expected state set up by the mock. Update
the test after calling m.GetFirmwareUpdateStatus to assert len(statuses) > 0 (or
exact length if known), find the entry for component ID "ps-1" and assert its ID
equals "ps-1" and its State (or Status field name returned by
GetFirmwareUpdateStatus) equals the expected value from the mock; reference the
test function TestGetFirmwareUpdateStatus, the manager method
GetFirmwareUpdateStatus, the constructor New(carbideapi.NewMockClient()), and
the Target.ComponentIDs/ComponentTypePowerShelf to locate and implement these
assertions.
In `@rla/internal/task/componentmanager/powershelf/carbide/carbide.go`:
- Around line 90-92: Add a single nil-guard helper on the Carbide manager and
call it from every method that uses m.carbideClient: implement a method like
ensureCarbideClient() error on the receiver that returns fmt.Errorf("carbide
client is not configured") when m.carbideClient == nil, replace the inline nil
check in InjectExpectation with a call to ensureCarbideClient(), and add that
call at the top of every other Carbide-backed method in this file (the methods
that currently access m.carbideClient—e.g., the ones around the original checks
at the other comment locations) so all paths consistently validate the client
before use.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 892a309a-1ba6-4128-bdb1-c11f55494c39
⛔ Files ignored due to path filters (3)
rla/internal/carbideapi/gen/common.pb.gois excluded by!**/*.pb.go,!**/gen/**rla/internal/carbideapi/gen/forge.pb.gois excluded by!**/*.pb.go,!**/gen/**rla/internal/carbideapi/gen/forge_grpc.pb.gois excluded by!**/*.pb.go,!**/gen/**
📒 Files selected for processing (13)
rla/cmd/serve.gorla/internal/carbideapi/carbideproto/common.protorla/internal/carbideapi/carbideproto/forge.protorla/internal/carbideapi/grpc.gorla/internal/carbideapi/mock.gorla/internal/carbideapi/mod.gorla/internal/carbideapi/model.gorla/internal/task/componentmanager/config.gorla/internal/task/componentmanager/nvlswitch/carbide/carbide.gorla/internal/task/componentmanager/nvlswitch/carbide/carbide_test.gorla/internal/task/componentmanager/powershelf/carbide/carbide.gorla/internal/task/componentmanager/powershelf/carbide/carbide_test.gorla/internal/task/componentmanager/providers/carbide/power.go
| message ListComponentFirmwareVersionsResponse { | ||
| ComponentResult result = 1; | ||
| repeated string versions = 2; | ||
| } |
There was a problem hiding this comment.
ListComponentFirmwareVersionsResponse is ambiguous for multi-target requests.
The request supports multiple component IDs, but the response has only one ComponentResult. This cannot represent per-component success/failure or per-component version sets.
📐 Suggested schema shape
+message ComponentFirmwareVersionsEntry {
+ ComponentResult result = 1;
+ repeated string versions = 2;
+}
+
message ListComponentFirmwareVersionsResponse {
- ComponentResult result = 1;
- repeated string versions = 2;
+ repeated ComponentFirmwareVersionsEntry entries = 1;
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rla/internal/carbideapi/carbideproto/forge.proto` around lines 6565 - 6568,
The response message ListComponentFirmwareVersionsResponse is ambiguous because
it only contains a single ComponentResult while the request can target multiple
component IDs; change the schema to represent per-component results (either
replace the singular ComponentResult field with repeated ComponentResult results
or use a map<string, ComponentResult> results keyed by component_id) and move
the repeated string versions into that per-component result so each component
has its own success/failure and version list (refer to
ListComponentFirmwareVersionsResponse and ComponentResult and versions for where
to apply the change).
| if err == certs.ErrNotPresent { | ||
| return nil, errors.New("Certificates not present, unable to authenticate with carbide-api") | ||
| log.Warn().Msg("Certs not present, using plaintext (insecure) connection to carbide-api") | ||
| dialOpt = grpc.WithTransportCredentials(insecure.NewCredentials()) | ||
| } else { |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -n 'ErrNotPresent|insecure.NewCredentials|CARBIDE_API_ALLOW_INSECURE' rla/internal/carbideapi/grpc.go
rg -n 'ALLOW.*INSECURE|insecure.NewCredentials' --type=goRepository: NVIDIA/bare-metal-manager-rest
Length of output: 1252
🏁 Script executed:
cat -n rla/internal/carbideapi/grpc.go | sed -n '50,75p'Repository: NVIDIA/bare-metal-manager-rest
Length of output: 1064
🏁 Script executed:
rg -n 'NewClient' rla/internal/carbideapi/ --type=go -A 2 -B 2Repository: NVIDIA/bare-metal-manager-rest
Length of output: 1382
🏁 Script executed:
rg -n 'insecure|plaintext|INSECURE' rla/internal/carbideapi/ --type=go -B 2 -A 2Repository: NVIDIA/bare-metal-manager-rest
Length of output: 872
🏁 Script executed:
rg -l 'ErrNotPresent' --type=go | head -5Repository: NVIDIA/bare-metal-manager-rest
Length of output: 240
🏁 Script executed:
rg -n 'ErrNotPresent' --type=go -B 3 -A 3Repository: NVIDIA/bare-metal-manager-rest
Length of output: 7732
🏁 Script executed:
rg -n 'carbide-api|carbide_api' rla/internal/carbideapi/ --type=go -B 2 -A 2 | head -40Repository: NVIDIA/bare-metal-manager-rest
Length of output: 2804
Align carbide-api client with fail-closed security posture used by PSM and NSM clients.
Carbide-api currently falls back to plaintext when certificates are absent (line 65), whereas similar clients (PSM, NSM) fail with an error instead. Since carbide-api handles management operations (power states, firmware updates, network interfaces), plaintext transport weakens security. Require explicit environment variable opt-in before allowing insecure fallback, mirroring how TLS failures are handled in other components.
Suggested change
tlsConfig, _, err := certs.TLSConfig()
if err != nil {
if err == certs.ErrNotPresent {
- log.Warn().Msg("Certs not present, using plaintext (insecure) connection to carbide-api")
- dialOpt = grpc.WithTransportCredentials(insecure.NewCredentials())
+ if os.Getenv("CARBIDE_API_ALLOW_INSECURE") != "true" {
+ return nil, errors.New("TLS certs not present; refusing insecure carbide-api connection")
+ }
+ log.Warn().Msg("CARBIDE_API_ALLOW_INSECURE=true, using plaintext connection to carbide-api")
+ dialOpt = grpc.WithTransportCredentials(insecure.NewCredentials())
} else {
return nil, err
}
} else {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if err == certs.ErrNotPresent { | |
| return nil, errors.New("Certificates not present, unable to authenticate with carbide-api") | |
| log.Warn().Msg("Certs not present, using plaintext (insecure) connection to carbide-api") | |
| dialOpt = grpc.WithTransportCredentials(insecure.NewCredentials()) | |
| } else { | |
| if err == certs.ErrNotPresent { | |
| if os.Getenv("CARBIDE_API_ALLOW_INSECURE") != "true" { | |
| return nil, errors.New("TLS certs not present; refusing insecure carbide-api connection") | |
| } | |
| log.Warn().Msg("CARBIDE_API_ALLOW_INSECURE=true, using plaintext connection to carbide-api") | |
| dialOpt = grpc.WithTransportCredentials(insecure.NewCredentials()) | |
| } else { |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rla/internal/carbideapi/grpc.go` around lines 63 - 66, The carbide-api client
currently falls back to plaintext when certs.ErrNotPresent; change this to
fail-closed unless an explicit opt-in env var is set. In the grpc client
initialization (the block handling certs.ErrNotPresent and dialOpt /
grpc.WithTransportCredentials(insecure.NewCredentials())), replace the
unconditional insecure fallback with a check for an explicit env var (e.g.,
CARBIDE_API_ALLOW_INSECURE); if the var is not set, return/propagate an error
instead of setting insecure credentials and log a clear error; if the var is
set, log an explicit warning and then set
grpc.WithTransportCredentials(insecure.NewCredentials()) as before so behavior
matches PSM/NSM clients. Ensure you reference certs.ErrNotPresent,
grpc.WithTransportCredentials, insecure.NewCredentials(), and the grpc dial path
where dialOpt is set.
| func (m *Manager) FirmwareControl( | ||
| ctx context.Context, | ||
| target common.Target, | ||
| info operations.FirmwareControlTaskInfo, | ||
| ) error { | ||
| // TODO: Implement firmware control | ||
| switch info.Operation { | ||
| case operations.FirmwareOperationUpgrade: | ||
| // Implement firmware upgrade | ||
| return fmt.Errorf("firmware upgrade not yet implemented for NVLSwitch") | ||
| case operations.FirmwareOperationDowngrade: | ||
| // Implement firmware downgrade | ||
| return fmt.Errorf("firmware downgrade not yet implemented for NVLSwitch") | ||
| case operations.FirmwareOperationRollback: | ||
| // Implement firmware rollback | ||
| return fmt.Errorf("firmware rollback not yet implemented for NVLSwitch") | ||
| case operations.FirmwareOperationVersion: | ||
| // Implement firmware version | ||
| return fmt.Errorf("firmware version not yet implemented for NVLSwitch") | ||
| default: | ||
| return fmt.Errorf("unknown firmware operation: %v", info.Operation) | ||
| } | ||
| return fmt.Errorf("FirmwareControl not yet implemented for NVLSwitch via Carbide; use StartFirmwareUpdate") | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find firmware control call paths and task routing.
rg -n --type go -C2 '\.FirmwareControl\s*\('
rg -n --type go -C2 'TaskTypeFirmwareControl|StartFirmwareUpdate|GetFirmwareUpdateStatus'Repository: NVIDIA/bare-metal-manager-rest
Length of output: 50386
🏁 Script executed:
# Check context around the FirmwareControl call in manager.go
sed -n '515,535p' rla/internal/task/manager/manager.go
# Check context around the FirmwareControl call in activity.go
sed -n '80,95p' rla/internal/task/executor/temporalworkflow/activity/activity.go
# Check the executor interface/implementation
rg -n --type go 'type.*Executor.*interface|func.*FirmwareControl' rla/internal/task/executor/ -A 3 -B 1 | head -100Repository: NVIDIA/bare-metal-manager-rest
Length of output: 8630
🏁 Script executed:
# Find the validAndGetComponentManager function
rg -n --type go 'func validAndGetComponentManager' -A 20
# Also check what component managers exist and how targets are routed
rg -n --type go 'componentmanager|ComponentManager' rla/internal/task/executor/temporalworkflow/activity/activity.go -B 5 -A 5 | head -80Repository: NVIDIA/bare-metal-manager-rest
Length of output: 4305
🏁 Script executed:
# Find the GetComponentManager function
rg -n --type go 'func GetComponentManager' -A 30
# Also check the target types
rg -n --type go 'type.*TargetType|TargetTypeNVLSwitch' -B 2 -A 2 | head -50Repository: NVIDIA/bare-metal-manager-rest
Length of output: 790
🏁 Script executed:
# Find registry initialization
rg -n --type go 'cmRegistry\s*=' -B 3 -A 3
# Find where NVLSwitch component manager is registered
rg -n --type go -i 'nvlswitch|carbide' rla/internal/task/executor/temporalworkflow/activity/ -B 2 -A 2
# Find the registry definition
rg -n --type go 'type.*registry|ComponentManagerRegistry' -A 15 | head -80Repository: NVIDIA/bare-metal-manager-rest
Length of output: 9125
🏁 Script executed:
# Find where nvlswitchcarbide is defined and check its Register function
fd -t f 'carbide.go|register.go' rla/internal/task/componentmanager/nvlswitch/carbide/ -x cat {}
# Check what the Register function does
rg -n --type go 'func Register' rla/internal/task/componentmanager/ -A 10 | grep -A 10 'nvlswitch'Repository: NVIDIA/bare-metal-manager-rest
Length of output: 14080
🏁 Script executed:
# Find the Registry implementation to understand how it selects managers
rg -n --type go 'type Registry struct' -A 30 rla/internal/task/componentmanager/
# Also check the GetManager method
rg -n --type go 'func.*GetManager' -A 15 rla/internal/task/componentmanager/Repository: NVIDIA/bare-metal-manager-rest
Length of output: 4358
🏁 Script executed:
# Check how the active manager is selected during initialization
rg -n --type go 'SetActiveManager|active\[' -B 3 -A 3 rla/internal/task/componentmanager/componentmanager.go
# Check the Initialize method in the Registry
rg -n --type go 'func.*Initialize' -A 30 rla/internal/task/componentmanager/componentmanager.goRepository: NVIDIA/bare-metal-manager-rest
Length of output: 1533
🏁 Script executed:
# Check the config to see which implementations are selected as default
rg -n --type go 'ComponentManagers' -B 5 -A 5 rla/cmd/serve.go | head -60
# Also check for any configuration files or defaults
fd -t f 'config|Config' rla/cmd/ -x grep -l 'ComponentManagers\|nvlswitch\|carbide' {}Repository: NVIDIA/bare-metal-manager-rest
Length of output: 56
🏁 Script executed:
# Check the serve.go file for where cmConfig is defined and used
sed -n '200,250p' rla/cmd/serve.go
# Look for where ComponentManagers is populated in the config
rg -n --type go 'ComponentManagers\s*:' -B 5 -A 5Repository: NVIDIA/bare-metal-manager-rest
Length of output: 4630
FirmwareControl will break active NVLSwitch firmware workflows.
This method is called by the firmware control activity (rla/internal/task/executor/temporalworkflow/activity/activity.go:90) for any NVLSwitch target. In production, the Carbide implementation is the active manager for NVLSwitch (rla/internal/task/componentmanager/config.go:242), so existing firmware control workflows will now fail with the error returned from this method. Use StartFirmwareUpdate instead, or remove this method from the ComponentManager interface to force callers to route through the correct method.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go` around lines
203 - 209, The FirmwareControl implementation in Manager currently returns an
error and breaks NVLSwitch firmware workflows; replace that error with a
delegation to the existing Carbide firmware updater by calling the
StartFirmwareUpdate logic from Manager (i.e., have Manager.FirmwareControl
invoke m.StartFirmwareUpdate with the same context/target/info or adapt
parameters to match StartFirmwareUpdate's signature) so NVLSwitch firmware
control requests are handled by the active Carbide flow; alternatively, if you
prefer an API-level fix, remove FirmwareControl from the ComponentManager
interface so callers are forced to use StartFirmwareUpdate, but the immediate
fix is to delegate Manager.FirmwareControl -> m.StartFirmwareUpdate.
| case pb.FirmwareUpdateState_FW_STATE_IN_PROGRESS: | ||
| return operations.FirmwareUpdateStateQueued | ||
| case pb.FirmwareUpdateState_FW_STATE_VERIFYING: |
There was a problem hiding this comment.
Map FW_STATE_IN_PROGRESS to an in-progress internal state (not queued).
Line 294 currently downgrades an active update to queued, which misreports progress and can break workflow decisions.
🐛 Proposed fix
case pb.FirmwareUpdateState_FW_STATE_IN_PROGRESS:
- return operations.FirmwareUpdateStateQueued
+ return operations.FirmwareUpdateStateInProgress🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rla/internal/task/componentmanager/nvlswitch/carbide/carbide.go` around lines
293 - 295, The mapping for pb.FirmwareUpdateState_FW_STATE_IN_PROGRESS
incorrectly returns operations.FirmwareUpdateStateQueued; change it to return
the proper in-progress internal state (e.g.,
operations.FirmwareUpdateStateInProgress) so
pb.FirmwareUpdateState_FW_STATE_IN_PROGRESS is represented as in-progress rather
than queued in the function that maps firmware states.
| for _, r := range resp.GetResults() { | ||
| if r.GetStatus() != pb.ComponentManagerStatusCode_COMPONENT_MANAGER_STATUS_CODE_SUCCESS { | ||
| return fmt.Errorf("power control failed for %s: %s", r.GetComponentId(), r.GetError()) | ||
| } | ||
| } |
There was a problem hiding this comment.
Validate per-target result coverage, not just non-error statuses.
Both loops accept empty/partial result sets as success. If Carbide omits one requested component, the operation is reported successful incorrectly.
✅ Suggested validation helper
+func ensureAllTargetsSucceeded(targetIDs []string, results []*pb.ComponentResult, op string) error {
+ expected := make(map[string]struct{}, len(targetIDs))
+ for _, id := range targetIDs {
+ expected[id] = struct{}{}
+ }
+ seen := make(map[string]struct{}, len(results))
+ for _, r := range results {
+ id := r.GetComponentId()
+ if _, ok := expected[id]; !ok {
+ return fmt.Errorf("%s returned unexpected component result: %s", op, id)
+ }
+ seen[id] = struct{}{}
+ if r.GetStatus() != pb.ComponentManagerStatusCode_COMPONENT_MANAGER_STATUS_CODE_SUCCESS {
+ return fmt.Errorf("%s failed for %s: %s", op, id, r.GetError())
+ }
+ }
+ for id := range expected {
+ if _, ok := seen[id]; !ok {
+ return fmt.Errorf("%s missing result for %s", op, id)
+ }
+ }
+ return nil
+}
@@
- for _, r := range resp.GetResults() {
- if r.GetStatus() != pb.ComponentManagerStatusCode_COMPONENT_MANAGER_STATUS_CODE_SUCCESS {
- return fmt.Errorf("power control failed for %s: %s", r.GetComponentId(), r.GetError())
- }
- }
+ if err := ensureAllTargetsSucceeded(target.ComponentIDs, resp.GetResults(), "power control"); err != nil {
+ return err
+ }
@@
- for _, r := range resp.GetResults() {
- if r.GetStatus() != pb.ComponentManagerStatusCode_COMPONENT_MANAGER_STATUS_CODE_SUCCESS {
- return fmt.Errorf("firmware update failed for %s: %s", r.GetComponentId(), r.GetError())
- }
- }
+ if err := ensureAllTargetsSucceeded(target.ComponentIDs, resp.GetResults(), "firmware update"); err != nil {
+ return err
+ }Also applies to: 225-229
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rla/internal/task/componentmanager/powershelf/carbide/carbide.go` around
lines 146 - 150, The loop over resp.GetResults() only checks statuses and thus
treats missing results as success; instead validate that every requested target
has a corresponding result and that no requested target is missing or failed.
Modify the code around resp.GetResults() to (1) capture the slice of requested
component IDs used for this request (the variable passed into the Carbide RPC
call), (2) build a map from r.GetComponentId() -> result for resp.GetResults(),
(3) verify the map contains an entry for every requested ID and that each entry
has r.GetStatus() ==
pb.ComponentManagerStatusCode_COMPONENT_MANAGER_STATUS_CODE_SUCCESS (returning
an error if any are missing or non-success), and apply the same change to the
other loop referenced (the block around lines 225-229) to ensure per-target
coverage rather than just non-error statuses.
| result := make(map[string]operations.FirmwareUpdateStatus, len(resp.GetStatuses())) | ||
| for _, s := range resp.GetStatuses() { | ||
| compID := s.GetResult().GetComponentId() | ||
| result[compID] = operations.FirmwareUpdateStatus{ | ||
| ComponentID: compID, | ||
| State: mapFirmwareState(s.GetState()), | ||
| } | ||
| } |
There was a problem hiding this comment.
Return status entries for all requested component IDs.
This map is built only from response rows. If Carbide omits a requested component, that component disappears from the result entirely rather than showing Unknown.
🧭 Suggested fix
- result := make(map[string]operations.FirmwareUpdateStatus, len(resp.GetStatuses()))
+ result := make(map[string]operations.FirmwareUpdateStatus, len(target.ComponentIDs))
+ for _, id := range target.ComponentIDs {
+ result[id] = operations.FirmwareUpdateStatus{
+ ComponentID: id,
+ State: operations.FirmwareUpdateStateUnknown,
+ }
+ }
for _, s := range resp.GetStatuses() {
compID := s.GetResult().GetComponentId()
- result[compID] = operations.FirmwareUpdateStatus{
+ if _, ok := result[compID]; !ok {
+ continue // or return an error for unexpected component IDs
+ }
+ result[compID] = operations.FirmwareUpdateStatus{
ComponentID: compID,
State: mapFirmwareState(s.GetState()),
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| result := make(map[string]operations.FirmwareUpdateStatus, len(resp.GetStatuses())) | |
| for _, s := range resp.GetStatuses() { | |
| compID := s.GetResult().GetComponentId() | |
| result[compID] = operations.FirmwareUpdateStatus{ | |
| ComponentID: compID, | |
| State: mapFirmwareState(s.GetState()), | |
| } | |
| } | |
| result := make(map[string]operations.FirmwareUpdateStatus, len(target.ComponentIDs)) | |
| for _, id := range target.ComponentIDs { | |
| result[id] = operations.FirmwareUpdateStatus{ | |
| ComponentID: id, | |
| State: operations.FirmwareUpdateStateUnknown, | |
| } | |
| } | |
| for _, s := range resp.GetStatuses() { | |
| compID := s.GetResult().GetComponentId() | |
| if _, ok := result[compID]; !ok { | |
| continue // or return an error for unexpected component IDs | |
| } | |
| result[compID] = operations.FirmwareUpdateStatus{ | |
| ComponentID: compID, | |
| State: mapFirmwareState(s.GetState()), | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@rla/internal/task/componentmanager/powershelf/carbide/carbide.go` around
lines 261 - 268, The current code only populates result from resp.GetStatuses(),
dropping any requested component IDs that Carbide omitted; pre-seed the result
map with an entry for each requested component ID (e.g. iterate
req.GetComponentIds() or the variable holding the requested IDs) setting
operations.FirmwareUpdateStatus{ComponentID: id, State:
operations.FirmwareStateUnknown} (or the equivalent "Unknown" state) and then
loop resp.GetStatuses() as now to overwrite those entries using
s.GetResult().GetComponentId() and mapFirmwareState(s.GetState()) so omitted
components remain in result with Unknown state.
Description
feat: implement carbide provider for switch and powershelf component managers
Type of Change
Services Affected
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes