fix(api,state-controller): Add cloud-init and state-machine fix for link-type flip#2511
Conversation
Summary by CodeRabbitRelease Notes
WalkthroughThis PR implements a complete host boot repair sub-sequence within the DPU reprovision state machine, introducing nine new ChangesDPU Host Boot Repair State Machine and PXE Ethernet Normalization
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
32edc03 to
d10a540
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
crates/api-model/src/machine/mod.rs (1)
1819-1819: Clarify (or remove)PartialOrd/Ordderives for machine state/config types
serdeserialization does not depend onOrd/PartialOrd; it uses declared field/variant structure. Incrates/api-model/src/machine/mod.rs, the nearby docs forBiosConfigInfo/BiosConfigState(and the analogous boot-order types) describe job/state tracking, not ordering semantics. Additionally, repo-wide search found no usage of these types asBTreeMap/BTreeSetkeys and no directcmp/partial_cmp/sort_by*calls on them.If
Ord/PartialOrdis meant for a specific purpose (e.g., defining a “progression” ordering for state-machine checks or sorting), document that intent close to the derives. Otherwise, consider removing thePartialOrd/Ordderives fromBiosConfigInfo,BiosConfigState,SetBootOrderInfo,SetBootOrderState,UnlockHostState, andPowerStateto avoid implying meaningful ordering where none is currently used.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-model/src/machine/mod.rs` at line 1819, The derive list including PartialOrd and Ord on types like BiosConfigInfo, BiosConfigState, SetBootOrderInfo, SetBootOrderState, UnlockHostState, and PowerState incorrectly implies meaningful ordering; either remove PartialOrd and Ord from the derive attribute near the #[derive(Debug, Clone, Serialize, Deserialize, Eq, PartialEq, PartialOrd, Ord)] line for those types, or alternatively add a concise doc comment next to each type explaining the intended “progression” ordering semantics if those comparisons are intentional; update the derives or add documentation consistently for each referenced type so the intent is clear.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/api-core/src/tests/dpu_reprovisioning.rs`:
- Around line 115-185: The helper assert_dpu_reprovision_host_boot_repair
currently validates intermediate repair states but doesn't enforce the
post-repair invariant that reprovision_requested is cleared; fix by advancing
the machine one more iteration after the state loop (use
machine.next_iteration_machine(env).await) and assert that the returned
Machine's reprovision_requested.is_none(), ensuring the final host
reboot/cleanup clears reprovision_requested; keep this check in
assert_dpu_reprovision_host_boot_repair alongside existing Redfish action
assertions.
In `@crates/machine-controller/src/handler.rs`:
- Around line 2672-2678: The closure update_all_dpus_to_reprovision_state (and
similar fan-out/validation code at the other spots) currently uses
state.dpu_snapshots.iter().map(|dpu| &dpu.id).collect_vec() which expands to
every DPU; change these call sites to carry and use the original reprovision
target set (the IDs from reprovision_requested) instead of all dpu_snapshots so
only reprovision-targeted DPUs are switched/validated; locate uses of
reprovision_state.next_state_with_all_dpus_updated and replace the all-DPU
iterator with the collected IDs from the reprovision_requested subset (pass the
reprovision_targets vector through the fan-out/validation closures/functions) so
host-boot-repair logic runs only for intended DPUs.
In `@crates/redfish/src/libredfish/test_support.rs`:
- Around line 59-60: is_boot_order_setup is currently stored on RedfishSimState
(global) causing cross-host leakage; change its scope to per-host state (e.g.,
add is_boot_order_setup: Option<bool> inside the host representation used by
RedfishSimState or maintain a HashMap keyed by host id) and update
RedfishSimClient call sites that read or write it (notably
set_boot_order_dpu_first and the readiness checks currently using
is_boot_order_setup) to use the host-scoped field or map lookup. Ensure
RedfishSimState, RedfishSimClient, and any helper methods that reference
is_boot_order_setup (including the logic invoked by set_boot_order_dpu_first and
the readiness/skip-remediation path) are updated to use the host-specific value
so one host's operation cannot mark another host as ready.
---
Nitpick comments:
In `@crates/api-model/src/machine/mod.rs`:
- Line 1819: The derive list including PartialOrd and Ord on types like
BiosConfigInfo, BiosConfigState, SetBootOrderInfo, SetBootOrderState,
UnlockHostState, and PowerState incorrectly implies meaningful ordering; either
remove PartialOrd and Ord from the derive attribute near the #[derive(Debug,
Clone, Serialize, Deserialize, Eq, PartialEq, PartialOrd, Ord)] line for those
types, or alternatively add a concise doc comment next to each type explaining
the intended “progression” ordering semantics if those comparisons are
intentional; update the derives or add documentation consistently for each
referenced type so the intent is clear.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5ea446b0-8be2-4c92-86a6-48dfeaf4e69f
📒 Files selected for processing (7)
crates/api-core/src/tests/dpu_reprovisioning.rscrates/api-model/src/machine/mod.rscrates/api/files/bf.cfgcrates/machine-controller/src/handler.rscrates/machine-controller/src/handler/helpers.rscrates/redfish/src/libredfish/test_support.rspxe/templates/user-data
d10a540 to
fdf4517
Compare
|
This can be a problem for DPF now. DPF provisioned DPUs do not read cloud-init so these DPUs will never update the link type. |
@abvarshney-nv Glad you're looking at this so I didn't have to ping you explicitly 😛 I thought I saw it as part of a DPU flavor. |
fdf4517 to
e9c68ec
Compare
no, we are not using the DPF's feature to change the link as we were relying on site-explorer. Can't we rely again on site-explorer? Let the SE set the link again and after that state machine can continue reprovisioning. |
@abvarshney-nv Are you thinking of DPU/NIC mode (different from link-type)? EDIT: |
78d4dbf to
ebdbb08
Compare
…B-to-ETH link-type flip
ebdbb08 to
b38d80a
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/api-core/src/tests/dpu_reprovisioning.rs (1)
1584-1589: 💤 Low valueMinor: Timing-sensitive restart detection gate.
The 1ms sleep ensures the restart request timestamp exceeds
failed_at. While this is a recognized pattern for timestamp-ordered tests, it introduces slight fragility under heavy system load. The comment at line 1585 appropriately documents the intent.Consider using a deterministic approach if test flakiness is observed—for example, explicitly setting
restart_reprovision_requested_atto a timestamp guaranteed to be afterfailed_atvia a test helper.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/api-core/src/tests/dpu_reprovisioning.rs` around lines 1584 - 1589, The restart detection gate relies on a 1ms sleep to ensure the restart request timestamp exceeds failed_at, which can be fragile under heavy system load. If test flakiness is observed in the trigger_dpu_reprovisioning call with Mode::Restart, consider implementing a deterministic alternative by creating or using a test helper that explicitly sets restart_reprovision_requested_at to a timestamp guaranteed to be after failed_at, rather than depending on timing of the sleep operation to guarantee the ordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/api-core/src/tests/dpu_reprovisioning.rs`:
- Around line 1584-1589: The restart detection gate relies on a 1ms sleep to
ensure the restart request timestamp exceeds failed_at, which can be fragile
under heavy system load. If test flakiness is observed in the
trigger_dpu_reprovisioning call with Mode::Restart, consider implementing a
deterministic alternative by creating or using a test helper that explicitly
sets restart_reprovision_requested_at to a timestamp guaranteed to be after
failed_at, rather than depending on timing of the sleep operation to guarantee
the ordering.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 51cc20b0-8a8f-4f7c-a8d5-c51a4dd3fdc7
📒 Files selected for processing (6)
crates/api-core/src/tests/dpu_reprovisioning.rscrates/api-model/src/machine/mod.rscrates/api/files/bf.cfgcrates/machine-controller/src/handler.rscrates/machine-controller/src/handler/helpers.rscrates/redfish/src/libredfish/test_support.rs
💤 Files with no reviewable changes (5)
- crates/machine-controller/src/handler/helpers.rs
- crates/api/files/bf.cfg
- crates/redfish/src/libredfish/test_support.rs
- crates/api-model/src/machine/mod.rs
- crates/machine-controller/src/handler.rs
Description
A bug was reported after a DPU was replaced that led to finding that a) the DPU was set to IB link-type and b) that flipping the mode triggered the interface to no longer be the boot device and to also be disabled on Dell machines.
This PR attempts to cover the "unexpected link-type" case for both new machines and existing machines getting a DPU replaced.
Also fixed the related multi-DPU reprovision barrier bug so it checks each iterated DPU’s state instead of the current handler DPU repeatedly.
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes
Key Transition Change
Key Handler Behavior Diff
Before:
After:
NVBUG 5966641
NOTE: In the process of this PR, it was noticed that the Redfish boot-order simulator state is global, so multi-host tests can get false positives. #2597 has been opened for this.