
fix(api,state-controller): Add cloud-init and state-machine fix for link-type flip #2511

Merged
ajf merged 1 commit into NVIDIA:main from bcavnvidia:cloud-init-vrf on Jun 16, 2026

Conversation

@bcavnvidia

@bcavnvidia bcavnvidia commented Jun 12, 2026

Contributor

Description

A bug reported after a DPU replacement led to two findings: a) the DPU was set to the IB link-type, and b) flipping the mode caused the interface to no longer be the boot device and also to be disabled on Dell machines.

This PR attempts to cover the "unexpected link-type" case for both new machines and existing machines getting a DPU replaced.

  • cloud-init handles the link-type flip and reboot. I.e., it now normalizes VPI link type to Ethernet before management-network-dependent setup runs.
  • DPU reprovision now explicitly repairs host boot configuration after DPU provisioning is healthy, using the existing Redfish machine_setup, BIOS job polling, BIOS verification, and DPU-first boot-order flow before final reboot.
    • Added early legacy DPU cloud-init LINK_TYPE_P1/P2 normalization. It logs locally, tolerates hardware that does not support LINK_TYPE, and sets Ethernet mode.
    • Added DPU reprovision host boot-repair states
    • After DPU network sync, reprovision now runs host machine_setup, polls BIOS jobs, verifies BIOS setup, then runs DPU-first boot order before the existing BMC/host reboot path.

Also fixed the related multi-DPU reprovision barrier bug: the barrier now checks each iterated DPU’s state instead of repeatedly checking the current handler's DPU.
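The barrier bug described above can be illustrated with a minimal sketch. The types and function names here (`DpuSnapshot`, `all_reached_buggy`, `all_reached_fixed`, `lagging_dpus`) are hypothetical stand-ins and are not taken from the PR's code; the point is only the difference between re-checking one DPU and checking each iterated DPU.

```rust
/// Hypothetical, simplified stand-in for a per-DPU reprovision state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DpuState {
    WaitingForNetworkConfig,
    PrepareHostBootRepair,
}

/// Hypothetical snapshot of one DPU attached to a host.
#[derive(Debug, Clone, Copy)]
struct DpuSnapshot {
    id: u32,
    state: DpuState,
}

/// Buggy shape: the loop iterates every DPU but always inspects `current`,
/// so the barrier opens as soon as the handler's own DPU is ready.
fn all_reached_buggy(dpus: &[DpuSnapshot], current: &DpuSnapshot, want: DpuState) -> bool {
    dpus.iter().all(|_dpu| current.state == want)
}

/// Fixed shape: each iterated DPU's own state is inspected.
fn all_reached_fixed(dpus: &[DpuSnapshot], want: DpuState) -> bool {
    dpus.iter().all(|dpu| dpu.state == want)
}

/// IDs of the DPUs that still hold the barrier closed.
fn lagging_dpus(dpus: &[DpuSnapshot], want: DpuState) -> Vec<u32> {
    dpus.iter()
        .filter(|dpu| dpu.state != want)
        .map(|dpu| dpu.id)
        .collect()
}
```

With two DPUs where only the handler's DPU has advanced, the buggy check passes while the fixed check correctly reports the second DPU as lagging.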

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Key Transition Change

  Before:

  WaitingForNetworkConfig
    -> RebootHostBmc
    -> RebootHost
    -> HostInit/Ready

  After:

  WaitingForNetworkConfig
    -> PrepareHostBootRepair
    -> UnlockHostForBootRepair          # only when lockdown is enabled
    -> CheckHostBootConfig
       or CheckHostBootConfigAfterHostReboot
    -> ConfigureHostBoot
    -> WaitingForHostBiosJob            # vendor job path, if needed
    -> PollingHostBiosSetup
    -> SetHostBootOrder
    -> LockHostAfterBootRepair
    -> RebootHostBmc
    -> RebootHost
    -> HostInit/Ready
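
The "After" sequence can be read as a small transition function. The sketch below is an illustration only: `Step`, `HostFacts`, `next_step`, and `full_path` are hypothetical names, the real `ReprovisionState` variants carry payloads that are omitted here, and the branch conditions (unlock only when lockdown is enabled, unlock leading to `CheckHostBootConfigAfterHostReboot`, the BIOS job wait only on the vendor job path) are assumptions drawn from the comments in the listing.

```rust
/// Simplified, hypothetical mirror of the states in the "After" listing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Step {
    WaitingForNetworkConfig,
    PrepareHostBootRepair,
    UnlockHostForBootRepair,
    CheckHostBootConfig,
    CheckHostBootConfigAfterHostReboot,
    ConfigureHostBoot,
    WaitingForHostBiosJob,
    PollingHostBiosSetup,
    SetHostBootOrder,
    LockHostAfterBootRepair,
    RebootHostBmc,
    RebootHost,
    HostInitOrReady,
}

/// Assumed inputs that select the optional branches.
#[derive(Debug, Clone, Copy)]
struct HostFacts {
    lockdown_enabled: bool,
    needs_vendor_bios_job: bool,
}

/// Returns the step that follows `step`, or `None` at the end of the flow.
fn next_step(step: Step, host: HostFacts) -> Option<Step> {
    use Step::*;
    Some(match step {
        WaitingForNetworkConfig => PrepareHostBootRepair,
        PrepareHostBootRepair if host.lockdown_enabled => UnlockHostForBootRepair,
        PrepareHostBootRepair => CheckHostBootConfig,
        UnlockHostForBootRepair => CheckHostBootConfigAfterHostReboot,
        CheckHostBootConfig | CheckHostBootConfigAfterHostReboot => ConfigureHostBoot,
        ConfigureHostBoot if host.needs_vendor_bios_job => WaitingForHostBiosJob,
        ConfigureHostBoot | WaitingForHostBiosJob => PollingHostBiosSetup,
        PollingHostBiosSetup => SetHostBootOrder,
        SetHostBootOrder => LockHostAfterBootRepair,
        LockHostAfterBootRepair => RebootHostBmc,
        RebootHostBmc => RebootHost,
        RebootHost => HostInitOrReady,
        HostInitOrReady => return None,
    })
}

/// Walks the whole sequence starting at WaitingForNetworkConfig.
fn full_path(host: HostFacts) -> Vec<Step> {
    let mut current = Step::WaitingForNetworkConfig;
    let mut path = vec![current];
    while let Some(next) = next_step(current, host) {
        path.push(next);
        current = next;
    }
    path
}
```

Calling `full_path` with `lockdown_enabled: false` skips the unlock step, while enabling both flags adds the unlock and BIOS job steps to the walk.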

Key Handler Behavior Diff

Before:

  • DPU reprovision validated DPU network config, cleared reprovision requests, and went straight to BMC/host reboot.
  • Host BIOS/boot-order repair was not part of the DPU reprovision path.
  • Lockdown could remain enabled while trying to inspect or modify host boot config.
  • CheckHostBootConfig freshness logic could reject a previously accepted DPU observation because state transitions advanced the host state timestamp.

After:

  • DPU reprovision keeps requests active until the final RebootHost transition succeeds.
  • Host boot repair runs before final reboot: unlock if needed, check boot config, run machine_setup, set DPU-first boot order, relock.
  • Normal DPU reprovision CheckHostBootConfig trusts the DPU observation already accepted by WaitingForNetworkConfig.
  • Post-unlock host reboot uses CheckHostBootConfigAfterHostReboot, requiring DPU observations after the recorded host reboot request time.
  • Assigned platform config keeps its original fresh-DPU gate behavior.
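
The two freshness behaviors in the list above (trusting an already accepted DPU observation versus requiring one newer than the recorded host reboot request) might be sketched as follows. This is loosely modeled on the `is_dpu_observed_since` predicate named in the review summary, but the signature, the `FreshnessGate` enum, and `gate_passes` are assumptions made for illustration, not the PR's actual code.

```rust
use std::time::SystemTime;

/// Hypothetical gate describing how fresh a DPU observation must be.
#[derive(Debug, Clone, Copy)]
enum FreshnessGate {
    /// Trust the observation that an earlier state already accepted.
    TrustAccepted,
    /// Require an observation strictly newer than this instant,
    /// for example the recorded host reboot request time.
    ObservedAfter(SystemTime),
}

/// True when the DPU was observed strictly after `since`.
fn observed_since(last_observed: Option<SystemTime>, since: SystemTime) -> bool {
    matches!(last_observed, Some(seen) if seen > since)
}

/// Applies the chosen gate to the last observation time, if any.
fn gate_passes(gate: FreshnessGate, last_observed: Option<SystemTime>) -> bool {
    match gate {
        FreshnessGate::TrustAccepted => last_observed.is_some(),
        FreshnessGate::ObservedAfter(since) => observed_since(last_observed, since),
    }
}
```

Because the comparison is against a recorded request time rather than the host state timestamp, intermediate state transitions do not invalidate an observation in this sketch.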

NVBUG 5966641

NOTE: In the process of this PR, it was noticed that the Redfish boot-order simulator state is global, so multi-host tests can get false positives. #2597 has been opened for this.

@bcavnvidia bcavnvidia requested a review from a team as a code owner June 12, 2026 15:01
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Contributor

Review Change Stack

Summary by CodeRabbit

Release Notes

  • New Features

    • Added host boot repair orchestration as part of DPU reprovisioning, including BIOS configuration, boot order setup, and BMC lockdown management.
    • Implemented Ethernet link type normalization for Mellanox network interfaces during provisioning.
    • Enhanced boot configuration validation with DPU freshness checks to ensure safer remediation sequencing.
  • Improvements

    • Strengthened reprovision state machine with deterministic ordering for recovery validation.

Walkthrough

This PR implements a complete host boot repair sub-sequence within the DPU reprovision state machine, introducing nine new ReprovisionState variants, DPU observation-freshness-gated decision logic, vendor-specific BMC lockdown choreography, and centralized test assertion infrastructure. It also adds Mellanox ethernet link-type normalization (ensure_ethernet_link_type) to both PXE provisioning scripts.

Changes

DPU Host Boot Repair State Machine and PXE Ethernet Normalization

Layer / File(s) Summary
ReprovisionState host-boot-repair variants and ordering derives
crates/api-model/src/machine/mod.rs
Adds nine ReprovisionState variants (PrepareHostBootRepair through LockHostAfterBootRepair) with payload-bearing fields. Extends PartialOrd/Ord to BiosConfigInfo, BiosConfigState, SetBootOrderInfo, SetBootOrderState, UnlockHostState, and PowerState. Adds JSON deserialization test coverage for the new variants.
Multi-DPU state transition routing
crates/machine-controller/src/handler/helpers.rs
Routes WaitingForNetworkConfig -> PrepareHostBootRepair for reprovision-targeted DPUs only and adds SetHostBootOrder -> LockHostAfterBootRepair -> RebootHostBmc in next_state_with_all_dpus_updated.
HostBootConfigDecision/DpuFreshness enums and freshness utilities
crates/machine-controller/src/handler.rs
Defines private HostBootConfigDecision and HostBootConfigDpuFreshness enums. Adds is_dpu_observed_since freshness predicate and update_reprovision_targets_to_reprovision_state factored closure.
Host boot repair handler flow and boot-config decisioning
crates/machine-controller/src/handler.rs
Adds full ReprovisionState handler arms for the boot repair sub-sequence including DPU snapshot validation, freshness-gated recovery reboots, lockdown unlock/re-lock choreography, BIOS job orchestration, boot-order skipping, and check_host_boot_config/should_wait_for_dpus_before_host_boot_config helpers. Moves reprovision-request clearing to after terminal host-reboot acceptance in RebootHost.
Unassigned host boot repair failure restart detection
crates/machine-controller/src/handler.rs
Adds is_unassigned_dpu_reprovision_host_boot_failure predicate and updates start_dpu_reprovision to rebuild reprovision state from Ready when restarting from a BiosSetupFailed top-level failure.
Instance host-platform config reuse of boot-config logic
crates/machine-controller/src/handler.rs
Rewrites HostPlatformConfigurationState::CheckHostConfig to delegate to check_host_boot_config, and branches PollingBiosSetup to LockHost or SetBootOrder via should_skip_boot_order_remediation.
Redfish simulator lockdown and boot-order readiness
crates/redfish/src/libredfish/test_support.rs
Adds is_boot_order_setup and default_lockdown fields to RedfishSimState. Adds lockdown_states(), set_is_boot_order_setup(), and set_lockdown() public methods. Makes lockdown_bmc stateful and is_boot_order_setup configurable. Initializes new host lockdown from default_lockdown.
Shared host-boot-repair test assertion helpers
crates/api-core/src/tests/dpu_reprovisioning.rs
Adds EnabledDisabled import, ReprovisionHostBootRepairShape enum, reprovision_host_boot_repair_states() builder, has_dpu_reprovision_state() predicate, and assert_dpu_reprovision_host_boot_repair() central verifier covering state ordering, lockdown toggling, Redfish action sequencing, and reprovision-request lifecycle.
Test case refactoring and new boot-repair coverage
crates/api-core/src/tests/dpu_reprovisioning.rs
Replaces explicit RebootHostBmc/RebootHost loops across firmware-upgrade, no-upgrade, assigned-host, multi-DPU, and restart-failed scenarios with assert_dpu_reprovision_host_boot_repair. Adds new tests for BIOS-not-set machine_setup execution, boot-order remediation skip, and an unassigned-host BiosSetupFailed+ForceRestart restart-failed case.
Mellanox ethernet link-type normalization in PXE scripts
crates/api/files/bf.cfg, pxe/templates/user-data
Adds ensure_ethernet_link_type() to both PXE provisioning scripts. Discovers the MST config device, queries LINK_TYPE_P1/LINK_TYPE_P2 via mlxconfig with non-fatal error handling, normalizes to Ethernet (=2), and triggers a reboot+exit when changes are applied to force re-entry with persisted settings.
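
The decision step inside the link-type normalization can be sketched in Rust, although the PR implements it in the provisioning shell scripts. The sketch assumes query output where each parameter line has the shape `LINK_TYPE_P1   IB(1)` and where `2` denotes Ethernet, as the summary states; `ports_needing_ethernet` is a hypothetical helper, and running `mlxconfig` itself and triggering the reboot are out of scope here.

```rust
/// Given text shaped like an `mlxconfig` query listing (assumed format:
/// parameter name, whitespace, value such as `IB(1)` or `ETH(2)`),
/// returns the `NAME=2` assignments needed to normalize to Ethernet.
///
/// Hardware that does not expose LINK_TYPE yields no matching lines,
/// so the result is empty and nothing needs to change.
fn ports_needing_ethernet(query_output: &str) -> Vec<String> {
    query_output
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let key = fields.next()?;
            let value = fields.next()?;
            if !matches!(key, "LINK_TYPE_P1" | "LINK_TYPE_P2") {
                return None;
            }
            if value.ends_with("(2)") {
                // Already Ethernet, no change required.
                None
            } else {
                Some(format!("{key}=2"))
            }
        })
        .collect()
}
```

An empty result means no reboot is needed; a non-empty result corresponds to the case where the script applies the settings and reboots so they take effect.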

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.52% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the primary changes: cloud-init and state-machine fixes addressing a link-type flip issue discovered after DPU replacement. |
| Description check | ✅ Passed | The description comprehensively relates to the changeset, detailing the bug context, cloud-init link-type normalization, DPU reprovision enhancements, and state transition diagrams. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@bcavnvidia bcavnvidia marked this pull request as draft June 12, 2026 15:01
@copy-pr-bot

copy-pr-bot Bot commented Jun 12, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🧹 Nitpick comments (1)
crates/api-model/src/machine/mod.rs (1)

1819-1819: Clarify (or remove) PartialOrd/Ord derives for machine state/config types

serde serialization does not depend on Ord/PartialOrd; it uses declared field/variant structure. In crates/api-model/src/machine/mod.rs, the nearby docs for BiosConfigInfo/BiosConfigState (and the analogous boot-order types) describe job/state tracking, not ordering semantics. Additionally, repo-wide search found no usage of these types as BTreeMap/BTreeSet keys and no direct cmp/partial_cmp/sort_by* calls on them.

If Ord/PartialOrd is meant for a specific purpose (e.g., defining a “progression” ordering for state-machine checks or sorting), document that intent close to the derives. Otherwise, consider removing the PartialOrd/Ord derives from BiosConfigInfo, BiosConfigState, SetBootOrderInfo, SetBootOrderState, UnlockHostState, and PowerState to avoid implying meaningful ordering where none is currently used.
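
For context on what the derives imply: a derived `Ord` on an enum orders variants by declaration order and then compares payload fields in order. The sketch below uses a hypothetical `JobState` enum (not one of the PR's types) to show that ordering, which is the kind of "progression" semantics the comment asks to have documented if it is intended.

```rust
/// Hypothetical state type; variants are declared in progression order.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum JobState {
    Queued,
    Running { attempt: u32 },
    Done,
}

/// Picks the furthest-along state using the derived ordering.
fn furthest(states: &[JobState]) -> Option<&JobState> {
    states.iter().max()
}
```

Reordering the variants in the declaration would silently change every comparison, which is one reason to document the intent next to the derive.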

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-model/src/machine/mod.rs` at line 1819, The derive list including
PartialOrd and Ord on types like BiosConfigInfo, BiosConfigState,
SetBootOrderInfo, SetBootOrderState, UnlockHostState, and PowerState incorrectly
implies meaningful ordering; either remove PartialOrd and Ord from the derive
attribute near the #[derive(Debug, Clone, Serialize, Deserialize, Eq, PartialEq,
PartialOrd, Ord)] line for those types, or alternatively add a concise doc
comment next to each type explaining the intended “progression” ordering
semantics if those comparisons are intentional; update the derives or add
documentation consistently for each referenced type so the intent is clear.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/api-core/src/tests/dpu_reprovisioning.rs`:
- Around line 115-185: The helper assert_dpu_reprovision_host_boot_repair
currently validates intermediate repair states but doesn't enforce the
post-repair invariant that reprovision_requested is cleared; fix by advancing
the machine one more iteration after the state loop (use
machine.next_iteration_machine(env).await) and assert that the returned
Machine's reprovision_requested.is_none(), ensuring the final host
reboot/cleanup clears reprovision_requested; keep this check in
assert_dpu_reprovision_host_boot_repair alongside existing Redfish action
assertions.

In `@crates/machine-controller/src/handler.rs`:
- Around line 2672-2678: The closure update_all_dpus_to_reprovision_state (and
similar fan-out/validation code at the other spots) currently uses
state.dpu_snapshots.iter().map(|dpu| &dpu.id).collect_vec() which expands to
every DPU; change these call sites to carry and use the original reprovision
target set (the IDs from reprovision_requested) instead of all dpu_snapshots so
only reprovision-targeted DPUs are switched/validated; locate uses of
reprovision_state.next_state_with_all_dpus_updated and replace the all-DPU
iterator with the collected IDs from the reprovision_requested subset (pass the
reprovision_targets vector through the fan-out/validation closures/functions) so
host-boot-repair logic runs only for intended DPUs.

In `@crates/redfish/src/libredfish/test_support.rs`:
- Around line 59-60: is_boot_order_setup is currently stored on RedfishSimState
(global) causing cross-host leakage; change its scope to per-host state (e.g.,
add is_boot_order_setup: Option<bool> inside the host representation used by
RedfishSimState or maintain a HashMap keyed by host id) and update
RedfishSimClient call sites that read or write it (notably
set_boot_order_dpu_first and the readiness checks currently using
is_boot_order_setup) to use the host-scoped field or map lookup. Ensure
RedfishSimState, RedfishSimClient, and any helper methods that reference
is_boot_order_setup (including the logic invoked by set_boot_order_dpu_first and
the readiness/skip-remediation path) are updated to use the host-specific value
so one host's operation cannot mark another host as ready.
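
The per-host scoping suggested in the last comment could look roughly like the sketch below. `SimState` and its methods are hypothetical and deliberately much smaller than the real `RedfishSimState`; the host key is assumed to be a string identifier.

```rust
use std::collections::HashMap;

/// Hypothetical simulator state that tracks boot-order readiness per host
/// instead of in one global flag.
#[derive(Debug, Default)]
struct SimState {
    boot_order_setup: HashMap<String, bool>,
}

impl SimState {
    /// Records that the given host now boots from the DPU first.
    fn set_boot_order_dpu_first(&mut self, host_id: &str) {
        self.boot_order_setup.insert(host_id.to_string(), true);
    }

    /// Readiness lookup scoped to one host; unknown hosts are not ready.
    fn is_boot_order_setup(&self, host_id: &str) -> bool {
        self.boot_order_setup.get(host_id).copied().unwrap_or(false)
    }
}
```

With this shape, setting the boot order on one simulated host leaves every other host reporting not ready, which avoids the false positives a shared flag can produce in multi-host tests.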

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5ea446b0-8be2-4c92-86a6-48dfeaf4e69f

📥 Commits

Reviewing files that changed from the base of the PR and between ac32c67 and d10a540.

📒 Files selected for processing (7)
  • crates/api-core/src/tests/dpu_reprovisioning.rs
  • crates/api-model/src/machine/mod.rs
  • crates/api/files/bf.cfg
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/handler/helpers.rs
  • crates/redfish/src/libredfish/test_support.rs
  • pxe/templates/user-data

Comment thread crates/api-core/src/tests/dpu_reprovisioning.rs
Comment thread crates/machine-controller/src/handler.rs Outdated
Comment thread crates/redfish/src/libredfish/test_support.rs
@abvarshney-nv

Contributor

This can be a problem for DPF now. DPF provisioned DPUs do not read cloud-init so these DPUs will never update the link type.

@bcavnvidia

Contributor Author

> This can be a problem for DPF now. DPF provisioned DPUs do not read cloud-init so these DPUs will never update the link type.

@abvarshney-nv Glad you're looking at this so I didn't have to ping you explicitly 😛

I thought I saw it as part of a DPU flavor.

@abvarshney-nv

Contributor

> > This can be a problem for DPF now. DPF provisioned DPUs do not read cloud-init so these DPUs will never update the link type.
>
> @abvarshney-nv Glad you're looking at this so I didn't have to ping you explicitly 😛

lol

> I thought I saw it as part of a DPU flavor.

No, we are not using DPF's feature to change the link, as we were relying on site-explorer. Can't we rely on site-explorer again? Let SE set the link again, and after that the state machine can continue reprovisioning.

@bcavnvidia

bcavnvidia commented Jun 12, 2026

Contributor Author

> No, we are not using DPF's feature to change the link, as we were relying on site-explorer. Can't we rely on site-explorer again? Let SE set the link again, and after that the state machine can continue reprovisioning.

@abvarshney-nv Are you thinking of DPU/NIC mode (different from link-type)?

EDIT:
He was thinking of DPU/NIC mode, and DPF is handling link-mode.

@bcavnvidia bcavnvidia marked this pull request as ready for review June 15, 2026 18:08

@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
crates/api-core/src/tests/dpu_reprovisioning.rs (1)

1584-1589: 💤 Low value

Minor: Timing-sensitive restart detection gate.

The 1ms sleep ensures the restart request timestamp exceeds failed_at. While this is a recognized pattern for timestamp-ordered tests, it introduces slight fragility under heavy system load. The comment at line 1585 appropriately documents the intent.

Consider using a deterministic approach if test flakiness is observed—for example, explicitly setting restart_reprovision_requested_at to a timestamp guaranteed to be after failed_at via a test helper.
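
One way to make the ordering deterministic, as suggested above, is to derive the restart timestamp from `failed_at` instead of sleeping. The helper names below are hypothetical, and how such a timestamp would be injected into the real test fixtures is not shown in this thread.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical test helper: a restart-request time that is guaranteed
/// to be later than `failed_at`, independent of wall-clock timing.
fn restart_requested_after(failed_at: SystemTime) -> SystemTime {
    failed_at + Duration::from_millis(1)
}

/// The gate under test in this sketch: a restart counts only if it was
/// requested strictly after the failure was recorded.
fn is_restart_after_failure(failed_at: SystemTime, requested_at: SystemTime) -> bool {
    requested_at > failed_at
}
```

A test would then set `restart_reprovision_requested_at` from `restart_requested_after(failed_at)` so the comparison holds even under heavy load.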

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/api-core/src/tests/dpu_reprovisioning.rs` around lines 1584 - 1589,
The restart detection gate relies on a 1ms sleep to ensure the restart request
timestamp exceeds failed_at, which can be fragile under heavy system load. If
test flakiness is observed in the trigger_dpu_reprovisioning call with
Mode::Restart, consider implementing a deterministic alternative by creating or
using a test helper that explicitly sets restart_reprovision_requested_at to a
timestamp guaranteed to be after failed_at, rather than depending on timing of
the sleep operation to guarantee the ordering.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 51cc20b0-8a8f-4f7c-a8d5-c51a4dd3fdc7

📥 Commits

Reviewing files that changed from the base of the PR and between d10a540 and b38d80a.

📒 Files selected for processing (6)
  • crates/api-core/src/tests/dpu_reprovisioning.rs
  • crates/api-model/src/machine/mod.rs
  • crates/api/files/bf.cfg
  • crates/machine-controller/src/handler.rs
  • crates/machine-controller/src/handler/helpers.rs
  • crates/redfish/src/libredfish/test_support.rs
💤 Files with no reviewable changes (5)
  • crates/machine-controller/src/handler/helpers.rs
  • crates/api/files/bf.cfg
  • crates/redfish/src/libredfish/test_support.rs
  • crates/api-model/src/machine/mod.rs
  • crates/machine-controller/src/handler.rs

@ajf ajf merged commit e77b8c6 into NVIDIA:main Jun 16, 2026
54 checks passed
@bcavnvidia bcavnvidia deleted the cloud-init-vrf branch June 16, 2026 19:27

3 participants