Devices are not re-prepared on node reboot causing workloads to run with inconsistent device state

The DRA driver currently does not distinguish between a plugin pod restart and a node reboot. During `PrepareResourceClaims`, if a claim entry exists in the checkpoint and is marked as `PrepareCompleted`, the driver skips device preparation.

This behavior works for plugin restarts, but is incorrect after a node reboot. Device configurations (e.g., MIG setup or VFIO bindings) are not guaranteed to persist across reboots. As a result, the driver may incorrectly assume devices are already prepared and skip necessary setup.

This leads to workload pods failing due to missing or inconsistent device state. The driver should detect node reboots or otherwise validate device state, and re-run device preparation when the underlying device configuration is no longer consistent with the checkpoint.

**Root Cause**

After a node reboot, device configurations are not guaranteed to persist. This includes cases such as:

* MIG (Multi-Instance GPU) configurations
* Devices bound to VFIO or other drivers
* The MPS control daemon, which may not be ready before workloads start
* The IMEX daemon, which may not be ready before workloads start
* Time-slicing settings

Despite this, the driver assumes the devices are still in the expected state based on the checkpoint and skips reconfiguration. This results in:

* Devices not being properly prepared
* Workload pods failing due to missing or inconsistent device state

**Proposed Solutions**

To address this issue, we need a reliable way to detect node reboots or validate device state before skipping preparation. Possible approaches include:

**Persist Node Boot ID in Checkpoint**
* Store the node’s boot ID (`/proc/sys/kernel/random/boot_id`) alongside the checkpointed state.
* During prepare (`PrepareResourceClaims`), compare the current boot ID with the stored one. If they differ, treat it as a node reboot and re-run device preparation, even when the entry is marked `PrepareCompleted`.
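
The boot ID comparison could look roughly like the following. This is a minimal sketch, not the driver's actual code: `PreparedClaim`, `readBootID`, and `canSkipPrepare` are hypothetical names, and the checkpoint layout is assumed. It also assumes that an entry with no recorded boot ID (for example, one written before this change) should not be trusted.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// bootIDPath is the Linux procfs file holding a UUID that is regenerated on every boot.
const bootIDPath = "/proc/sys/kernel/random/boot_id"

// PreparedClaim is a hypothetical checkpoint entry, not the driver's real type.
type PreparedClaim struct {
	ClaimUID         string
	PrepareCompleted bool
	BootID           string // boot ID recorded when preparation finished
}

// readBootID returns the boot ID stored at path with surrounding whitespace trimmed.
func readBootID(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("read boot ID from %s: %w", path, err)
	}
	return strings.TrimSpace(string(data)), nil
}

// canSkipPrepare reports whether a checkpointed claim can still be treated as
// prepared. It returns false when preparation never completed, when no boot ID
// was recorded (assumption: such entries are untrusted), or when the node has
// rebooted since the entry was written.
func canSkipPrepare(c PreparedClaim, currentBootID string) bool {
	if !c.PrepareCompleted {
		return false
	}
	return c.BootID != "" && c.BootID == currentBootID
}
```

In the prepare path, a caller would read the boot ID once with `readBootID(bootIDPath)` and consult `canSkipPrepare` for each checkpoint entry before deciding to skip device setup.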

**Validate Actual Device State**
* Instead of depending on the checkpoint state, verify that the current GPU/device configuration matches the expected prepared state. If there is a mismatch (e.g., missing MIG partitions or incorrect driver bindings), trigger re-configuration.
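
As an illustration of the validation approach, the sketch below compares an expected per-device state against an observed one and returns the devices that need re-preparation. The `DeviceState` type, its fields, and both function names are hypothetical. How the observed state is collected (for example via NVML or sysfs) is left to the caller and is not shown.

```go
package main

import "sort"

// DeviceState is a hypothetical snapshot of one device's configuration.
type DeviceState struct {
	BoundDriver string         // kernel driver the device is bound to, e.g. "vfio-pci"
	MIGProfiles map[string]int // MIG profile name -> number of instances
}

// stateMatches reports whether observed satisfies expected: the bound driver
// must be identical, and every expected MIG profile must be present with at
// least the expected instance count.
func stateMatches(expected, observed DeviceState) bool {
	if expected.BoundDriver != observed.BoundDriver {
		return false
	}
	for profile, want := range expected.MIGProfiles {
		if observed.MIGProfiles[profile] < want {
			return false
		}
	}
	return true
}

// devicesToReprepare returns, in sorted order, the names of devices whose
// observed state is missing or does not match the checkpointed expectation.
func devicesToReprepare(expected, observed map[string]DeviceState) []string {
	var stale []string
	for name, want := range expected {
		got, ok := observed[name]
		if !ok || !stateMatches(want, got) {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}
```

Unlike the boot ID check, this kind of comparison would also catch state that was lost without a reboot, at the cost of having to query each device on every prepare call.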


Devices are not re-prepared on node reboot causing workloads to run with inconsistent device state #951

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Devices are not re-prepared on node reboot causing workloads to run with inconsistent device state #951

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions