The DRA driver currently does not distinguish between a plugin pod restart and a node reboot. During PrepareResourceClaims, if a claim entry exists in the checkpoint and is marked as PrepareCompleted, the driver skips device preparation.
This behavior works for plugin restarts, but is incorrect after a node reboot. Device configurations (e.g., MIG setup or VFIO bindings) are not guaranteed to persist across reboots. As a result, the driver may incorrectly assume devices are already prepared and skip necessary setup.
This leads to workload pods failing due to missing or inconsistent device state. The driver should detect node reboots or otherwise validate device state, and re-run device preparation when the underlying device configuration is no longer consistent with the checkpoint.
Root Cause
After a node reboot, device configurations are not guaranteed to persist. This includes cases such as:
- MIG (Multi-Instance GPU) configurations
- Devices bound to VFIO or other drivers
- MPS control daemon not yet ready when workloads start
- IMEX daemon not yet ready when workloads start
- Timeslicing settings
Despite this, the driver assumes the devices are still in the expected state based on the checkpoint and skips reconfiguration. This results in:
- Devices not being properly prepared
- Workload pods failing due to missing or inconsistent device state
Proposed Solutions
To address this issue, we need a reliable way to detect node reboots or validate device state before skipping preparation. Possible approaches include:
Persist Node Boot ID in Checkpoint
- Store the node’s boot ID (/proc/sys/kernel/random/boot_id) alongside the checkpointed state.
- During PrepareResourceClaims, compare the current boot ID with the stored one. If they differ, treat it as a node reboot and re-run device preparation.
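The boot ID comparison could be sketched roughly as follows. This is a minimal illustration, not the driver's actual code: the `Checkpoint` struct, its `BootID` and `PreparedClaims` fields, and the `needsPrepare` helper are hypothetical names, and the real checkpoint layout is likely different. Only the procfs path comes from the proposal above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// bootIDPath is the Linux procfs file that holds a random UUID regenerated on every boot.
const bootIDPath = "/proc/sys/kernel/random/boot_id"

// Checkpoint is a hypothetical, simplified stand-in for the driver's checkpoint.
type Checkpoint struct {
	BootID         string            // boot ID recorded when the checkpoint was written
	PreparedClaims map[string]string // claim UID -> preparation state
}

// readBootID returns the trimmed contents of the boot ID file at path.
func readBootID(path string) (string, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("read boot ID: %w", err)
	}
	return strings.TrimSpace(string(b)), nil
}

// needsPrepare reports whether device preparation must run for claimUID.
// A boot ID mismatch invalidates every checkpointed claim, because device
// state may not have survived the reboot.
func needsPrepare(cp *Checkpoint, claimUID, currentBootID string) bool {
	if cp.BootID != currentBootID {
		return true
	}
	return cp.PreparedClaims[claimUID] != "PrepareCompleted"
}

func main() {
	bootID, err := readBootID(bootIDPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Assumed example: a checkpoint written during an earlier boot.
	cp := &Checkpoint{
		BootID:         "stale-boot-id",
		PreparedClaims: map[string]string{"claim-1": "PrepareCompleted"},
	}
	fmt.Println("re-run preparation:", needsPrepare(cp, "claim-1", bootID))
}
```

In this sketch, a checkpoint written before the field existed has an empty `BootID`, so it never matches the current boot and preparation is re-run once after upgrade. That errs on the side of redoing setup rather than skipping it.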
Validate Actual Device State
- Instead of depending on the checkpoint state, verify that the current GPU/device configuration matches the expected prepared state. If there is a mismatch (e.g., missing MIG partitions or incorrect driver bindings), trigger re-configuration.