Devices are not re-prepared on node reboot causing workloads to run with inconsistent device state #951

@shivamerla

Description

The DRA driver currently does not distinguish between a plugin pod restart and a node reboot. During PrepareResourceClaims, if a claim entry exists in the checkpoint and is marked as PrepareCompleted, the driver skips device preparation.

This behavior works for plugin restarts, but is incorrect after a node reboot. Device configurations (e.g., MIG setup or VFIO bindings) are not guaranteed to persist across reboots. As a result, the driver may incorrectly assume devices are already prepared and skip necessary setup.

This leads to workload pods failing due to missing or inconsistent device state. The driver should detect node reboots or otherwise validate device state, and re-run device preparation when the underlying device configuration is no longer consistent with the checkpoint.

Root Cause

After a node reboot, device configuration and the daemons that workloads depend on are not guaranteed to be in the prepared state. Affected state includes:

  • MIG (Multi-Instance GPU) configurations
  • Devices bound to VFIO or other drivers
  • The MPS control daemon, which may not be ready before workloads run
  • The IMEX daemon, which may not be ready before workloads run
  • Time-slicing settings

Despite this, the driver assumes the devices are still in the expected state based on the checkpoint and skips reconfiguration. This results in:

  • Devices not being properly prepared
  • Workload pods failing due to missing or inconsistent device state

Proposed Solutions

To address this issue, we need a reliable way to detect node reboots or validate device state before skipping preparation. Possible approaches include:

Persist Node Boot ID in Checkpoint

  • Store the node’s boot ID (/proc/sys/kernel/random/boot_id) alongside the checkpointed state.
  • During PrepareResourceClaims, compare the current boot ID with the stored one. If they differ, treat it as a node reboot and re-run device preparation.
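
A minimal sketch of the boot ID comparison, to make the skip condition concrete. The `claimEntry` type, its `BootID` field, and the `readBootID` and `needsPrepare` helpers are hypothetical names used for illustration; they are not taken from the driver, and the real checkpoint schema may differ. Only the `/proc/sys/kernel/random/boot_id` path comes from the proposal itself.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// bootIDPath is the kernel-provided file named in the proposal.
const bootIDPath = "/proc/sys/kernel/random/boot_id"

// claimEntry is a hypothetical, simplified checkpoint record. The real
// checkpoint schema in the driver may differ.
type claimEntry struct {
	PrepareCompleted bool
	BootID           string // boot ID recorded when the claim was prepared
}

// readBootID returns the trimmed contents of the boot ID file at path.
func readBootID(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("read boot ID from %s: %w", path, err)
	}
	return strings.TrimSpace(string(data)), nil
}

// needsPrepare reports whether device preparation must run for entry.
// Preparation is skipped only when the claim was fully prepared during
// the current boot. An empty stored boot ID (for example, a checkpoint
// written before the field existed) is treated as stale.
func needsPrepare(entry *claimEntry, currentBootID string) bool {
	if entry == nil || !entry.PrepareCompleted {
		return true
	}
	return entry.BootID == "" || entry.BootID != currentBootID
}
```

In this sketch the caller would read the boot ID once with `readBootID(bootIDPath)` and pass it to `needsPrepare` for each claim. A plugin pod restart keeps the same boot ID and skips preparation, while a node reboot changes it and forces preparation to run again.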

Validate Actual Device State

  • Instead of depending on the checkpoint state, verify that the current GPU/device configuration matches the expected prepared state. If there is a mismatch (e.g., missing MIG partitions or incorrect driver bindings), trigger re-configuration.
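
A rough sketch of what the state comparison could look like, assuming the expected state is recorded in the checkpoint and the observed state has already been collected by some device query. The `deviceState` type and the `stateMismatches` helper are hypothetical, and the sketch covers only driver binding and MIG profiles; how the observed state is actually gathered is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// deviceState is a hypothetical snapshot of the configuration that
// preparation is expected to leave behind for one device.
type deviceState struct {
	BoundDriver string   // e.g. "nvidia" or "vfio-pci"
	MIGProfiles []string // e.g. "1g.5gb"; order is not significant
}

// stateMismatches describes every difference between the state the
// checkpoint expects and the state observed on the node. An empty
// result means preparation can be skipped.
func stateMismatches(expected, actual deviceState) []string {
	var diffs []string
	if expected.BoundDriver != actual.BoundDriver {
		diffs = append(diffs, fmt.Sprintf("driver: want %q, got %q",
			expected.BoundDriver, actual.BoundDriver))
	}
	want := sortedCopy(expected.MIGProfiles)
	got := sortedCopy(actual.MIGProfiles)
	if !equalStrings(want, got) {
		diffs = append(diffs, fmt.Sprintf("MIG profiles: want %v, got %v", want, got))
	}
	return diffs
}

// sortedCopy returns a sorted copy so the caller's slice is not modified.
func sortedCopy(in []string) []string {
	out := append([]string(nil), in...)
	sort.Strings(out)
	return out
}

// equalStrings reports whether a and b hold the same elements in order.
func equalStrings(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}
```

Returning the list of mismatches rather than a boolean lets the driver log why a claim is being re-prepared. Daemon readiness (MPS, IMEX) would need a separate check, since it is not a static device property.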

Labels: kind/bug (categorizes issue or PR as related to a bug)

Status: Backlog