The CheckpointCleanupManager in cmd/compute-domain-kubelet-plugin/cleanup.go has a race condition when reading the checkpoint file. The periodic cleanup routine calls getCheckpoint() without holding the DeviceState lock, while Prepare() and Unprepare() operations hold the lock when accessing the checkpoint.
Race scenario:
- Cleanup goroutine reads checkpoint (no lock held)
- Prepare/Unprepare operation modifies checkpoint (lock held)
- Checkpoint file becomes inconsistent
Impact
- Data corruption: Checkpoint file can become corrupted with partial/inconsistent data
- Lost claims: Prepare/Unprepare operations may fail due to a corrupt checkpoint
- Resource leaks: Stale checkpoint entries may never be cleaned up if cleanup crashes mid-read
Root Cause
The checkpoint cleanup mechanism was added in commit f7a3310 (Oct 2025) but didn't follow the locking pattern established by Prepare() and Unprepare(). The DeviceState struct has an embedded sync.Mutex specifically to protect checkpoint access, but cleanup didn't use it.
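The locking pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the plugin's actual code. The Checkpoint type, its PreparedClaims field, the in-memory checkpoint field (standing in for the checkpoint file), the NewDeviceState constructor, the snapshot() helper, and all method signatures are assumptions made for the example. Only the names DeviceState, CheckpointCleanupManager, getCheckpoint, Prepare, Unprepare, and the embedded sync.Mutex come from the description.

```go
package main

import (
	"fmt"
	"sync"
)

// Checkpoint is a hypothetical stand-in for the checkpoint contents.
type Checkpoint struct {
	PreparedClaims map[string]bool
}

// DeviceState embeds a sync.Mutex that guards all checkpoint access.
// The in-memory field stands in for the on-disk checkpoint file.
type DeviceState struct {
	sync.Mutex
	checkpoint Checkpoint
}

// NewDeviceState is a hypothetical constructor for this sketch.
func NewDeviceState() *DeviceState {
	return &DeviceState{checkpoint: Checkpoint{PreparedClaims: map[string]bool{}}}
}

// getCheckpoint returns a copy of the checkpoint. Callers must hold the lock.
func (s *DeviceState) getCheckpoint() Checkpoint {
	cp := Checkpoint{PreparedClaims: make(map[string]bool, len(s.checkpoint.PreparedClaims))}
	for uid, prepared := range s.checkpoint.PreparedClaims {
		cp.PreparedClaims[uid] = prepared
	}
	return cp
}

// Prepare records a claim while holding the lock.
func (s *DeviceState) Prepare(claimUID string) {
	s.Lock()
	defer s.Unlock()
	s.checkpoint.PreparedClaims[claimUID] = true
}

// Unprepare removes a claim while holding the lock.
func (s *DeviceState) Unprepare(claimUID string) {
	s.Lock()
	defer s.Unlock()
	delete(s.checkpoint.PreparedClaims, claimUID)
}

// CheckpointCleanupManager periodically inspects the checkpoint.
type CheckpointCleanupManager struct {
	state *DeviceState
}

// snapshot (hypothetical helper) takes the DeviceState lock before reading,
// so cleanup follows the same pattern as Prepare and Unprepare.
func (m *CheckpointCleanupManager) snapshot() Checkpoint {
	m.state.Lock()
	defer m.state.Unlock()
	return m.state.getCheckpoint()
}

func main() {
	state := NewDeviceState()
	cleanup := &CheckpointCleanupManager{state: state}

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			state.Prepare(fmt.Sprintf("claim-%d", i))
			_ = cleanup.snapshot() // concurrent read, serialized by the mutex
		}(i)
	}
	wg.Wait()
	fmt.Println(len(cleanup.snapshot().PreparedClaims))
}
```

Running a sketch like this with `go run -race` is one way to confirm that the reads and writes are serialized. Removing the Lock/Unlock pair from snapshot() should make the race detector report the unsynchronized map access.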