Checkpoint race condition in CD kubelet-plugin cleanup can cause data corruption #839

@klueska

Description

The CheckpointCleanupManager in cmd/compute-domain-kubelet-plugin/cleanup.go has a race condition when reading the checkpoint file. The periodic cleanup routine calls getCheckpoint() without holding the DeviceState lock, while Prepare() and Unprepare() operations hold the lock when accessing the checkpoint.

Race scenario:

  1. Cleanup goroutine reads checkpoint (no lock held)
  2. Prepare/Unprepare operation modifies checkpoint (lock held)
  3. Checkpoint file becomes inconsistent

Impact

  • Data corruption: Checkpoint file can become corrupted with partial/inconsistent data
  • Lost claims: Prepare/Unprepare operations may fail due to corrupt checkpoint
  • Resource leaks: Stale checkpoint entries may never be cleaned up if cleanup crashes mid-read

Root Cause

The checkpoint cleanup mechanism was added in commit f7a3310 (Oct 2025) but didn't follow the locking pattern established by Prepare() and Unprepare(). The DeviceState struct has an embedded sync.Mutex specifically to protect checkpoint access, but cleanup didn't use it.

Labels

kind/bug: Categorizes issue or PR as related to a bug.
