On a 3-node worker cluster, the LocalVolumeSet localblock created only two PersistentVolumes even though
discovery found an eligible disk on each worker.
What we observed
• LocalVolumeSet status: DiskMaker: 1/3 Unavailable, totalProvisionedDeviceCount: 2
• diskmaker-manager DaemonSet: 2/3 pods Ready
• On one worker (e.g., compute-1), diskmaker-manager stayed 0/2, ContainerCreating, for a long time
• oc describe pod showed a Pulling event for
registry.redhat.io/openshift4/ose-local-storage-diskmaker-rhel9@sha256:…
and the image never transitioned to Pulled / containers never started
• LocalVolumeDiscovery on all three nodes still showed /dev/sdb (or equivalent) as Available — so this was
not a “no disk on the node” problem
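The observations above can be reproduced with commands along these lines (resource names assume a default LSO install in openshift-local-storage; the DaemonSet name diskmaker-manager and the CR names are assumptions, adjust to your cluster):

```shell
# LocalVolumeSet status, including totalProvisionedDeviceCount and DiskMaker conditions
oc get localvolumeset localblock -n openshift-local-storage -o yaml

# DaemonSet readiness (expect DESIRED 3 / READY 2 in the broken state)
oc get daemonset diskmaker-manager -n openshift-local-storage

# Find the stuck pod and the node it is scheduled on, then check its events
oc get pods -n openshift-local-storage -o wide
oc describe pod <stuck-diskmaker-pod> -n openshift-local-storage   # look for a Pulling event with no Pulled

# Per-node disk discovery results (disks should show status Available)
oc get localvolumediscoveryresults -n openshift-local-storage -o yaml
```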
Impact
• Only two localblock PVs exist until DiskMaker runs on every node that should contribute disks
• Anything expecting one PV per worker (e.g., ODF / Ceph) can be short on storage
What fixed it (workaround)
1. Force-delete the stuck pod so the DaemonSet recreates it:
   oc delete pod <stuck-diskmaker-pod> -n openshift-local-storage --force --grace-period=0
2. After the new pod was 2/2 Running, LSO logged “found possible matching disk, waiting 1m0s to claim” on that
   node; after ~1 minute the third PV appeared.
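Recovery can be confirmed with something like the following (the DaemonSet name diskmaker-manager is assumed from a default LSO install):

```shell
# Wait until the DaemonSet reports all pods ready again
oc rollout status ds/diskmaker-manager -n openshift-local-storage

# Watch for the "found possible matching disk, waiting 1m0s to claim" message
oc logs -f ds/diskmaker-manager -n openshift-local-storage --all-containers | grep -i claim

# The third localblock PV should appear roughly a minute later
oc get pv -w
```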
Suggested follow-up / investigation
• Why CRI-O / kubelet on the affected node got stuck pulling the LSO images (network, registry auth, node disk,
CRI-O bug, etc.)
• Whether timeouts or retries for long pulls need tuning, or if this should be documented as a known recovery
step
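For the pull investigation, a sketch of the usual on-node checks (run from a debug shell on the affected node; node name compute-1 is taken from the example above):

```shell
oc debug node/compute-1
chroot /host

# Container and image state as CRI-O sees it
crictl ps -a | grep diskmaker
crictl images | grep local-storage

# CRI-O / kubelet logs around the stuck pull (network, auth, or registry errors)
journalctl -u crio -u kubelet --since "1 hour ago" | grep -i pull

# Rule out node disk pressure on the container storage mount
df -h /var/lib/containers
```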
Environment (fill in)
vSphere LSO 4.22
See the related vSphere LSO deployment: https://jenkins-csb-odf-qe-ocs4.dno.corp.redhat.com/job/qe-deploy-ocs-cluster/66778/