Description
Environment
- seaweedfs-csi-driver version: v1.3.0 (latest)
- seaweedfs version: chrislusf/seaweedfs:3.71_full
Problem
When the FUSE mount process dies (e.g., weed mount is killed), the CSI driver's internal state (ns.volumes map) still believes the volume is staged. Subsequent NodeStageVolume calls return early with "volume ... has been already staged", even though the mountpoint is broken (e.g., "Transport endpoint is not connected"). This leads to persistent pod and application failures, as the broken mount is never cleaned up or restaged.
Example log sequence:
First, the logs start with:
E0828 16:56:08.158739 mounter.go:75 weed mount exit, pid: 26, path: /local/csi/staging/default/default/rw-file-system-multi-node-multi-writer, error: signal: killed
Then my apps start crashing because the volume is gone, and I get OSError: [Errno 107] Transport endpoint is not connected: '/remote/spot'
That is then followed by:
I0828 16:57:57.649996 nodeserver.go:175 volume default successfully unpublished from /local/csi/per-alloc/...
Then, the app restarts and requests a claim, prompting the controller to request the volume 'default' for the machine:
I0828 16:58:23.664292 controllserver.go:94 controller publish volume req, volume: default, node: MI-04C
Then, the nodeserver tries to stage the volume:
I0828 16:58:23.666640 nodeserver.go:34 node stage volume default to /local/csi/staging/default/default/rw-file-system-multi-node-multi-writer
However, we get a response that the volume has already been staged:
I0828 16:58:23.666752 nodeserver.go:56 volume default has been already staged
Therefore, the volume is never re-staged, and the mount remains broken.
Root Cause
// The volume has been staged.
if _, ok := ns.volumes.Load(volumeID); ok {
	glog.Infof("volume %s has been already staged", volumeID)
	return &csi.NodeStageVolumeResponse{}, nil
}
The staging logic only checks the in-memory ns.volumes map, not the actual state of the mountpoint. If the mount is dead, the volume stays stuck as "already staged" until the node driver restarts or NodeUnstageVolume is called manually.
Proposed Solution
Before returning early in NodeStageVolume due to an entry in ns.volumes, the driver should validate that the staging path is actually a healthy mount. If it is not (e.g., not a mount, or kernel reports errors like 'transport endpoint is not connected'), it should clean up the broken mount, remove the entry from ns.volumes, and proceed to restage the volume.
Suggested code snippet (pseudocode):
if _, ok := ns.volumes.Load(volumeID); ok {
	notMnt, err := mountutil.IsLikelyNotMountPoint(stagingTargetPath)
	if err != nil {
		if mount.IsCorruptedMnt(err) {
			glog.Warningf("staging path %s is a corrupted mount: %v; cleaning up", stagingTargetPath, err)
			_ = mount.CleanupMountPoint(stagingTargetPath, mountutil, true)
			ns.volumes.Delete(volumeID)
		} else {
			return nil, status.Errorf(codes.Internal, "failed to check mountpoint %s: %v", stagingTargetPath, err)
		}
	} else if !notMnt {
		glog.Infof("volume %s has been already staged", volumeID)
		return &csi.NodeStageVolumeResponse{}, nil
	} else {
		glog.Warningf("staging path %s exists but is not mounted; will restage", stagingTargetPath)
		ns.volumes.Delete(volumeID)
	}
}
This should allow the driver to recover from FUSE crashes without requiring node restarts or manual intervention. Similar logic could be added to NodePublishVolume to ensure the staging mount is healthy before bind-mounting.