Commit 1fd4356

reclaimspacejob: requeue with delay when node client is not found
When a ReclaimSpaceJob targets a PVC whose application pod has been deleted, the VolumeAttachment may still show the volume as attached while the node's CSI addons sidecar connection is no longer available in the connection pool. This causes nodeReclaimSpace to fail with "node client not found".

Previously, this error was returned directly to controller-runtime, which requeues with a fast rate limiter (~5ms exponential backoff). All 6 retries would exhaust in ~315ms, far too quickly for the VolumeAttachment to be updated, causing the job to fail immediately.

Requeue with a 30-second interval for this specific error, giving sufficient time for the VolumeAttachment to be cleaned up. On subsequent reconciles, getTargetDetails will no longer find the stale VolumeAttachment and the job can proceed with controller-only reclaim space.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
1 parent fae2bcc commit 1fd4356

1 file changed

Lines changed: 17 additions & 1 deletion

internal/controller/csiaddons/reclaimspacejob_controller.go

@@ -59,8 +59,17 @@ const (
 	// failed reason type.
 	// TODO: add more useful reason types.
 	reasonFailed = "failed"
+
+	// nodeClientRequeueInterval is the interval to requeue when the
+	// node client is not found in the connection pool, allowing time
+	// for the VolumeAttachment to be updated.
+	nodeClientRequeueInterval = 30 * time.Second
 )
 
+// errNodeClientNotFound is a sentinel error returned when the node
+// client for the given nodeID is not found in the connection pool.
+var errNodeClientNotFound = errors.New("node client not found")
+
 // ReclaimSpaceJobReconciler reconciles a ReclaimSpaceJob object.
 type ReclaimSpaceJobReconciler struct {
 	client.Client
@@ -159,6 +168,13 @@ func (r *ReclaimSpaceJobReconciler) Reconcile(ctx context.Context, req ctrl.Requ
 		return ctrl.Result{}, nil
 	}
 
+	// If the node client is not found, requeue after an interval
+	// to allow the VolumeAttachment to be updated instead of using
+	// the fast rate-limited requeue.
+	if errors.Is(err, errNodeClientNotFound) {
+		return ctrl.Result{RequeueAfter: nodeClientRequeueInterval}, nil
+	}
+
 	return ctrl.Result{}, err
 }
 
@@ -445,7 +461,7 @@ func (r *ReclaimSpaceJobReconciler) nodeReclaimSpace(
 		return nil, err
 	}
 	if nodeClient == nil {
-		return nil, fmt.Errorf("node Client not found for %q nodeID", target.nodeID)
+		return nil, fmt.Errorf("%w for %q nodeID", errNodeClientNotFound, target.nodeID)
 	}
 	*logger = logger.WithValues("nodeClient", clientName)
 