Hi Cluster API Provider IBMCloud maintainers,
I think there may be a small race/idempotency issue in the
IBMPowerVSMachine deletion path. This looks similar in shape to the
IBMVPCMachine delete issue I reported separately, but it goes through the
PowerVS controller and service wrapper.
When an IBMPowerVSMachine is being deleted, the reconciler reads the machine
object, dispatches to reconcileDelete, calls scope.DeleteMachine(), and
removes the finalizer only after the external delete succeeds. The finalizer
removal is then persisted by the deferred patch helper.
If the next reconcile runs before the controller-runtime cache has observed the
finalizer-removal patch, it can read the previous cached object where
deletionTimestamp, the finalizer, and Status.InstanceID are still present.
In that case, the deletion path can call the PowerVS DeleteInstance API a
second time for the already deleted external instance.
The path I am looking at is:
if !ibmPowerVSMachine.ObjectMeta.DeletionTimestamp.IsZero() {
return r.reconcileDelete(ctx, machineScope)
}
and in the delete path:
defer func() {
if reterr == nil {
controllerutil.RemoveFinalizer(scope.IBMPowerVSMachine, infrav1.IBMPowerVSMachineFinalizer)
}
}()
if scope.IBMPowerVSMachine.Status.InstanceID == "" {
return ctrl.Result{}, nil
}
if err := scope.DeleteMachine(); err != nil {
return ctrl.Result{}, fmt.Errorf("error deleting IBMPowerVSMachine %v: %w", ...)
}
The PowerVS delete call uses Status.InstanceID and returns delete errors
directly:
func (s *Service) DeleteInstance(id string) error {
return s.instanceClient.Delete(id)
}
A possible interleaving is:
first reconcile:
cached IBMPowerVSMachine has deletionTimestamp, finalizer, and Status.InstanceID
DeleteInstance(Status.InstanceID) succeeds
RemoveFinalizer mutates the object
deferred patch persists the finalizer removal
second reconcile before the cache observes the finalizer-removal patch:
cached IBMPowerVSMachine still has deletionTimestamp, finalizer, and Status.InstanceID
DeleteInstance(Status.InstanceID) is called again
provider returns instance-not-found
delete error propagates and the finalizer remains
I reproduced this interleaving with a small adversarial cache-lag model. In that
model, the second reconcile fails with:
DeleteInstance(powervs-instance-id) returned instance-not-found
The existing PowerVS tests also seem to model delete errors as fatal in this
path. For example, a DeleteInstance error is treated as a non-nil
DeleteMachine error.
A minimal idempotency fix may be to treat provider instance-not-found errors as
successful cleanup in the IBMPowerVSMachine deletion path. A stronger
controller-level fix would be to avoid running the external delete again until
the cache has observed the finalizer-removal patch from the previous successful
delete. The idempotency fix seems useful on its own because repeated cleanup
attempts can happen during normal controller retries/cache lag.
Please let me know if I am missing an intended behavior here. I can also provide
the small reproduction model if helpful.
Hi Cluster API Provider IBMCloud maintainers,
I think there may be a small race/idempotency issue in the
IBMPowerVSMachinedeletion path. This looks similar in shape to theIBMVPCMachinedelete issue I reported separately, but it goes through thePowerVS controller and service wrapper.
When an
IBMPowerVSMachineis being deleted, the reconciler reads the machineobject, dispatches to
reconcileDelete, callsscope.DeleteMachine(), andremoves the finalizer only after the external delete succeeds. The finalizer
removal is then persisted by the deferred patch helper.
If the next reconcile runs before the controller-runtime cache has observed the
finalizer-removal patch, it can read the previous cached object where
deletionTimestamp, the finalizer, andStatus.InstanceIDare still present.In that case, the deletion path can call the PowerVS
DeleteInstanceAPI asecond time for the already deleted external instance.
The path I am looking at is:
and in the delete path:
The PowerVS delete call uses
Status.InstanceIDand returns delete errorsdirectly:
A possible interleaving is:
I reproduced this interleaving with a small adversarial cache-lag model. In that
model, the second reconcile fails with:
The existing PowerVS tests also seem to model delete errors as fatal in this
path. For example, a
DeleteInstanceerror is treated as a non-nilDeleteMachineerror.A minimal idempotency fix may be to treat provider instance-not-found errors as
successful cleanup in the
IBMPowerVSMachinedeletion path. A strongercontroller-level fix would be to avoid running the external delete again until
the cache has observed the finalizer-removal patch from the previous successful
delete. The idempotency fix seems useful on its own because repeated cleanup
attempts can happen during normal controller retries/cache lag.
Please let me know if I am missing an intended behavior here. I can also provide
the small reproduction model if helpful.