Hi Cluster API Provider IBMCloud maintainers,
I think there may be a small race/idempotency issue in the IBMVPCMachine
deletion path.
When an IBMVPCMachine is being deleted, the reconciler reads the machine
object, dispatches to reconcileDelete, calls scope.DeleteMachine(), and
removes the finalizer only after the external delete succeeds. The finalizer
removal is then persisted by the deferred patch helper.
If the next reconcile runs before the controller-runtime cache has observed the
finalizer-removal patch, it can read the previous cached object where
deletionTimestamp, the finalizer, and Status.InstanceID are still present.
In that case, the deletion path can call the VPC DeleteInstance API a second
time for the already deleted external instance.
The path I am looking at is:
if !ibmVPCMachine.DeletionTimestamp.IsZero() {
return r.reconcileDelete(ctx, machineScope)
}
and in the delete path:
if err := scope.DeleteMachine(); err != nil {
return ctrl.Result{}, fmt.Errorf("error deleting IBMVPCMachine %s/%s: %w", ...)
}
defer func() {
if reterr == nil {
controllerutil.RemoveFinalizer(scope.IBMVPCMachine, infrav1.MachineFinalizer)
}
}()
The VPC delete call uses Status.InstanceID and returns the provider error:
options := &vpcv1.DeleteInstanceOptions{}
options.SetID(m.IBMVPCMachine.Status.InstanceID)
_, err := m.IBMVPCClient.DeleteInstance(options)
return err
A possible interleaving is:
first reconcile:
cached IBMVPCMachine has deletionTimestamp, finalizer, and Status.InstanceID
DeleteInstance(Status.InstanceID) succeeds
RemoveFinalizer mutates the object
deferred patch persists the finalizer removal
second reconcile before the cache observes the finalizer-removal patch:
cached IBMVPCMachine still has deletionTimestamp, finalizer, and Status.InstanceID
DeleteInstance(Status.InstanceID) is called again
provider returns instance-not-found
delete error propagates and the finalizer remains
I reproduced this interleaving with a small adversarial cache-lag model. In that
model, the second reconcile fails with:
DeleteInstance(capi-instance-id) returned instance-not-found
The existing tests also seem to model delete errors as fatal in this path. For
example, DeleteMachine returns a non-nil error when the VPC DeleteInstance
call returns an error, and the controller tests distinguish successful deletion
from delete errors when deciding whether the finalizer is removed.
A minimal idempotency fix may be to treat provider instance-not-found errors as
successful cleanup in the IBMVPCMachine deletion path. A stronger
controller-level fix would be to avoid running the external delete again until
the cache has observed the finalizer-removal patch from the previous successful
delete. The idempotency fix seems useful on its own because repeated cleanup
attempts can happen during normal controller retries/cache lag.
Please let me know if I am missing an intended behavior here. I can also provide
the small reproduction model if helpful.
Hi Cluster API Provider IBMCloud maintainers,
I think there may be a small race/idempotency issue in the
IBMVPCMachinedeletion path.
When an
IBMVPCMachineis being deleted, the reconciler reads the machineobject, dispatches to
reconcileDelete, callsscope.DeleteMachine(), andremoves the finalizer only after the external delete succeeds. The finalizer
removal is then persisted by the deferred patch helper.
If the next reconcile runs before the controller-runtime cache has observed the
finalizer-removal patch, it can read the previous cached object where
deletionTimestamp, the finalizer, andStatus.InstanceIDare still present.In that case, the deletion path can call the VPC
DeleteInstanceAPI a secondtime for the already deleted external instance.
The path I am looking at is:
and in the delete path:
The VPC delete call uses
Status.InstanceIDand returns the provider error:A possible interleaving is:
I reproduced this interleaving with a small adversarial cache-lag model. In that
model, the second reconcile fails with:
The existing tests also seem to model delete errors as fatal in this path. For
example,
DeleteMachinereturns a non-nil error when the VPCDeleteInstancecall returns an error, and the controller tests distinguish successful deletion
from delete errors when deciding whether the finalizer is removed.
A minimal idempotency fix may be to treat provider instance-not-found errors as
successful cleanup in the
IBMVPCMachinedeletion path. A strongercontroller-level fix would be to avoid running the external delete again until
the cache has observed the finalizer-removal patch from the previous successful
delete. The idempotency fix seems useful on its own because repeated cleanup
attempts can happen during normal controller retries/cache lag.
Please let me know if I am missing an intended behavior here. I can also provide
the small reproduction model if helpful.