Skip to content

IBMVPCMachine deletion can fail on repeated instance delete when cached finalizer/status is stale #2815

Description

@ChaoHuang2018

Hi Cluster API Provider IBMCloud maintainers,

I think there may be a small race/idempotency issue in the IBMVPCMachine
deletion path.

When an IBMVPCMachine is being deleted, the reconciler reads the machine
object, dispatches to reconcileDelete, calls scope.DeleteMachine(), and
removes the finalizer only after the external delete succeeds. The finalizer
removal is then persisted by the deferred patch helper.

If the next reconcile runs before the controller-runtime cache has observed the
finalizer-removal patch, it can read the previous cached object where
deletionTimestamp, the finalizer, and Status.InstanceID are still present.
In that case, the deletion path can call the VPC DeleteInstance API a second
time for the already deleted external instance.

The path I am looking at is:

if !ibmVPCMachine.DeletionTimestamp.IsZero() {
    return r.reconcileDelete(ctx, machineScope)
}

and in the delete path:

if err := scope.DeleteMachine(); err != nil {
    return ctrl.Result{}, fmt.Errorf("error deleting IBMVPCMachine %s/%s: %w", ...)
}

defer func() {
    if reterr == nil {
        controllerutil.RemoveFinalizer(scope.IBMVPCMachine, infrav1.MachineFinalizer)
    }
}()

The VPC delete call uses Status.InstanceID and returns the provider error:

options := &vpcv1.DeleteInstanceOptions{}
options.SetID(m.IBMVPCMachine.Status.InstanceID)
_, err := m.IBMVPCClient.DeleteInstance(options)
return err

A possible interleaving is:

first reconcile:
  cached IBMVPCMachine has deletionTimestamp, finalizer, and Status.InstanceID
  DeleteInstance(Status.InstanceID) succeeds
  RemoveFinalizer mutates the object
  deferred patch persists the finalizer removal

second reconcile before the cache observes the finalizer-removal patch:
  cached IBMVPCMachine still has deletionTimestamp, finalizer, and Status.InstanceID
  DeleteInstance(Status.InstanceID) is called again
  provider returns instance-not-found
  delete error propagates and the finalizer remains

I reproduced this interleaving with a small adversarial cache-lag model. In that
model, the second reconcile fails with:

DeleteInstance(capi-instance-id) returned instance-not-found

The existing tests also seem to model delete errors as fatal in this path. For
example, DeleteMachine returns a non-nil error when the VPC DeleteInstance
call returns an error, and the controller tests distinguish successful deletion
from delete errors when deciding whether the finalizer is removed.

A minimal idempotency fix may be to treat provider instance-not-found errors as
successful cleanup in the IBMVPCMachine deletion path. A stronger
controller-level fix would be to avoid running the external delete again until
the cache has observed the finalizer-removal patch from the previous successful
delete. The idempotency fix seems useful on its own because repeated cleanup
attempts can happen during normal controller retries/cache lag.

Please let me know if I am missing an intended behavior here. I can also provide
the small reproduction model if helpful.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions