IBMVPCMachine deletion can fail on repeated instance delete when cached finalizer/status is stale #2815

Hi Cluster API Provider IBMCloud maintainers,

I think there may be a small race/idempotency issue in the `IBMVPCMachine`
deletion path.

When an `IBMVPCMachine` is being deleted, the reconciler reads the machine
object, dispatches to `reconcileDelete`, calls `scope.DeleteMachine()`, and
removes the finalizer only after the external delete succeeds. The finalizer
removal is then persisted by the deferred patch helper.

If the next reconcile runs before the controller-runtime cache has observed the
finalizer-removal patch, it can read the previous cached object where
`deletionTimestamp`, the finalizer, and `Status.InstanceID` are still present.
In that case, the deletion path can call the VPC `DeleteInstance` API a second
time for the already deleted external instance.

The path I am looking at is:

```go
if !ibmVPCMachine.DeletionTimestamp.IsZero() {
    return r.reconcileDelete(ctx, machineScope)
}
```

and in the delete path:

```go
if err := scope.DeleteMachine(); err != nil {
    return ctrl.Result{}, fmt.Errorf("error deleting IBMVPCMachine %s/%s: %w", ...)
}

defer func() {
    if reterr == nil {
        controllerutil.RemoveFinalizer(scope.IBMVPCMachine, infrav1.MachineFinalizer)
    }
}()
```

The VPC delete call uses `Status.InstanceID` and returns any provider error unchanged, including instance-not-found:

```go
options := &vpcv1.DeleteInstanceOptions{}
options.SetID(m.IBMVPCMachine.Status.InstanceID)
_, err := m.IBMVPCClient.DeleteInstance(options)
return err
```

A possible interleaving is:

```text
first reconcile:
  cached IBMVPCMachine has deletionTimestamp, finalizer, and Status.InstanceID
  DeleteInstance(Status.InstanceID) succeeds
  RemoveFinalizer mutates the object
  deferred patch persists the finalizer removal

second reconcile before the cache observes the finalizer-removal patch:
  cached IBMVPCMachine still has deletionTimestamp, finalizer, and Status.InstanceID
  DeleteInstance(Status.InstanceID) is called again
  provider returns instance-not-found
  delete error propagates and the finalizer remains
```

I reproduced this interleaving with a small adversarial cache-lag model. In that
model, the second reconcile fails with:

```text
DeleteInstance(capi-instance-id) returned instance-not-found
```

The existing tests also seem to model delete errors as fatal in this path. For
example, `DeleteMachine` returns a non-nil error when the VPC `DeleteInstance`
call returns an error, and the controller tests distinguish successful deletion
from delete errors when deciding whether the finalizer is removed.

A minimal idempotency fix may be to treat provider instance-not-found errors as
successful cleanup in the `IBMVPCMachine` deletion path. A stronger
controller-level fix would be to avoid running the external delete again until
the cache has observed the finalizer-removal patch from the previous successful
delete. The idempotency fix seems useful on its own because repeated cleanup
attempts can happen during normal controller retries/cache lag.

Please let me know if I am missing an intended behavior here. I can also provide
the small reproduction model if helpful.
