Retry leaves migration permanently stuck -> stale Job from previous attempt is never replaced

### What steps did you take and what happened?

When a migration is retried (via the UI retry button, or by deleting the Migration CR), the MigrationPlan controller silently skips Job creation because a Job with the same name already exists from the previous attempt. No pod is ever created, and the Migration loops forever, logging "Migration pod not found yet, requeuing" every 30 seconds.

In CreateJob, the check for an existing Job has a logic bug in its && condition: only the NotFound case is handled, so an existing Job is never replaced.

    err = r.Get(ctx, ..., job)
    if err != nil && apierrors.IsNotFound(err) {   // only enters block when NOT found
        // create job
    }
    // if job already exists (err == nil), falls through and returns nil silently
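The silent skip can be reproduced with a small in-memory model of the check above. This is only a sketch: `fakeClient`, `job`, `errNotFound`, `createJobBuggy`, and the Job name `migration-job` are hypothetical stand-ins, not the controller's real types, and `errors.Is(err, errNotFound)` stands in for `apierrors.IsNotFound(err)`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for the error apierrors.IsNotFound matches.
var errNotFound = errors.New("not found")

// job is a stripped-down, hypothetical stand-in for a Kubernetes Job.
type job struct {
	Name    string
	Attempt int // which migration attempt created this Job
}

// fakeClient is an in-memory stand-in for the controller's client.
type fakeClient struct {
	jobs    map[string]job
	creates int // number of Create calls that went through
}

func newFakeClient() *fakeClient { return &fakeClient{jobs: map[string]job{}} }

func (c *fakeClient) Get(name string) (job, error) {
	j, ok := c.jobs[name]
	if !ok {
		return job{}, errNotFound
	}
	return j, nil
}

func (c *fakeClient) Create(j job) error {
	c.jobs[j.Name] = j
	c.creates++
	return nil
}

// createJobBuggy mirrors the control flow quoted in this issue.
func createJobBuggy(c *fakeClient, name string, attempt int) error {
	_, err := c.Get(name)
	if err != nil && errors.Is(err, errNotFound) {
		return c.Create(job{Name: name, Attempt: attempt})
	}
	// An existing Job (err == nil) falls through and returns nil silently.
	// Treating other Get errors the same way is an assumption of this sketch.
	return nil
}

func main() {
	c := newFakeClient()
	_ = createJobBuggy(c, "migration-job", 1) // first attempt creates the Job
	_ = createJobBuggy(c, "migration-job", 2) // retry: stale Job found, nothing created
	j, _ := c.Get("migration-job")
	fmt.Printf("creates=%d attempt=%d\n", c.creates, j.Attempt) // creates=1 attempt=1
}
```

The retry returns no error, yet the only Job in the store still belongs to attempt 1, which matches the "no pod is ever created" symptom.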


Why did retry work before 4.1 and break after? PR #1544

Before PR #1544, deleting a Migration set a DeletionTimestamp, but the Migration was not immediately gone because it has a finalizer. CreateMigration got an AlreadyExists error and the reconcile requeued with backoff. During that backoff window, the finalizer was cleared, the Migration was fully deleted, GC deleted the old Job, and by the time the reconcile retried, CreateJob ran against a clean slate.

PR #1544 removed the DeleteFunc watch and made new Migrations start with Phase=Pending. Now the MigrationPlan reconcile is triggered by CreateFunc on the new Migration, which fires immediately after the old Migration is gone, before GC has had a chance to delete the old Job. The stale Job is found, creation is silently skipped, and the migration is stuck permanently.



### What did you expect to happen?

In CreateJob: when a Job with the expected name already exists, delete it and proceed to create a fresh one.
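The expected behavior could look roughly like the sketch below, written against an in-memory stand-in rather than the real client. All names here (`fakeClient`, `job`, `errNotFound`, `errAlreadyExists`, `createJob`, `migration-job`) are hypothetical. The sketch also assumes deletion is synchronous; on a real cluster the old Job may linger in a terminating state, so an actual fix would probably need to requeue until it is gone.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the NotFound and AlreadyExists API errors.
var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

// job is a stripped-down, hypothetical stand-in for a Kubernetes Job.
type job struct {
	Name    string
	Attempt int
}

// fakeClient is an in-memory stand-in for the controller's client.
type fakeClient struct{ jobs map[string]job }

func newFakeClient() *fakeClient { return &fakeClient{jobs: map[string]job{}} }

func (c *fakeClient) Get(name string) (job, error) {
	j, ok := c.jobs[name]
	if !ok {
		return job{}, errNotFound
	}
	return j, nil
}

func (c *fakeClient) Create(j job) error {
	if _, ok := c.jobs[j.Name]; ok {
		return errAlreadyExists
	}
	c.jobs[j.Name] = j
	return nil
}

func (c *fakeClient) Delete(name string) error {
	if _, ok := c.jobs[name]; !ok {
		return errNotFound
	}
	delete(c.jobs, name)
	return nil
}

// createJob handles all three outcomes of Get explicitly:
// found (delete the stale Job), not found (nothing to clean up), other error (surface it).
func createJob(c *fakeClient, name string, attempt int) error {
	_, err := c.Get(name)
	switch {
	case err == nil:
		// A Job from a previous attempt exists: remove it before recreating.
		if err := c.Delete(name); err != nil && !errors.Is(err, errNotFound) {
			return fmt.Errorf("deleting stale job %q: %w", name, err)
		}
	case !errors.Is(err, errNotFound):
		// Unexpected lookup failure: return it instead of swallowing it.
		return fmt.Errorf("getting job %q: %w", name, err)
	}
	return c.Create(job{Name: name, Attempt: attempt})
}

func main() {
	c := newFakeClient()
	_ = createJob(c, "migration-job", 1)    // first attempt
	err := createJob(c, "migration-job", 2) // retry replaces the stale Job
	j, _ := c.Get("migration-job")
	fmt.Printf("attempt=%d err=%v\n", j.Attempt, err) // attempt=2 err=<nil>
}
```

Returning unexpected `Get` errors, rather than falling through, also lets the reconcile requeue with backoff instead of reporting success when nothing was created.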

### Environment

pov env - VJAILB-147

### vCenter version

N/A

### Anything else you would like to add?

_No response_

Retry leaves migration permanently stuck -> stale Job from previous attempt is never replaced #1787