What steps did you take and what happened?
When a migration is retried (via the UI retry button, or by deleting the Migration CR), the MigrationPlan controller silently skips Job creation because a Job with the same name already exists from the previous attempt. No pod is ever created, and the Migration loops forever with "Migration pod not found yet, requeuing" every 30 seconds.
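In CreateJob, the check for an existing Job only handles the NotFound case. The && condition creates the Job when Get returns NotFound, but nothing handles the case where the Job already exists: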
err = r.Get(ctx, ..., job)
if err != nil && apierrors.IsNotFound(err) { // block runs only when the Job is NOT found
    // create job
}
// If the Job already exists (err == nil), execution falls through and CreateJob returns nil silently.
Why did retry work before 4.1 and break afterwards? PR #1544
Before PR #1544, deleting a Migration set a DeletionTimestamp but the Migration wasn't immediately gone (it has a finalizer). CreateMigration got an AlreadyExists error and the reconcile requeued with backoff. During that backoff window, the finalizer was cleared, the Migration was fully deleted, GC deleted the old Job, and by the time the reconcile retried, CreateJob ran against a clean slate.
PR #1544 removed the DeleteFunc watch and made new Migrations start with Phase=Pending. Now the MigrationPlan reconcile is triggered by CreateFunc on the new Migration — which fires immediately after the old Migration is gone, before GC has had a chance to delete the old Job. The stale Job is found, creation is silently skipped, and the migration is stuck permanently.
What did you expect to happen?
In CreateJob: when a Job with the expected name already exists, delete it and proceed to create a fresh one.
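A minimal sketch of the expected behaviour, not the actual controller code. To keep it self-contained it uses a hypothetical in-memory jobStore in place of the controller-runtime client, and a hypothetical errNotFound sentinel in place of the error that apierrors.IsNotFound recognises; ensureFreshJob is likewise an illustrative name. The point is only that all three outcomes of Get (found, not found, other error) are handled explicitly.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for the error that
// apierrors.IsNotFound would recognise in the real controller.
var errNotFound = errors.New("not found")

// jobStore is a hypothetical, minimal stand-in for the Kubernetes client.
type jobStore interface {
	Get(name string) (uid string, err error)
	Create(name string) (uid string, err error)
	Delete(name string) error
}

// memStore is an in-memory jobStore used only for illustration.
type memStore struct {
	jobs map[string]string // job name -> UID
	next int               // counter used to mint a new UID per Create
}

func newMemStore() *memStore { return &memStore{jobs: map[string]string{}} }

func (s *memStore) Get(name string) (string, error) {
	uid, ok := s.jobs[name]
	if !ok {
		return "", errNotFound
	}
	return uid, nil
}

func (s *memStore) Create(name string) (string, error) {
	if _, ok := s.jobs[name]; ok {
		return "", fmt.Errorf("job %q already exists", name)
	}
	s.next++
	uid := fmt.Sprintf("uid-%d", s.next)
	s.jobs[name] = uid
	return uid, nil
}

func (s *memStore) Delete(name string) error {
	if _, ok := s.jobs[name]; !ok {
		return errNotFound
	}
	delete(s.jobs, name)
	return nil
}

// ensureFreshJob deletes a stale Job with the same name, if one exists,
// and then creates a new one. Lookup errors other than "not found" are
// returned to the caller instead of being silently ignored.
func ensureFreshJob(s jobStore, name string) (string, error) {
	_, err := s.Get(name)
	switch {
	case err == nil:
		// A Job from a previous attempt is still present: remove it.
		// A concurrent deletion (for example by GC) is not an error.
		if derr := s.Delete(name); derr != nil && !errors.Is(derr, errNotFound) {
			return "", fmt.Errorf("deleting stale job %q: %w", name, derr)
		}
	case !errors.Is(err, errNotFound):
		return "", fmt.Errorf("looking up job %q: %w", name, err)
	}
	return s.Create(name)
}
```

Calling ensureFreshJob twice with the same name yields two different UIDs, i.e. the retry gets a fresh Job rather than silently reusing the stale one. Note that the in-memory store deletes synchronously; against a real API server the old Job may still be terminating when Create is called, so the real fix would probably need to requeue or wait until the stale Job is gone.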
Environment
pov env - VJAILB-147
vCenter version
N/A
Anything else you would like to add?
No response