Summary
- A Velero restore with post-restore exec hooks can get permanently stuck in the Finalizing phase.
- The WaitRestoreExecHook function polls an in-memory MultiHookTracker with no timeout, so if a hook is registered (Add) but never recorded as executed (Record), the subsequent restore CR hangs forever in the Finalizing phase.
- The only recovery is restarting the Velero controller pod.
I had a restore in progress that was stuck in a loop waiting to execute a hook. Even after the restore CR expired, the same log message kept appearing, because the controller was still checking the in-memory tracker.
Steps to Reproduce
- Create a restore with a post-restore exec hook (e.g., targeting container webcontainer).
- The hook is registered in the in-memory MultiHookTracker via Add() during pod restore.
- If the async hook goroutine fails to call Record() for any reason (pod evicted, container never becomes ready, goroutine panic, previous restore's tracker entry leaked), the tracker entry stays in hookExecuted: false state.
- The restore finalizer controller enters WaitRestoreExecHook() and polls IsComplete() every 1 second forever.
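The steps above can be modelled with a small, simplified tracker. This is a sketch, not Velero's actual implementation: the type and function names (multiHookTracker, newMultiHookTracker) are hypothetical, and the real tracker keeps more per-hook detail. It only shows how an Add() without a matching Record() leaves IsComplete() false indefinitely.

```go
package main

import (
	"fmt"
	"sync"
)

// hookTracker is a simplified stand-in for a per-restore hook tracker.
type hookTracker struct {
	attempted int // hooks registered via Add
	executed  int // hooks marked done via Record
}

// multiHookTracker is a hypothetical model of the in-memory tracker,
// keyed by restore name.
type multiHookTracker struct {
	mu       sync.RWMutex
	trackers map[string]*hookTracker
}

func newMultiHookTracker() *multiHookTracker {
	return &multiHookTracker{trackers: make(map[string]*hookTracker)}
}

// Add registers one hook that is expected to run for the restore.
func (m *multiHookTracker) Add(restore string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	t, ok := m.trackers[restore]
	if !ok {
		t = &hookTracker{}
		m.trackers[restore] = t
	}
	t.attempted++
}

// Record marks one registered hook as executed.
func (m *multiHookTracker) Record(restore string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if t, ok := m.trackers[restore]; ok && t.executed < t.attempted {
		t.executed++
	}
}

// IsComplete reports whether every registered hook was recorded.
// A restore with no entry counts as complete.
func (m *multiHookTracker) IsComplete(restore string) bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	t, ok := m.trackers[restore]
	if !ok {
		return true
	}
	return t.attempted == t.executed
}

func main() {
	m := newMultiHookTracker()
	m.Add("restore-a")                     // hook registered during pod restore
	fmt.Println(m.IsComplete("restore-a")) // false: Record() has not run
	m.Record("restore-a")
	fmt.Println(m.IsComplete("restore-a")) // true
	fmt.Println(m.IsComplete("restore-b")) // true: no entry for this restore
}
```

If the goroutine that should call Record() never does, nothing else in this model can flip the first result to true, which matches the hang described here.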
Observed Behavior
- Restore CR remains in Phase: Finalizing indefinitely.
- Velero controller logs show repeated "Checking the progress of hooks execution" messages, even for a different, already deleted restore CR.
Expected Behavior
The restore should complete (or move to PartiallyFailed) within a bounded time, even if hook tracking state becomes orphaned.
err := wait.PollUntilContextCancel(context.Background(), 1*time.Second, true, func(context.Context) (bool, error) {
	log.Debug("Checking the progress of hooks execution")
	if ctx.multiHookTracker.IsComplete(ctx.restore.Name) {
		return true, nil
	}
	return false, nil
})
Why it blocks other restores:
The finalizer controller is a single-threaded controller-runtime reconciler.
While one reconcile loop is stuck in the infinite PollUntilContextCancel, no other Finalizing restore can be reconciled.
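A minimal sketch of that blocking effect, using only the standard library. The names (reconcile, runSingleWorker, demoTimeout) are hypothetical and this is not controller-runtime code; the demo adds a deadline purely so that it terminates, whereas the loop in question has none.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// demoTimeout only exists so the demo ends; the real poll is unbounded.
const demoTimeout = 50 * time.Millisecond

// reconcile is a stand-in for one finalizer reconcile call.
// A stuck item blocks until ctx is done, like the unbounded poll.
func reconcile(ctx context.Context, stuck bool) {
	if stuck {
		<-ctx.Done()
	}
}

// runSingleWorker drains the queue with one worker and returns the
// names that were fully reconciled before the demo deadline.
func runSingleWorker(queue []string, stuckName string) []string {
	ctx, cancel := context.WithTimeout(context.Background(), demoTimeout)
	defer cancel()

	reconciled := []string{}
	for _, name := range queue {
		reconcile(ctx, name == stuckName)
		if ctx.Err() != nil {
			break // the worker never got past the stuck item
		}
		reconciled = append(reconciled, name)
	}
	return reconciled
}

func main() {
	queue := []string{"restore-a", "restore-stuck", "restore-b"}
	fmt.Println(runSingleWorker(queue, "restore-stuck")) // [restore-a]
}
```

Here restore-b is never reached, even though nothing is wrong with it. Raising worker concurrency would hide the symptom but not fix the unbounded wait.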
Anything else you would like to add:
- MultiHookTracker.IsComplete() in hook_tracker.go (line 231) returns true only when hookAttemptedCnt == hookExecutedCnt.
- If a hook was Add()-ed but Record() was never called, this condition is never met.
Suggested Fix
Replace context.Background() with a bounded context, using the restore's ItemOperationTimeout (or a sensible default) as the upper bound:
timeout := ctx.resourceTimeout // or derive from restore.Spec.ItemOperationTimeout
waitCtx, waitCancel := context.WithTimeout(context.Background(), timeout)
defer waitCancel()
err := wait.PollUntilContextCancel(waitCtx, 1*time.Second, true, func(context.Context) (bool, error) {
...
})
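To make the intended behaviour concrete, here is a self-contained sketch of a bounded wait. It does not use k8s.io/apimachinery: pollUntil is a hypothetical standard-library stand-in for wait.PollUntilContextCancel with immediate set to true, and waitForHooks, demoTimeout and demoInterval are illustrative names, not Velero code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Short values so the demo finishes quickly; real values would come
// from the restore spec or a server default.
const (
	demoTimeout  = 50 * time.Millisecond
	demoInterval = 5 * time.Millisecond
)

// pollUntil calls cond immediately and then once per interval until it
// returns true, returns an error, or ctx is done.
func pollUntil(ctx context.Context, interval time.Duration, cond func(context.Context) (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := cond(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// waitForHooks waits at most timeout for isComplete to report true.
// It returns false with a nil error when the deadline is hit, so the
// caller can decide how to mark the restore.
func waitForHooks(isComplete func() bool, timeout, interval time.Duration) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	err := pollUntil(ctx, interval, func(context.Context) (bool, error) {
		return isComplete(), nil
	})
	if errors.Is(err, context.DeadlineExceeded) {
		return false, nil
	}
	return err == nil, err
}

func main() {
	// An orphaned tracker entry never completes; the wait still returns.
	ok, err := waitForHooks(func() bool { return false }, demoTimeout, demoInterval)
	fmt.Println(ok, err)
}
```

Returning a distinct "timed out" result rather than an error lets the caller move the restore to PartiallyFailed instead of retrying the same wait.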
Additionally, consider:
- Calling multiHookTracker.Delete(restore.Name) in a defer at the start of finalization, so leaked entries are always cleaned up.
- Adding a periodic check that the Restore CR still exists, to break out of the loop if the CR is deleted.
Workaround
Restart the Velero controller pod. On restart the in-memory MultiHookTracker is empty, so IsComplete() returns true immediately.
Environment:
- Velero version (use velero version): 1.16
- Kubernetes version (use kubectl version): 1.34
Code reference for the polling loop quoted under Expected Behavior: velero/pkg/controller/restore_finalizer_controller.go, lines 563 to 569 at commit 7549408.