Bug: Restore stuck in Finalizing phase indefinitely due to unbounded hook-tracker wait #9744

@priyansh17

Description

Summary

  • A Velero restore with post-restore exec hooks can get permanently stuck in the Finalizing phase.
  • The WaitRestoreExecHook function polls an in-memory MultiHookTracker with no timeout, so if a hook is registered (Add) but never recorded as executed (Record), the subsequent restore CR hangs forever in finalizing phase.
  • The only recovery is restarting the Velero controller pod.

I had a restore that was stuck in a loop waiting for a hook to execute. Even after the Restore CR expired, the same log message kept appearing, because the wait loop only checks the in-memory tracker.

Steps to Reproduce

  • Create a restore with a post-restore exec hook (e.g., targeting container webcontainer).
  • The hook is registered in the in-memory MultiHookTracker via Add() during pod restore.
  • If the async hook goroutine fails to call Record() for any reason (pod evicted, container never becomes ready, goroutine panic, previous restore's tracker entry leaked), the tracker entry stays in hookExecuted: false state.
  • The restore finalizer controller enters WaitRestoreExecHook() and polls IsComplete() every 1 second forever.

Observed Behavior

  • Restore CR remains in Phase: Finalizing indefinitely.
  • Velero controller logs show repeated "Checking the progress of hooks execution" messages, even for a different, already deleted Restore CR.

Expected Behavior
The restore should complete (or move to PartiallyFailed) within a bounded time, even if hook tracking state becomes orphaned.

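Relevant code (the current wait loop in WaitRestoreExecHook):

// context.Background() never expires, so this poll has no upper bound.
err := wait.PollUntilContextCancel(context.Background(), 1*time.Second, true, func(context.Context) (bool, error) {
    log.Debug("Checking the progress of hooks execution")
    if ctx.multiHookTracker.IsComplete(ctx.restore.Name) {
        return true, nil
    }
    return false, nil
})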

Why it blocks other restores:
The finalizer controller is a single-threaded controller-runtime reconciler.
While one reconcile loop is stuck in the infinite PollUntilContextCancel, no other Finalizing restore can be reconciled.
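
To illustrate the starvation effect, here is a toy model of a work queue. It is not controller-runtime code: reconcileAll, the queue contents, and the "stuck" item are all hypothetical names, and the time window only exists so the demo terminates. With one worker, the stuck item prevents every later item from being reconciled.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// reconcileAll is a hypothetical toy model of a work queue, not controller-runtime.
// It starts `workers` goroutines, lets them run for at most windowMs milliseconds,
// and returns the names that were reconciled. The item named `stuck` blocks its
// worker until the window ends, standing in for an unbounded poll.
func reconcileAll(workers, windowMs int, queue []string, stuck string) []string {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(windowMs)*time.Millisecond)
	defer cancel()

	// Pre-fill the queue in order, then close it so idle workers exit.
	ch := make(chan string, len(queue))
	for _, name := range queue {
		ch <- name
	}
	close(ch)

	var (
		mu   sync.Mutex
		done []string
		wg   sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range ch {
				if name == stuck {
					<-ctx.Done() // this worker is unavailable for the rest of the window
					return
				}
				mu.Lock()
				done = append(done, name)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return done
}

func main() {
	queue := []string{"stuck-restore", "restore-b", "restore-c"}
	// One worker: the stuck item is picked up first and nothing else is reconciled.
	fmt.Println(reconcileAll(1, 100, queue, "stuck-restore"))
	// Two workers: the second worker still drains the remaining items.
	fmt.Println(reconcileAll(2, 100, queue, "stuck-restore"))
}
```

Adding workers only hides the problem in this model, since each stuck item still consumes a worker for good; bounding the wait itself is what releases it.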

Anything else you would like to add:

  • MultiHookTracker.IsComplete() in hook_tracker.go (line 231) returns true only when hookAttemptedCnt == hookExecutedCnt.
  • If a hook was Add()-ed but Record() was never called, this condition is never met.
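
The completion condition above can be sketched with a stripped-down tracker. This is a hypothetical model, not Velero's actual type: the real Add() and Record() take more arguments, and only the two counters named in this issue are modeled here.

```go
package main

import (
	"fmt"
	"sync"
)

// hookTracker is a simplified, hypothetical stand-in for one restore's entry
// in the MultiHookTracker. Only the two counters from the issue are modeled.
type hookTracker struct {
	mu               sync.Mutex
	hookAttemptedCnt int
	hookExecutedCnt  int
}

// Add registers a hook that is expected to run.
func (t *hookTracker) Add() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.hookAttemptedCnt++
}

// Record marks one registered hook as executed.
func (t *hookTracker) Record() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.hookExecutedCnt++
}

// IsComplete mirrors the condition described in the issue:
// complete only when every attempted hook has been recorded.
func (t *hookTracker) IsComplete() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.hookAttemptedCnt == t.hookExecutedCnt
}

func main() {
	empty := &hookTracker{}
	fmt.Println("empty tracker complete:", empty.IsComplete()) // 0 == 0

	leaked := &hookTracker{}
	leaked.Add() // hook registered, but Record() never follows
	fmt.Println("after Add only:", leaked.IsComplete())

	leaked.Record()
	fmt.Println("after Record:", leaked.IsComplete())
}
```

In this model an empty tracker is trivially complete, which matches the observation that a controller restart clears the hang.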

Suggested Fix
Replace context.Background() with a bounded context, using the restore's ItemOperationTimeout (or a sensible default) as the upper bound:

timeout := ctx.resourceTimeout // or derive from restore.Spec.ItemOperationTimeout
waitCtx, waitCancel := context.WithTimeout(context.Background(), timeout)
defer waitCancel()

err := wait.PollUntilContextCancel(waitCtx, 1*time.Second, true, func(context.Context) (bool, error) {
    ...
})
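
For a runnable picture of the bounded wait, here is a minimal sketch that uses only the Go standard library. It swaps wait.PollUntilContextCancel for a hand-written ticker loop so that it has no Kubernetes dependency. pollUntil and waitForHooks are hypothetical helpers, and the millisecond intervals are chosen only to keep the demo fast.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pollUntil calls check immediately and then once per interval until it
// returns true or ctx ends. It is a stdlib stand-in for the polling helper,
// not the Kubernetes implementation.
func pollUntil(ctx context.Context, interval time.Duration, check func() bool) error {
	if check() {
		return nil
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if check() {
				return nil
			}
		}
	}
}

// waitForHooks bounds the wait with a timeout (in milliseconds) and reports
// how the wait ended. isComplete stands in for the tracker's IsComplete check.
func waitForHooks(timeoutMs int, isComplete func() bool) string {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()

	err := pollUntil(ctx, 10*time.Millisecond, isComplete)
	switch {
	case err == nil:
		return "completed"
	case errors.Is(err, context.DeadlineExceeded):
		return "timed out"
	default:
		return "cancelled"
	}
}

func main() {
	// A hook that is never recorded: the wait now ends at the deadline.
	fmt.Println(waitForHooks(100, func() bool { return false }))
	// A tracker that is already complete: the wait returns at once.
	fmt.Println(waitForHooks(100, func() bool { return true }))
}
```

On the "timed out" path the caller would still need to decide what to do with the restore; marking it PartiallyFailed, as suggested under Expected Behavior, is one option.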

Additionally, consider:

  • Calling multiHookTracker.Delete(restore.Name) in a defer at the start of finalization, so leaked entries are always cleaned up.
  • Adding a periodic check that the Restore CR still exists, to break out of the loop if the CR is deleted.
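
Both additional ideas can be sketched together. Everything below is hypothetical: tracker is a simplified stand-in for MultiHookTracker, restoreExists stands in for a lookup of the Restore CR, and the poll budget replaces a real timeout to keep the example deterministic.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tracker is a hypothetical, simplified stand-in for MultiHookTracker: per
// restore, it counts hooks that were added but not yet recorded.
type tracker struct {
	mu      sync.Mutex
	pending map[string]int
}

func newTracker() *tracker { return &tracker{pending: map[string]int{}} }

func (t *tracker) Add(restore string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending[restore]++
}

func (t *tracker) IsComplete(restore string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.pending[restore] == 0
}

func (t *tracker) Delete(restore string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.pending, restore)
}

func (t *tracker) Len() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.pending)
}

// finalize polls at most maxPolls times and reports how the wait ended.
func finalize(t *tracker, restore string, maxPolls int, restoreExists func(string) bool) string {
	// Drop the tracker entry on every exit path so it cannot leak.
	defer t.Delete(restore)

	for i := 0; i < maxPolls; i++ {
		if !restoreExists(restore) {
			return "restore deleted" // the CR is gone, so stop waiting
		}
		if t.IsComplete(restore) {
			return "completed"
		}
		time.Sleep(time.Millisecond)
	}
	return "gave up"
}

func main() {
	t := newTracker()
	t.Add("restore-a") // hook registered, never recorded

	gone := func(string) bool { return false }
	fmt.Println(finalize(t, "restore-a", 5, gone))
	fmt.Println("entries left in tracker:", t.Len())
}
```

Because the cleanup runs in a defer, the leaked entry is removed whether the wait ends by completion, by CR deletion, or by exhausting the poll budget.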

Workaround
Restart the Velero controller pod. On restart the in-memory MultiHookTracker is empty, so IsComplete() returns true immediately.

Environment:

  • Velero version (use velero version): 1.16
  • Kubernetes version (use kubectl version): 1.34
