Bug: Restore stuck in Finalizing phase indefinitely due to unbounded hook-tracker wait

**Summary**
- A Velero restore with post-restore exec hooks can get permanently stuck in the Finalizing phase. 
- The WaitRestoreExecHook function polls an in-memory MultiHookTracker with no timeout, so if a hook is registered (Add) but never recorded as executed (Record), the restore CR hangs forever in the Finalizing phase.
- The only recovery is restarting the Velero controller pod.

I had a restore that was stuck in a loop waiting for a hook to execute. Even after the restore CR expired, the same log message kept appearing, because the check runs against in-memory state rather than the CR.

**Steps to Reproduce**
- Create a restore with a post-restore exec hook (e.g., targeting container webcontainer).
- The hook is registered in the in-memory MultiHookTracker via Add() during pod restore.
- If the async hook goroutine fails to call Record() for any reason (pod evicted, container never becomes ready, goroutine panic, previous restore's tracker entry leaked), the tracker entry stays in hookExecuted: false state.
- The restore finalizer controller enters WaitRestoreExecHook() and polls IsComplete() every 1 second forever.

**Observed Behavior**
- Restore CR remains in Phase: Finalizing indefinitely.
- Velero controller logs show repeated `Checking the progress of hooks execution` messages, even for a different, already-deleted restore CR.

**Expected Behavior**
The restore should complete (or move to PartiallyFailed) within a bounded time, even if hook tracking state becomes orphaned.

https://github.com/velero-io/velero/blob/7549408e210529a7dc81dd2c097734ee8cf257ba/pkg/controller/restore_finalizer_controller.go#L563-L569

**Why it blocks other restores**
The finalizer controller is a single-threaded controller-runtime reconciler. 
While one reconcile loop is stuck in the infinite PollUntilContextCancel, no other Finalizing restore can be reconciled.


**Anything else you would like to add:**
- MultiHookTracker.IsComplete() in hook_tracker.go (line 231) returns true only when hookAttemptedCnt == hookExecutedCnt. 
- If a hook was Add()-ed but Record() was never called, this condition is never met.
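
The completion condition can be modelled in a few lines. The following is a simplified sketch, not the actual code in hook_tracker.go: the `tracker` and `hookKey` types are hypothetical, and only the counter names `hookAttemptedCnt` and `hookExecutedCnt` are taken from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// hookKey identifies one hook on one container (hypothetical, simplified).
type hookKey struct{ pod, container, hook string }

// tracker is a simplified stand-in for a per-restore hook tracker.
type tracker struct {
	mu               sync.Mutex
	executed         map[hookKey]bool
	hookAttemptedCnt int
	hookExecutedCnt  int
}

func newTracker() *tracker { return &tracker{executed: map[hookKey]bool{}} }

// Add registers a hook that is expected to run.
func (t *tracker) Add(k hookKey) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.executed[k]; !ok {
		t.executed[k] = false
		t.hookAttemptedCnt++
	}
}

// Record marks a previously registered hook as executed.
func (t *tracker) Record(k hookKey) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if done, ok := t.executed[k]; ok && !done {
		t.executed[k] = true
		t.hookExecutedCnt++
	}
}

// IsComplete reports whether every registered hook has been recorded.
func (t *tracker) IsComplete() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.hookAttemptedCnt == t.hookExecutedCnt
}

func main() {
	t := newTracker()
	fmt.Println(t.IsComplete()) // true: an empty tracker is trivially complete

	k := hookKey{"web-0", "webcontainer", "post-restore"}
	t.Add(k)
	fmt.Println(t.IsComplete()) // false: registered but not yet recorded

	t.Record(k)
	fmt.Println(t.IsComplete()) // true: counters match again
}
```

If the `Record` call is skipped, the second result is the permanent state, which is the situation a caller polling `IsComplete` without a deadline can never escape.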

**Suggested Fix**
Replace context.Background() with a bounded context, using the restore's ItemOperationTimeout (or a sensible default) as the upper bound:
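```go
// Bound the wait. Note that ctx here is the finalizer context struct, not a context.Context.
timeout := ctx.resourceTimeout // or derive from restore.Spec.ItemOperationTimeout
waitCtx, waitCancel := context.WithTimeout(context.Background(), timeout)
defer waitCancel()

// With a deadline on waitCtx, the poll returns an error once the timeout elapses.
err := wait.PollUntilContextCancel(waitCtx, 1*time.Second, true, func(context.Context) (bool, error) {
    ...
})
```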
Additionally, consider:

- Calling multiHookTracker.Delete(restore.Name) in a defer at the start of finalization, so leaked entries are always cleaned up.
- Adding a periodic check that the Restore CR still exists, to break out of the loop if the CR is deleted.

**Workaround**
Restart the Velero controller pod. On restart the in-memory MultiHookTracker is empty, so IsComplete() returns true immediately.

**Environment:**

- Velero version (use `velero version`): 1.16
- Kubernetes version (use `kubectl version`): 1.34


**Vote on this issue!**

This is an invitation to the Velero community to vote on issues, you can see the project's [top voted issues listed here](https://github.com/vmware-tanzu/velero/issues?q=is%3Aissue+is%3Aopen+sort%3Areactions-%2B1-desc).  
Use the "reaction smiley face" up to the right of this comment to vote.

- :+1: for "I would like to see this bug fixed as soon as possible"
- :-1: for "There are more important bugs to focus on right now"


Current code at the permalink above (restore_finalizer_controller.go#L563-L569):

	err := wait.PollUntilContextCancel(context.Background(), 1*time.Second, true, func(context.Context) (bool, error) {
		log.Debug("Checking the progress of hooks execution")
		if ctx.multiHookTracker.IsComplete(ctx.restore.Name) {
			return true, nil
		}
		return false, nil
	})

Bug: Restore stuck in Finalizing phase indefinitely due to unbounded hook-tracker wait #9744
