fix(spdk): make engine frontend recovery async to prevent pod crash loop (backport #539) #543
Merged
…ation support

Make recoverEngineFrontends() run in a separate goroutine so that gRPC servers can start immediately. This prevents the liveness probe from killing the pod when persisted targets are unreachable.

Add per-volume host locks (volumeHostLocks) to serialize host-level NVMe/dm operations for the same volume. Recovery and all frontend lifecycle RPCs that mutate host NVMe controllers or dm devices (create, delete, suspend, resume, expand, switchover) acquire the per-volume lock so that these operations cannot overlap on one volume.

Longhorn 13185

Signed-off-by: Derek Su <derek.su@suse.com>
(cherry picked from commit 93c47a7)
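The per-volume locking described in this commit message can be illustrated with a minimal sketch. Only the name volumeHostLocks comes from the commit message; the struct layout, the lock method, and the maxConcurrency helper are assumptions made for illustration and are not taken from the actual change.

```go
package main

import (
	"fmt"
	"sync"
)

// volumeHostLocks hands out one mutex per volume name.
// Assumption: the real type in the PR may be structured differently.
type volumeHostLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newVolumeHostLocks() *volumeHostLocks {
	return &volumeHostLocks{locks: make(map[string]*sync.Mutex)}
}

// lock blocks until the caller holds the lock for volumeName and returns
// the matching unlock function. Entries are never removed from the map,
// which keeps the sketch simple.
func (v *volumeHostLocks) lock(volumeName string) (unlock func()) {
	v.mu.Lock()
	l, ok := v.locks[volumeName]
	if !ok {
		l = &sync.Mutex{}
		v.locks[volumeName] = l
	}
	v.mu.Unlock()

	l.Lock()
	return l.Unlock
}

// maxConcurrency (hypothetical helper) starts n goroutines that all take the
// lock for the same volume and returns the highest number of goroutines
// observed inside the critical section at once.
func maxConcurrency(v *volumeHostLocks, volumeName string, n int) int {
	var (
		wg      sync.WaitGroup
		statsMu sync.Mutex
		active  int
		peak    int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			unlock := v.lock(volumeName)
			defer unlock()

			statsMu.Lock()
			active++
			if active > peak {
				peak = active
			}
			statsMu.Unlock()

			// Host-level NVMe/dm work for the volume would run here.

			statsMu.Lock()
			active--
			statsMu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	locks := newVolumeHostLocks()
	fmt.Println("peak concurrent operations on vol-1:", maxConcurrency(locks, "vol-1", 8))
}
```

Operations on different volumes take different mutexes in this sketch, so they would not block each other; only operations on the same volume are serialized.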
When EngineFrontendCreate encounters an existing frontend in Pending state (still being recovered asynchronously), evict it instead of returning AlreadyExists. This allows the new Create to proceed immediately without waiting for the potentially slow recovery to complete or fail.

Longhorn 13185

Signed-off-by: Derek Su <derek.su@suse.com>
(cherry picked from commit dfac075)
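A rough sketch of the eviction rule in this commit, under stated assumptions: the server and frontend types, the state constants, and the create method below are hypothetical stand-ins, not the instance-manager's real API. The sketch only shows the decision itself: a Pending entry is marked Terminating and dropped from the map, while any other existing entry still yields an already-exists error.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type state string

const (
	statePending     state = "pending"
	stateRunning     state = "running"
	stateTerminating state = "terminating"
)

// frontend is a hypothetical stand-in for an engine frontend record.
type frontend struct {
	name  string
	state state
}

var errAlreadyExists = errors.New("engine frontend already exists")

// server is a hypothetical stand-in for the component that owns the map.
type server struct {
	mu        sync.Mutex
	frontends map[string]*frontend
}

// create registers a new frontend. An existing entry that is still Pending
// (mid-recovery) is marked Terminating and evicted from the map; the actual
// teardown is left to whoever still holds the old pointer.
func (s *server) create(name string) (*frontend, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if existing, ok := s.frontends[name]; ok {
		if existing.state != statePending {
			return nil, errAlreadyExists
		}
		existing.state = stateTerminating
		delete(s.frontends, name)
	}

	fe := &frontend{name: name, state: stateRunning}
	s.frontends[name] = fe
	return fe, nil
}

func main() {
	s := &server{frontends: map[string]*frontend{}}
	recovering := &frontend{name: "fe-1", state: statePending}
	s.frontends["fe-1"] = recovering

	fe, err := s.create("fe-1")
	fmt.Println("new frontend state:", fe.state, "err:", err)
	fmt.Println("evicted frontend state:", recovering.state)
}
```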
derekbit
approved these changes
May 25, 2026
Which issue(s) this PR fixes:
Issue longhorn/longhorn#13185
What this PR does / why we need it:
When an instance-manager pod restarts with persisted engine frontend records referencing unreachable NVMe-TCP targets, the synchronous recoverEngineFrontends() blocks gRPC startup, exceeding the liveness probe threshold and causing an infinite pod create/delete loop.
The fix runs recovery in a goroutine so that gRPC can serve immediately. A TCP dial pre-check (5s timeout) in the recovery path detects unreachable targets early, avoiding the full 60s nvme-connect retry loop.

Since recovery now runs concurrently with incoming gRPC requests, EngineFrontendCreate may race with the recovery goroutine for the same volume. To handle this safely, Create checks whether a conflicting map entry is still in Pending state (i.e. mid-recovery) and, if so, marks it Terminating and removes it from the map. The recovery goroutine then performs the actual resource teardown sequentially after RecoverFromHost returns, which avoids concurrent access to the NVMe initiator.

The recovery loop itself verifies map ownership (pointer equality) before deleting entries in both the failure and success paths, so it never accidentally removes a new frontend that a concurrent Create registered under the same name or volume.
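The recovery flow described above can be sketched as follows. This is an illustration under assumptions, not the code from the PR: the server and frontend types, targetReachable, removeIfOwner, and recoverAll are hypothetical names, and the real recovery path does more than a reachability check. The sketch shows three ideas from the description: recovery started in a goroutine, a TCP dial with a timeout as a cheap pre-check, and a pointer-equality check before a map entry is deleted.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

// frontend is a hypothetical stand-in for a persisted engine frontend record.
type frontend struct {
	name       string
	targetAddr string // host:port of the NVMe-TCP target
}

type server struct {
	mu        sync.Mutex
	frontends map[string]*frontend
}

// targetReachable performs a plain TCP dial with a timeout. It only checks
// that the address accepts connections, not that an NVMe target answers.
func targetReachable(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}

// removeIfOwner deletes the map entry only if it still points at fe, so an
// entry registered later under the same name is left alone.
func (s *server) removeIfOwner(fe *frontend) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.frontends[fe.name]; ok && cur == fe {
		delete(s.frontends, fe.name)
		return true
	}
	return false
}

// recoverAll walks a snapshot of the map and drops entries whose target
// fails the dial pre-check. Only the failure path is shown here.
func (s *server) recoverAll(timeout time.Duration) {
	s.mu.Lock()
	snapshot := make([]*frontend, 0, len(s.frontends))
	for _, fe := range s.frontends {
		snapshot = append(snapshot, fe)
	}
	s.mu.Unlock()

	for _, fe := range snapshot {
		if !targetReachable(fe.targetAddr, timeout) {
			s.removeIfOwner(fe)
		}
	}
}

func main() {
	// A listener that is closed right away yields a local address that is
	// expected to refuse connections.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	addr := ln.Addr().String()
	_ = ln.Close()

	s := &server{frontends: map[string]*frontend{
		"fe-1": {name: "fe-1", targetAddr: addr},
	}}

	done := make(chan struct{})
	go func() {
		defer close(done)
		s.recoverAll(5 * time.Second)
	}()
	fmt.Println("gRPC servers would start here without waiting for recovery")

	<-done
	fmt.Println("frontends left after recovery:", len(s.frontends))
}
```

Comparing pointers rather than names is what makes the delete safe in this sketch: a frontend created concurrently under the same key is a different object, so the stale recovery pass leaves it in place.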
Special notes for your reviewer:
Additional documentation or context
This is an automatic backport of pull request #539 done by [Mergify](https://mergify.com).