fix(spdk): make engine frontend recovery async to prevent pod crash loop#539
Merged
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses instance-manager pod crash loops caused by synchronous engine-frontend recovery blocking gRPC startup when persisted NVMe-TCP targets are unreachable. It makes recovery asynchronous and adds additional safeguards to reduce long connect retry delays and handle races between recovery and new EngineFrontendCreate calls.
Changes:
- Run
recoverEngineFrontends()asynchronously during server startup to allow gRPC to start serving immediately. - Add race-safety checks in recovery and
EngineFrontendCreateto avoid deleting/replacing newer frontends when names/volumes collide during concurrent recovery. - Add a TCP reachability pre-check (5s dial timeout) in the recovery reconnect path to avoid lengthy NVMe reconnect retry loops for unreachable targets.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/spdk/server.go | Starts engine frontend recovery asynchronously during server initialization. |
| pkg/spdk/server_enginefrontend.go | Adjusts create logic to evict Pending (recovering) frontends safely on name/volume conflicts. |
| pkg/spdk/enginefrontend.go | Adds recovery reachability pre-check and prevents deferred recovery state updates from clobbering concurrent state changes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
f48d136 to
deeb6e9
Compare
…ation support Make recoverEngineFrontends() run in a separate goroutine so that gRPC servers can start immediately. This prevents the liveness probe from killing the pod when persisted targets are unreachable. Add per-volume host locks (volumeHostLocks) to serialize host-level NVMe/dm operations for the same volume. Recovery and all frontend lifecycle RPCs that mutate host NVMe controllers or dm devices (create, delete, suspend, resume, expand, switchover) acquire the per-volume lock so that these operations cannot overlap on one volume. Longhorn 13185 Signed-off-by: Derek Su <derek.su@suse.com>
When EngineFrontendCreate encounters an existing frontend in Pending state (still being recovered asynchronously), evict it instead of returning AlreadyExists. This allows the new Create to proceed immediately without waiting for the potentially slow recovery to complete or fail. Longhorn 13185 Signed-off-by: Derek Su <derek.su@suse.com>
Member
Author
Member
Author
|
cc @davidcheng0922 for review. |
Member
Author
|
@mergify backport v1.12.x |
✅ Backports have been createdDetails
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue(s) this PR fixes:
Issue longhorn/longhorn#13185
What this PR does / why we need it:
When an instance-manager pod restarts with persisted engine frontend records referencing unreachable NVMe-TCP targets, the synchronous recoverEngineFrontends() blocks gRPC startup, exceeding the liveness probe threshold and causing an infinite pod create/delete loop.
The fix runs recovery as a goroutine so gRPC serves immediately. A TCP dial pre-check (5s timeout) in the recovery path detects unreachable targets early, avoiding the full 60s nvme-connect retry loop. Since recovery now runs concurrently with incoming gRPC requests, EngineFrontendCreate may race with the recovery goroutine for the same volume. To handle this safely, Create checks whether a conflicting map entry is still in Pending state (i.e. mid-recovery) and if so marks it Terminating and removes it from the map — letting the recovery goroutine perform the actual resource teardown sequentially after RecoverFromHost returns, which avoids concurrent access to the NVMe initiator. The recovery loop itself verifies map ownership (pointer equality) before deleting entries in both the failure and success paths, so it never accidentally removes a new frontend that was registered by a concurrent Create under the same name or volume.
Special notes for your reviewer:
Additional documentation or context