fix(spdk): make engine frontend recovery async to prevent pod crash loop (backport #539) by mergify[bot] · Pull Request #543 · longhorn/longhorn-spdk-engine

mergify · 2026-05-25T03:54:17Z

Which issue(s) this PR fixes:

What this PR does / why we need it:

When an instance-manager pod restarts with persisted engine frontend records referencing unreachable NVMe-TCP targets, the synchronous recoverEngineFrontends() blocks gRPC startup, exceeding the liveness probe threshold and causing an infinite pod create/delete loop.

The fix runs recovery in a goroutine so that gRPC starts serving immediately. A TCP dial pre-check (5s timeout) in the recovery path detects unreachable targets early and avoids the full 60s nvme-connect retry loop.

Because recovery now runs concurrently with incoming gRPC requests, EngineFrontendCreate may race with the recovery goroutine for the same volume. To handle this safely, Create checks whether a conflicting map entry is still in Pending state (i.e. mid-recovery). If so, it marks the entry Terminating and removes it from the map. The recovery goroutine then performs the actual resource teardown sequentially after RecoverFromHost returns, which avoids concurrent access to the NVMe initiator.

The recovery loop itself verifies map ownership (pointer equality) before deleting entries in both the failure and success paths, so it never accidentally removes a new frontend that a concurrent Create registered under the same name or volume.
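The combination of asynchronous recovery and a TCP dial pre-check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `targetReachable`, `recoverAsync`, and `startLocalListener` are hypothetical names, and the real recovery path does far more than probe a port.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// targetReachable reports whether a TCP connection to addr can be opened
// within timeout. It only shows that the port accepts connections; it says
// nothing about the health of an NVMe-TCP target behind it.
// (Hypothetical helper, not taken from the PR.)
func targetReachable(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}

// recoverAsync probes every address in a background goroutine and returns
// immediately, so the caller can start serving requests while the probes run.
// The per-address results are delivered once on the returned channel.
func recoverAsync(addrs []string, timeout time.Duration) <-chan map[string]bool {
	done := make(chan map[string]bool, 1)
	go func() {
		results := make(map[string]bool, len(addrs))
		for _, addr := range addrs {
			results[addr] = targetReachable(addr, timeout)
		}
		done <- results
	}()
	return done
}

// startLocalListener opens a loopback listener that stands in for a
// reachable target in this demo. stop closes it.
func startLocalListener() (addr string, stop func(), err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	return ln.Addr().String(), func() { _ = ln.Close() }, nil
}

func main() {
	addr, stop, err := startLocalListener()
	if err != nil {
		panic(err)
	}
	defer stop()

	done := recoverAsync([]string{addr}, 5*time.Second)
	fmt.Println("server startup continues while recovery runs")
	fmt.Println("probe results:", <-done)
}
```

The point of the pre-check is to bound the cost of an unreachable target by the dial timeout instead of the full connect retry loop; running it in a goroutine keeps even that bounded cost off the startup path.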

Special notes for your reviewer:

Additional documentation or context

This is an automatic backport of pull request #539 done by [Mergify](https://mergify.com).

…ation support

Make recoverEngineFrontends() run in a separate goroutine so that gRPC servers can start immediately. This prevents the liveness probe from killing the pod when persisted targets are unreachable.

Add per-volume host locks (volumeHostLocks) to serialize host-level NVMe/dm operations for the same volume. Recovery and all frontend lifecycle RPCs that mutate host NVMe controllers or dm devices (create, delete, suspend, resume, expand, switchover) acquire the per-volume lock so that these operations cannot overlap on one volume.

Longhorn 13185

Signed-off-by: Derek Su <derek.su@suse.com>
(cherry picked from commit 93c47a7)
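A per-volume lock of the kind this commit message describes could look roughly like the sketch below. The source names `volumeHostLocks` but does not show its implementation, so the `volumeLocks` type, its `lock` method, and the `maxOverlap` demo helper are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// volumeLocks hands out one mutex per volume name, so operations on the same
// volume are serialized while different volumes proceed independently.
// (Hypothetical type; the PR's volumeHostLocks may be structured differently.)
type volumeLocks struct {
	mu    sync.Mutex             // guards the locks map itself
	locks map[string]*sync.Mutex // one entry per volume, created on demand
}

func newVolumeLocks() *volumeLocks {
	return &volumeLocks{locks: make(map[string]*sync.Mutex)}
}

// lock blocks until the caller holds the lock for volume and returns the
// function that releases it. Entries are never removed in this sketch.
func (v *volumeLocks) lock(volume string) (unlock func()) {
	v.mu.Lock()
	l, ok := v.locks[volume]
	if !ok {
		l = &sync.Mutex{}
		v.locks[volume] = l
	}
	v.mu.Unlock()

	l.Lock()
	return l.Unlock
}

// maxOverlap runs n concurrent simulated host operations on one volume and
// returns the highest number that were inside the critical section at once.
func maxOverlap(v *volumeLocks, volume string, n int) int {
	var wg sync.WaitGroup
	var statMu sync.Mutex
	active, peak := 0, 0

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			unlock := v.lock(volume)
			defer unlock()

			statMu.Lock()
			active++
			if active > peak {
				peak = active
			}
			statMu.Unlock()

			// A real caller would touch NVMe controllers or dm devices here.

			statMu.Lock()
			active--
			statMu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	v := newVolumeLocks()
	fmt.Println("peak concurrent operations on one volume:", maxOverlap(v, "vol-1", 50))
}
```

Note that the map mutex is released before the per-volume mutex is taken, so a slow operation on one volume does not block lock lookups for other volumes.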

When EngineFrontendCreate encounters an existing frontend in Pending state (still being recovered asynchronously), evict it instead of returning AlreadyExists. This allows the new Create to proceed immediately without waiting for the potentially slow recovery to complete or fail.

Longhorn 13185

Signed-off-by: Derek Su <derek.su@suse.com>
(cherry picked from commit dfac075)
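The eviction of a Pending entry, together with the pointer-equality ownership check mentioned in the PR description, might be sketched like this. All names here (`registry`, `frontend`, `create`, `removeIfOwner`, `addPending`, and the `Running` state given to newly created entries) are hypothetical; only the Pending and Terminating states and the AlreadyExists outcome come from the text.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type state string

const (
	statePending     state = "Pending"
	stateRunning     state = "Running" // assumed state for a freshly created frontend
	stateTerminating state = "Terminating"
)

type frontend struct {
	name  string
	state state
}

var errAlreadyExists = errors.New("frontend already exists")

// registry is a hypothetical stand-in for the server's frontend map.
type registry struct {
	mu        sync.Mutex
	frontends map[string]*frontend
}

func newRegistry() *registry {
	return &registry{frontends: make(map[string]*frontend)}
}

// addPending registers an entry as the recovery goroutine would.
func (r *registry) addPending(name string) *frontend {
	r.mu.Lock()
	defer r.mu.Unlock()
	f := &frontend{name: name, state: statePending}
	r.frontends[name] = f
	return f
}

// create registers a new frontend. A conflicting Pending entry is evicted:
// marked Terminating and dropped from the map, with teardown left to whoever
// still holds the old pointer. Any other conflict returns errAlreadyExists.
func (r *registry) create(name string) (*frontend, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if old, ok := r.frontends[name]; ok {
		if old.state != statePending {
			return nil, errAlreadyExists
		}
		old.state = stateTerminating
		delete(r.frontends, name)
	}
	f := &frontend{name: name, state: stateRunning}
	r.frontends[name] = f
	return f, nil
}

// removeIfOwner deletes the entry only if the map still points at f, so a
// stale pointer can never remove a newer frontend with the same name.
func (r *registry) removeIfOwner(f *frontend) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if cur, ok := r.frontends[f.name]; ok && cur == f {
		delete(r.frontends, f.name)
		return true
	}
	return false
}

func main() {
	r := newRegistry()
	recovering := r.addPending("vol-1")
	created, err := r.create("vol-1")
	fmt.Println("create error:", err)
	fmt.Println("old entry state:", recovering.state)
	fmt.Println("stale removal applied:", r.removeIfOwner(recovering))
	fmt.Println("new entry still registered:", r.frontends["vol-1"] == created)
}
```

Comparing pointers rather than names is what lets the recovery side clean up after itself without needing to know whether a concurrent Create has replaced its entry.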

derekbit added 2 commits May 25, 2026 03:54

mergify Bot mentioned this pull request May 25, 2026

fix(spdk): make engine frontend recovery async to prevent pod crash loop #539

Merged

derekbit approved these changes May 25, 2026

derekbit merged commit 0b32747 into v1.12.x May 25, 2026
8 of 9 checks passed

derekbit deleted the mergify/bp/v1.12.x/pr-539 branch May 25, 2026 04:23