fix(spdk): make engine frontend recovery async to prevent pod crash loop (backport #539)#543

Merged: derekbit merged 2 commits into v1.12.x from mergify/bp/v1.12.x/pr-539 on May 25, 2026
mergify Bot commented May 25, 2026

Which issue(s) this PR fixes:

Issue longhorn/longhorn#13185

What this PR does / why we need it:

When an instance-manager pod restarts with persisted engine frontend records referencing unreachable NVMe-TCP targets, the synchronous recoverEngineFrontends() blocks gRPC startup, exceeding the liveness probe threshold and causing an infinite pod create/delete loop.

The fix runs recovery in a goroutine so gRPC serves immediately. A TCP dial pre-check (5s timeout) in the recovery path detects unreachable targets early, avoiding the full 60s nvme-connect retry loop.

Since recovery now runs concurrently with incoming gRPC requests, EngineFrontendCreate may race with the recovery goroutine for the same volume. To handle this safely, Create checks whether a conflicting map entry is still in Pending state (i.e. mid-recovery) and, if so, marks it Terminating and removes it from the map. The recovery goroutine then performs the actual resource teardown sequentially after RecoverFromHost returns, which avoids concurrent access to the NVMe initiator.

The recovery loop itself verifies map ownership (pointer equality) before deleting entries in both the failure and success paths, so it never accidentally removes a new frontend that was registered by a concurrent Create under the same name or volume.

Special notes for your reviewer:

Additional documentation or context


This is an automatic backport of pull request #539 done by [Mergify](https://mergify.com).

derekbit added 2 commits May 25, 2026 03:54
…ation support

Make recoverEngineFrontends() run in a separate goroutine so that gRPC
servers can start immediately. This prevents the liveness probe from
killing the pod when persisted targets are unreachable.

Add per-volume host locks (volumeHostLocks) to serialize host-level
NVMe/dm operations for the same volume. Recovery and all frontend
lifecycle RPCs that mutate host NVMe controllers or dm devices (create,
delete, suspend, resume, expand, switchover) acquire the per-volume lock
so that these operations cannot overlap on one volume.

Longhorn 13185

Signed-off-by: Derek Su <derek.su@suse.com>
(cherry picked from commit 93c47a7)
When EngineFrontendCreate encounters an existing frontend in Pending
state (still being recovered asynchronously), evict it instead of
returning AlreadyExists. This allows the new Create to proceed
immediately without waiting for the potentially slow recovery to
complete or fail.

Longhorn 13185

Signed-off-by: Derek Su <derek.su@suse.com>
(cherry picked from commit dfac075)
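
The eviction and ownership check can be illustrated with a small sketch. All names here (`registry`, `frontend`, `create`, `finishRecovery`) are hypothetical, the map is keyed by name only for brevity, and the boolean result stands in for an AlreadyExists error. It shows the idea rather than the merged code.

```go
package main

import "sync"

type state string

const (
	statePending     state = "Pending"
	stateRunning     state = "Running"
	stateTerminating state = "Terminating"
)

type frontend struct {
	name  string
	state state
}

type registry struct {
	mu        sync.Mutex
	frontends map[string]*frontend
}

// create registers a new frontend. A Pending entry (still being recovered)
// is marked Terminating and evicted; any other existing entry is a conflict.
func (r *registry) create(name string) (created *frontend, ok bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if old, exists := r.frontends[name]; exists {
		if old.state != statePending {
			return nil, false // the AlreadyExists case
		}
		old.state = stateTerminating // teardown is left to the recovery goroutine
		delete(r.frontends, name)
	}
	created = &frontend{name: name, state: stateRunning}
	r.frontends[name] = created
	return created, true
}

// finishRecovery is called by the recovery goroutine once recovery of fe has
// returned. It reports whether the caller must tear down fe's resources.
func (r *registry) finishRecovery(fe *frontend, recovered bool) (needsTeardown bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if recovered && fe.state != stateTerminating {
		fe.state = stateRunning
		return false
	}
	// Recovery failed or fe was evicted: delete only an entry we still own,
	// so a frontend registered by a concurrent create is left untouched.
	if r.frontends[fe.name] == fe {
		delete(r.frontends, fe.name)
	}
	return true
}
```

Comparing pointers rather than names is what keeps the recovery goroutine from removing the replacement entry: after an eviction, the map holds a different `*frontend` under the same key.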
@derekbit derekbit merged commit 0b32747 into v1.12.x May 25, 2026
8 of 9 checks passed
@derekbit derekbit deleted the mergify/bp/v1.12.x/pr-539 branch May 25, 2026 04:23