[BUG] v2 instance-manager pod stuck in create/delete loop when engine frontend recovery blocks gRPC startup #13185

@derekbit

Describe the Bug

When an instance-manager pod (v2 data engine) restarts while persisted engine frontend records reference NVMe-TCP targets that are no longer reachable (for example, because the volume was deleted or the engine was migrated), recoverEngineFrontends() blocks NewServer() synchronously. The gRPC servers therefore never start, the liveness probe fails, and kubelet kills the container. Because the pod has restartPolicy: Never, the Longhorn controller deletes and recreates it, but the new pod finds the same stale metadata, so the cycle repeats indefinitely.
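A minimal sketch of the ordering problem, not the actual Longhorn code: if the constructor hands recovery to a goroutine instead of waiting on it, it can return (and the gRPC servers can start) even while a reconnect hangs. The `server` type, `newServer`, and `recoveryFinished` are hypothetical names for illustration; only `recoverEngineFrontends()` and `NewServer()` come from the report above.

```go
package main

import "fmt"

// server is a hypothetical stand-in for the instance-manager process.
type server struct {
	recoveryDone chan struct{} // closed when background recovery finishes
}

// newServer returns without waiting for recovery. The recovery callback
// (standing in for recoverEngineFrontends) runs in its own goroutine, so a
// hung reconnect cannot delay whatever the caller starts next.
func newServer(recoverFn func()) *server {
	s := &server{recoveryDone: make(chan struct{})}
	go func() {
		defer close(s.recoveryDone)
		recoverFn()
	}()
	return s
}

// recoveryFinished reports, without blocking, whether recovery has completed.
func (s *server) recoveryFinished() bool {
	select {
	case <-s.recoveryDone:
		return true
	default:
		return false
	}
}

func main() {
	release := make(chan struct{})

	// Simulate a reconnect to an unreachable target: it hangs until released.
	s := newServer(func() { <-release })
	fmt.Println("constructor returned; recovery finished:", s.recoveryFinished())

	close(release)   // let the simulated reconnect give up
	<-s.recoveryDone // wait for the background goroutine to exit
	fmt.Println("recovery finished:", s.recoveryFinished())
}
```

In a real fix the liveness probe would depend only on the gRPC listener being up, while readiness of individual frontends could be reported separately once recovery completes.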

To Reproduce

  1. Use Longhorn v1.12.x with the v2 (SPDK) data engine enabled.
  2. Create a v2 volume and attach it, then delete it while the instance-manager pod is being recycled (or force-kill the IM pod while the volume is being deleted).
  3. The instance-manager pod enters a create/delete loop.

Expected Behavior

The instance-manager pod should start serving gRPC immediately regardless of engine frontend recovery status. Recovery of stale/unreachable targets should fail fast and clean up without affecting pod health.
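One way to read "fail fast and clean up" is sketched below: each persisted record gets its own short deadline, and records whose target does not answer in time are dropped rather than retried. This is an assumption about the desired behavior, not the actual fix. `frontendRecord`, `connectFunc`, `recoverOrPrune`, `fakeConnect`, and the 50 ms `perTargetTimeout` are all hypothetical, and the addresses are made up.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// frontendRecord is a hypothetical stand-in for a persisted engine frontend record.
type frontendRecord struct {
	Name       string
	TargetAddr string
}

// connectFunc is a hypothetical hook that reconnects to an NVMe-TCP target.
type connectFunc func(ctx context.Context, rec frontendRecord) error

// perTargetTimeout is an assumed bound for illustration, not a Longhorn default.
const perTargetTimeout = 50 * time.Millisecond

// recoverOrPrune tries each record with its own deadline. Records that
// reconnect are kept; records that fail or time out are reported as pruned
// so the caller can delete their persisted metadata.
func recoverOrPrune(recs []frontendRecord, connect connectFunc, timeout time.Duration) (kept []frontendRecord, pruned []string) {
	for _, rec := range recs {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err := connect(ctx, rec)
		cancel()
		if err != nil {
			pruned = append(pruned, rec.Name)
			continue
		}
		kept = append(kept, rec)
	}
	return kept, pruned
}

// fakeConnect simulates a dial: the address "unreachable" hangs until the
// deadline expires, every other address succeeds immediately.
func fakeConnect(ctx context.Context, rec frontendRecord) error {
	if rec.TargetAddr == "unreachable" {
		<-ctx.Done()
		return ctx.Err()
	}
	return nil
}

func main() {
	recs := []frontendRecord{
		{Name: "vol-a", TargetAddr: "10.0.0.1:4420"},
		{Name: "vol-b", TargetAddr: "unreachable"},
	}
	kept, pruned := recoverOrPrune(recs, fakeConnect, perTargetTimeout)
	fmt.Println("kept:", kept)
	fmt.Println("pruned:", pruned)
}
```

With a per-target deadline, total recovery time is bounded by the number of records times the timeout, so a single stale record cannot stall the pod indefinitely.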

Support Bundle for Troubleshooting

N/A

Environment

  • Longhorn version:
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

No response

Workaround and Mitigation

No response

Metadata

Labels

  • area/resilience: System or volume resilience
  • area/v2-data-engine: v2 data engine (SPDK)
  • kind/bug
  • priority/0: Must be implement or fixed in this release (managed by PO)
  • require/auto-e2e-test: Require adding/updating auto e2e test cases if they can be automated

Projects

Status

Closed

Milestone

Relationships

None yet

Development

No branches or pull requests