Describe the Bug
When an instance-manager pod (v2 data engine) restarts while persisted engine frontend records reference NVMe-TCP targets that are no longer reachable (e.g., deleted volume, migrated engine), the recoverEngineFrontends() function blocks NewServer() synchronously. This prevents gRPC servers from starting, causing the liveness probe to fail and kubelet to kill the container. Since the pod has restartPolicy: Never, the Longhorn controller deletes and recreates the pod, but the same stale metadata still exists, creating an infinite crash loop.
To Reproduce
- Use Longhorn v1.12.x with v2 (SPDK) data engine enabled
- Create a v2 volume and attach it, then delete it while the instance-manager pod is being recycled (or force-kill the IM pod while the volume is being deleted)
- The instance-manager pod enters a create/delete loop.
Expected Behavior
The instance-manager pod should start serving gRPC immediately regardless of engine frontend recovery status. Recovery of stale/unreachable targets should fail fast and clean up without affecting pod health.
Support Bundle for Troubleshooting
N/A
Environment
- Longhorn version:
- Impacted volume (PV):
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
- Number of control plane nodes in the cluster:
- Number of worker nodes in the cluster:
- Node config
- OS type and version:
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type (e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes (Gbps):
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster:
Additional context
No response
Workaround and Mitigation
No response