Describe the Bug
While running the test case Reboot Replica Node While Heavy Writing And Recurring Jobs Exist on the v2 data engine,
after the node (ip-10-0-1-145) rebooted, a replica of the RWX volume failed, which left the RWX volume stuck in a detaching/attaching loop.
Screencast.from.2026-05-08.11-09-59.webm
One replica of the RWX volume is stopped: e2e-test-volume-2-r-978ed44c
6180aea1a697:/src/longhorn-tests# kl get replicas
NAME                           DATA ENGINE   STATE     NODE            DISK                                   INSTANCEMANAGER                                     IMAGE                                               AGE
e2e-test-volume-0-r-47ff7226   v2            running   ip-10-0-1-11    5628843f-bb36-4447-9620-e76e3e096e63   instance-manager-168b469a87e8d67896589454fc707180   longhornio/longhorn-instance-manager:v1.12.x-head   31m
e2e-test-volume-1-r-14cf258d   v2            running   ip-10-0-1-11    5628843f-bb36-4447-9620-e76e3e096e63   instance-manager-168b469a87e8d67896589454fc707180   longhornio/longhorn-instance-manager:v1.12.x-head   30m
e2e-test-volume-1-r-5d121399   v2            running   ip-10-0-1-223   9a464399-fd4e-4fff-b64f-704c820359d1   instance-manager-9d1aa8c527d7436fe411b7e52c6eead4   longhornio/longhorn-instance-manager:v1.12.x-head   30m
e2e-test-volume-1-r-b2b4488e   v2            running   ip-10-0-1-145   05c35d02-93ac-47ae-8356-c68acd6d7376   instance-manager-9a81748b1ec942bb3c3bc2133b8f3793   longhornio/longhorn-instance-manager:v1.12.x-head   30m
e2e-test-volume-2-r-3dfad716   v2            running   ip-10-0-1-223   9a464399-fd4e-4fff-b64f-704c820359d1   instance-manager-9d1aa8c527d7436fe411b7e52c6eead4   longhornio/longhorn-instance-manager:v1.12.x-head   30m
e2e-test-volume-2-r-978ed44c   v2            stopped   ip-10-0-1-145   05c35d02-93ac-47ae-8356-c68acd6d7376                                                                                                           30m
e2e-test-volume-2-r-a92dc9d0   v2            running   ip-10-0-1-11    5628843f-bb36-4447-9620-e76e3e096e63   instance-manager-168b469a87e8d67896589454fc707180   longhornio/longhorn-instance-manager:v1.12.x-head   30m
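For anyone triaging a similar run, the replica listing above can also be filtered programmatically. The following is only a sketch: it assumes the replicas were dumped with something like `kubectl -n longhorn-system get replicas.longhorn.io -o json`, and that each item carries `status.currentState` and `spec.nodeID`. The helper name `non_running_replicas` and the `sample` dictionary are hypothetical; `sample` is hand-built from two rows of the listing and is not real output from this cluster.

```python
def non_running_replicas(replica_list):
    """Return (name, node, state) for every replica whose state is not 'running'.

    replica_list is the parsed JSON of a Kubernetes list object (assumed shape).
    """
    results = []
    for item in replica_list.get("items", []):
        state = item.get("status", {}).get("currentState", "")
        if state != "running":
            node = item.get("spec", {}).get("nodeID", "")
            results.append((item["metadata"]["name"], node, state))
    return results


# Hand-built sample mirroring two rows of the listing above (not real cluster output).
sample = {
    "items": [
        {
            "metadata": {"name": "e2e-test-volume-2-r-3dfad716"},
            "spec": {"nodeID": "ip-10-0-1-223"},
            "status": {"currentState": "running"},
        },
        {
            "metadata": {"name": "e2e-test-volume-2-r-978ed44c"},
            "spec": {"nodeID": "ip-10-0-1-145"},
            "status": {"currentState": "stopped"},
        },
    ]
}

print(non_running_replicas(sample))
# [('e2e-test-volume-2-r-978ed44c', 'ip-10-0-1-145', 'stopped')]
```

With a real dump, the same function could be fed `json.load(open("replicas.json"))` instead of `sample`.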
Events of replica e2e-test-volume-2-r-978ed44c
Events:
Type     Reason          Age   From                         Message
----     ------          ----  ----                         -------
Normal   Start           37m   longhorn-replica-controller  Starts e2e-test-volume-2-r-978ed44c
Warning  FailedStopping  35m   longhorn-replica-controller  Error stopping e2e-test-volume-2-r-978ed44c: failed to check Instance Manager Instance Service Client connection for instance-manager-9a81748b1ec942bb3c3bc2133b8f3793 IP 10.42.2.19: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.2.19:8503: i/o timeout"
Normal   Stop            35m   longhorn-replica-controller  Stops e2e-test-volume-2-r-978ed44c
Instance manager log
# kl logs instance-manager-9a81748b1ec942bb3c3bc2133b8f3793 | grep e2e-test-volume-2-r-978ed44c
[longhorn-instance-manager] time="2026-05-08T03:00:47.330895133Z" level=info msg="Detected one possible existing replica block-disk/e2e-test-volume-2-r-978ed44c(737846af-f0dc-46d6-9ff7-74e83f66df34) with disk block-disk(05c35d02-93ac-47ae-8356-c68acd6d7376), spec size 2147483648, actual size 268435456" func="spdk.(*Server).rebuildCachedLvolObjects" file="server.go:416"
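The log line above reports a spec size and an actual size that differ. As a small worked example, the sketch below extracts both values from the message text and converts them to MiB. The names `parse_sizes` and `to_mib` are hypothetical helpers written for this illustration, and whether the size difference is related to the failure is not established here.

```python
import re

# Matches the "spec size N, actual size M" fragment of the message quoted above.
SIZE_PATTERN = re.compile(r"spec size (\d+), actual size (\d+)")


def parse_sizes(log_line):
    """Return (spec_size, actual_size) in bytes, or None if the fields are absent."""
    match = SIZE_PATTERN.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def to_mib(size_bytes):
    """Convert a byte count to MiB."""
    return size_bytes / (1024 * 1024)


# Message text copied from the instance manager log line above.
line = (
    "Detected one possible existing replica "
    "block-disk/e2e-test-volume-2-r-978ed44c(737846af-f0dc-46d6-9ff7-74e83f66df34) "
    "with disk block-disk(05c35d02-93ac-47ae-8356-c68acd6d7376), "
    "spec size 2147483648, actual size 268435456"
)

spec_size, actual_size = parse_sizes(line)
print(f"spec size: {to_mib(spec_size):.0f} MiB, actual size: {to_mib(actual_size):.0f} MiB")
# spec size: 2048 MiB, actual size: 256 MiB
```

In other words, the replica spec asks for 2 GiB while the logical volume detected on the disk is reported as 256 MiB.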
https://10.115.5.5/job/private/job/longhorn-e2e-test/303/console
To Reproduce
-t \"Reboot Replica Node While Heavy Writing And Recurring Jobs Exist\" --exclude \"cluster\" --exclude \"storage-network\" --exclude \"large-size\" -v LOOP_COUNT:1 -v RETRY_COUNT:1200 -v DATA_ENGINE:v2
Expected Behavior
The test case should pass.
Support Bundle for Troubleshooting
support_bundle.zip
Environment
Longhorn version: v1.12.0-rc1
Additional context
No response
Workaround and Mitigation
No response