[BUG][v1.12.0-rc1] RWX Volume Gets Stuck in Detaching/Attaching Loop After Reboot Replica Node While Heavy Writing And Recurring Jobs on v2 Data Engine #13062

@chriscchien

Describe the Bug

While running the test case "Reboot Replica Node While Heavy Writing And Recurring Jobs" on the v2 data engine, a replica of the RWX volume failed after the node (ip-10-0-1-145) rebooted, leaving the RWX volume stuck in a detaching/attaching loop.

Screencast.from.2026-05-08.11-09-59.webm

One replica of the RWX volume is stopped: e2e-test-volume-2-r-978ed44c

6180aea1a697:/src/longhorn-tests# kl get replicas
NAME                           DATA ENGINE   STATE     NODE            DISK                                   INSTANCEMANAGER                                     IMAGE                                               AGE
e2e-test-volume-0-r-47ff7226   v2            running   ip-10-0-1-11    5628843f-bb36-4447-9620-e76e3e096e63   instance-manager-168b469a87e8d67896589454fc707180   longhornio/longhorn-instance-manager:v1.12.x-head   31m
e2e-test-volume-1-r-14cf258d   v2            running   ip-10-0-1-11    5628843f-bb36-4447-9620-e76e3e096e63   instance-manager-168b469a87e8d67896589454fc707180   longhornio/longhorn-instance-manager:v1.12.x-head   30m
e2e-test-volume-1-r-5d121399   v2            running   ip-10-0-1-223   9a464399-fd4e-4fff-b64f-704c820359d1   instance-manager-9d1aa8c527d7436fe411b7e52c6eead4   longhornio/longhorn-instance-manager:v1.12.x-head   30m
e2e-test-volume-1-r-b2b4488e   v2            running   ip-10-0-1-145   05c35d02-93ac-47ae-8356-c68acd6d7376   instance-manager-9a81748b1ec942bb3c3bc2133b8f3793   longhornio/longhorn-instance-manager:v1.12.x-head   30m
e2e-test-volume-2-r-3dfad716   v2            running   ip-10-0-1-223   9a464399-fd4e-4fff-b64f-704c820359d1   instance-manager-9d1aa8c527d7436fe411b7e52c6eead4   longhornio/longhorn-instance-manager:v1.12.x-head   30m
e2e-test-volume-2-r-978ed44c   v2            stopped   ip-10-0-1-145   05c35d02-93ac-47ae-8356-c68acd6d7376                                                                                                           30m
e2e-test-volume-2-r-a92dc9d0   v2            running   ip-10-0-1-11    5628843f-bb36-4447-9620-e76e3e096e63   instance-manager-168b469a87e8d67896589454fc707180   longhornio/longhorn-instance-manager:v1.12.x-head   30m
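
For triage, the listing above can be filtered mechanically. The following is a minimal sketch, assuming the `kl get replicas` output has been saved as plain text; the function name `non_running_replicas` is hypothetical and not part of Longhorn or the test suite. It reads only the first four columns (name, data engine, state, node), because the stopped replica's row has empty INSTANCEMANAGER and IMAGE columns and therefore fewer fields.

```python
def non_running_replicas(table_text):
    """Return (name, state, node) for each replica row whose STATE is not 'running'."""
    found = []
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # not a replica row
        name, _data_engine, state, node = fields[:4]
        if state != "running":
            found.append((name, state, node))
    return found


# Rows copied from the listing above, trimmed after the DISK column.
sample = """\
NAME                           DATA ENGINE   STATE     NODE            DISK
e2e-test-volume-2-r-3dfad716   v2            running   ip-10-0-1-223   9a464399-fd4e-4fff-b64f-704c820359d1
e2e-test-volume-2-r-978ed44c   v2            stopped   ip-10-0-1-145   05c35d02-93ac-47ae-8356-c68acd6d7376
e2e-test-volume-2-r-a92dc9d0   v2            running   ip-10-0-1-11    5628843f-bb36-4447-9620-e76e3e096e63
"""

print(non_running_replicas(sample))
# [('e2e-test-volume-2-r-978ed44c', 'stopped', 'ip-10-0-1-145')]
```

On the listing in this report, the only row returned is the replica on the rebooted node.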

Events for e2e-test-volume-2-r-978ed44c

Events:
  Type     Reason          Age   From                         Message
  ----     ------          ----  ----                         -------
  Normal   Start           37m   longhorn-replica-controller  Starts e2e-test-volume-2-r-978ed44c
  Warning  FailedStopping  35m   longhorn-replica-controller  Error stopping e2e-test-volume-2-r-978ed44c: failed to check Instance Manager Instance Service Client connection for instance-manager-9a81748b1ec942bb3c3bc2133b8f3793 IP 10.42.2.19: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.2.19:8503: i/o timeout"
  Normal   Stop            35m   longhorn-replica-controller  Stops e2e-test-volume-2-r-978ed44c

Instance manager log

# kl logs instance-manager-9a81748b1ec942bb3c3bc2133b8f3793 | grep e2e-test-volume-2-r-978ed44c
[longhorn-instance-manager] time="2026-05-08T03:00:47.330895133Z" level=info msg="Detected one possible existing replica block-disk/e2e-test-volume-2-r-978ed44c(737846af-f0dc-46d6-9ff7-74e83f66df34) with disk block-disk(05c35d02-93ac-47ae-8356-c68acd6d7376), spec size 2147483648, actual size 268435456" func="spdk.(*Server).rebuildCachedLvolObjects" file="server.go:416"
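
The log line above reports two different sizes for the replica's logical volume. The following is a minimal sketch that extracts both values and converts them to MiB so the gap is easier to read; the helper name `lvol_sizes_mib` and the regular expression are assumptions based only on the message format shown in this one log line. Whether the size difference is related to the detaching/attaching loop is not established by this report.

```python
import re

# Matches the "spec size N, actual size M" fragment of the log message.
SIZE_RE = re.compile(r"spec size (\d+), actual size (\d+)")


def lvol_sizes_mib(log_line):
    """Return (spec_mib, actual_mib) parsed from a log line, or None if absent."""
    match = SIZE_RE.search(log_line)
    if match is None:
        return None
    spec_bytes, actual_bytes = (int(g) for g in match.groups())
    return spec_bytes / 2**20, actual_bytes / 2**20


# Fragment copied from the instance manager log line above.
line = (
    "with disk block-disk(05c35d02-93ac-47ae-8356-c68acd6d7376), "
    "spec size 2147483648, actual size 268435456"
)

print(lvol_sizes_mib(line))
# (2048.0, 256.0)
```

That is, the spec size is 2 GiB while the reported actual size is 256 MiB.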

https://10.115.5.5/job/private/job/longhorn-e2e-test/303/console

To Reproduce

-t \"Reboot Replica Node While Heavy Writing And Recurring Jobs Exist\" --exclude \"cluster\" --exclude \"storage-network\" --exclude \"large-size\" -v LOOP_COUNT:1 -v RETRY_COUNT:1200 -v DATA_ENGINE:v2

Expected Behavior

The test case passes.

Support Bundle for Troubleshooting

support_bundle.zip

Environment

  • Longhorn version: v1.12.0-rc1

Additional context

No response

Workaround and Mitigation

No response

Metadata

Labels

  • area/replica: Volume replica where data is placed
  • area/v2-data-engine: v2 data engine (SPDK)
  • kind/bug
  • kind/regression: Regression which has worked before
  • priority/2: Nice to implement or fix in this release (managed by PO)
  • reproduce/always: 100% reproducible
  • require/backport: Require backport. Only used when the specific versions to backport have not been defined.
  • require/qa-review-coverage: Require QA to review coverage
  • severity/1: Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade))

Projects

Status

Closed

Milestone

Relationships

None yet

Development

No branches or pull requests
