
Implement Longhorn-like behavior for node loss handling #1898

@mc2285

Description


Is your feature request related to a problem? Please describe.
As per this K8S issue there is currently no plans for an official solution to the problem of pods with RWO PVCs attached getting stuck forever if the underlying node is shut down unexpectedly. What is more, this also happens to me sometimes with graceful node shutdown on Mayastor. Having to manually troubleshoot all pods when a node goes down is an unbearable headache.

The lack of a built-in solution to such a seemingly basic problem makes Mayastor a less obviously advantageous choice of storage solution compared to alternatives that take a batteries-included approach, e.g. Longhorn.

Describe the solution you'd like
A feature like the one described here in the Longhorn docs would be a life-saver for many. Mayastor already has some kubectl drain-like functionality in the kubectl-openebs plugin. It would be nice if nodes that remain NotReady for a set period of time could be drained automatically (or have some subset of the drain operation performed on them). It would probably suffice to force-terminate pods with VolumeAttachments from the dead node and then delete those attachments once the timer expires; a rough sketch of such a loop follows below.
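
To make the idea concrete, here is a minimal sketch of what such a controller loop could look like, using the Python `kubernetes` client. This is not an existing Mayastor feature: the timeout value, the readiness check, and the overall flow are my assumptions about how the behavior could be implemented.

```python
# Hypothetical sketch: release RWO volumes from nodes that have been
# NotReady longer than a configurable timeout. Not an existing Mayastor feature.
from datetime import datetime, timezone, timedelta

from kubernetes import client, config

NOT_READY_TIMEOUT = timedelta(minutes=5)  # assumed value; would be configurable

config.load_kube_config()
core = client.CoreV1Api()
storage = client.StorageV1Api()

def node_not_ready_since(node):
    """Return when the node's Ready condition stopped being True, or None."""
    for cond in node.status.conditions or []:
        if cond.type == "Ready" and cond.status != "True":
            return cond.last_transition_time
    return None

for node in core.list_node().items:
    since = node_not_ready_since(node)
    if since is None or datetime.now(timezone.utc) - since < NOT_READY_TIMEOUT:
        continue

    # Force-delete pods still scheduled on the dead node so their workload
    # controllers (Deployment/StatefulSet) can reschedule them elsewhere.
    pods = core.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node.metadata.name}")
    for pod in pods.items:
        core.delete_namespaced_pod(
            pod.metadata.name, pod.metadata.namespace, grace_period_seconds=0)

    # Delete the stale VolumeAttachments so the CSI controller can attach
    # the volumes on whichever node the pods land on next.
    for va in storage.list_volume_attachment().items:
        if va.spec.node_name == node.metadata.name:
            storage.delete_volume_attachment(va.metadata.name)
```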

K8S should theoretically force-detach volumes from NotReady nodes after 6 minutes, but this only happens if the related pod has already terminated, i.e. essentially never. It is basically a useless feature, in my case at least, and I will try reporting this problem upstream. However, even if they do happen to agree with my point (they probably won't), it will take years for any change to this behavior to reach practical clusters.

It would be nice if node loss and recovery were slightly more automated. I hold the robust rebuild mechanism that Mayastor implements in high regard. It may not be the fastest (e.g. Longhorn V2 now supports delta-snapshot replica rebuilds to avoid re-transferring the whole volume), but it works every time. However, what is the point of robust automatic replica recovery if the underlying workload stays stuck awaiting manual intervention anyway?

As for data-loss and breaking-change concerns, I would expect this behavior to be optional and disabled by default. I fully understand the upstream rationale that this behavior might be undesired for some workloads, but it is not only desired but, I would say, crucial for workloads like mine, where I have a non-clustered application that I want to fail over to another node automatically if I lose one.

Describe alternatives you've considered
I considered using third-party controllers to address this problem, such as Descheduler, but they do not seem to aim at this issue, and kube-fencing is not actively maintained. I have been unable to find a third-party controller that addresses this issue universally across all CSI providers. Judging by the fact that Longhorn decided to implement this functionality themselves in their CSI plugin, that seems to be the only reasonable approach as of now.
