
Implement Longhorn-like behavior for node loss handling #1898

@mc2285

Description


Is your feature request related to a problem? Please describe.
As per this K8S issue there is currently no plans for an official solution to the problem of pods with RWO PVCs attached getting stuck forever if the underlying node is shut down unexpectedly. What is more, this also happens to me sometimes with graceful node shutdown on Mayastor. Having to manually troubleshoot all pods when a node goes down is an unbearable headache.

The lack of a built-in solution to such a seemingly basic problem makes Mayastor a less obviously advantageous choice of storage solution compared to alternatives that take a batteries-included approach, e.g. Longhorn.

Describe the solution you'd like
A feature like the one described here in the Longhorn docs would be a life-saver for many. Mayastor already has some kubectl drain-like functionality in the kubectl-openebs plugin. It would be nice if nodes that remain NotReady for a set period of time could be drained automatically (or have some subset of the drain operation performed on them). It would probably suffice to force-terminate pods with VolumeAttachments from the dead node and then delete those attachments once the timer expires; a rough sketch of such a loop follows below.
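
To make the idea concrete, here is a minimal sketch of what such a controller loop could look like, using the Python `kubernetes` client. This is not an existing Mayastor feature: the timeout value, the readiness check, and the overall flow are my assumptions about how the behavior could be implemented.

```python
# Hypothetical sketch: release RWO volumes from nodes that have been
# NotReady longer than a configurable timeout. Not an existing Mayastor feature.
from datetime import datetime, timezone, timedelta

from kubernetes import client, config

NOT_READY_TIMEOUT = timedelta(minutes=5)  # assumed value; would be configurable

config.load_kube_config()
core = client.CoreV1Api()
storage = client.StorageV1Api()

def node_not_ready_since(node):
    """Return when the node's Ready condition stopped being True, or None."""
    for cond in node.status.conditions or []:
        if cond.type == "Ready" and cond.status != "True":
            return cond.last_transition_time
    return None

for node in core.list_node().items:
    since = node_not_ready_since(node)
    if since is None or datetime.now(timezone.utc) - since < NOT_READY_TIMEOUT:
        continue

    # Force-delete pods still scheduled on the dead node so their workload
    # controllers (Deployment/StatefulSet) can reschedule them elsewhere.
    pods = core.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node.metadata.name}")
    for pod in pods.items:
        core.delete_namespaced_pod(
            pod.metadata.name, pod.metadata.namespace, grace_period_seconds=0)

    # Delete the stale VolumeAttachments so the CSI controller can attach
    # the volumes on whichever node the pods land on next.
    for va in storage.list_volume_attachment().items:
        if va.spec.node_name == node.metadata.name:
            storage.delete_volume_attachment(va.metadata.name)
```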

K8S should theoretically force-detach volumes from NotReady nodes after 6 minutes, but this only happens if the related pod has already terminated, i.e. essentially never. It is basically a useless feature, in my case at least, and I will try reporting this problem upstream. However, even if they do happen to agree with my point (they probably won't), it will take years for any change to this behavior to reach practical clusters.

It would be nice if node loss and recovery were slightly more automated. I hold the robust rebuild mechanism that Mayastor implements in high regard. It may not be the fastest (e.g. Longhorn V2 now supports delta-snapshot replica rebuilds to avoid re-transferring the whole volume), but it works every time. However, what is the point of robust automatic replica recovery if the underlying workload stays stuck awaiting manual intervention anyway?

As for data-loss and breaking-change concerns, I would expect this behavior to be optional and disabled by default. I fully understand the upstream rationale that this behavior might be undesired for some workloads, but it is not only desired but, I would say, crucial for workloads like mine, where I have a non-clustered application that I want to fail over to another node automatically if I lose one.

Describe alternatives you've considered
I considered using third-party controllers to address this problem, such as Descheduler, but they do not seem to aim at this issue, and kube-fencing is not actively maintained. I have been unable to find a third-party controller that addresses this issue universally across all CSI providers. Judging by the fact that Longhorn decided to implement this functionality themselves in their CSI plugin, that seems to be the only reasonable approach as of now.
