I was testing various Nebari functionality on this branch with non-NFS user home directories. Based on that testing, both JHub Apps and Jupyter Scheduler / Argo Workflows will need work before they can be considered supported with non-shared home directories.

The Problem

Both JHub Apps and Jupyter Scheduler create secondary pods (app server pods, scheduled job pods) that also mount the user's home directory PVC. With RWO (ReadWriteOnce) storage, a PVC can only be mounted by pods on the same node. Sometimes the secondary pods get scheduled on the same node as the user's primary pod, and everything works fine. But if they land on a different node, the secondary pod remains stuck and cannot start. This behavior is non-deterministic, and failures become more likely as the cluster scales (more nodes = lower chance of same-node scheduling).

Proposed Solution: PVC Cloning

The appropriate solution is to use PVC cloning. This requires that the CSI driver used by the home dir volumes supports PVC cloning. It is supported by the default CSI drivers on AWS, GCP, and Azure, and could also be supported by Longhorn or Rook/Ceph on K3s. Under the hood, PVC cloning creates a copy-on-write clone of the volume, so it's fast. Initial testing showed cloning completes in about 30 seconds. Supporting PVC cloning would require some changes to JHub Apps and also to Nebari's Jupyter Scheduler integration, but it's fairly straightforward.
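To make the PVC cloning proposal above concrete, here is a minimal sketch of what a clone request could look like. It only builds the manifest as a Python dict, using the standard Kubernetes `dataSource` field with `kind: PersistentVolumeClaim`. The function name, PVC names, namespace, size, and storage class are all hypothetical placeholders, not names taken from Nebari, JHub Apps, or Jupyter Scheduler.

```python
import json
from typing import Any, Dict


def build_clone_pvc_manifest(
    source_pvc: str,
    clone_name: str,
    namespace: str,
    storage: str,
    storage_class: str,
) -> Dict[str, Any]:
    """Build a PVC manifest that asks the CSI driver to clone `source_pvc`.

    Hypothetical helper for illustration only. Kubernetes requires the clone
    to live in the same namespace as the source PVC and to request at least
    as much storage as the source.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": clone_name, "namespace": namespace},
        "spec": {
            # The clone is still RWO; it is a separate volume, so it can be
            # attached on a different node than the user's primary pod.
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": storage}},
            # Setting dataSource to an existing PVC is what triggers cloning.
            "dataSource": {"kind": "PersistentVolumeClaim", "name": source_pvc},
        },
    }


if __name__ == "__main__":
    # All values below are assumed placeholders.
    manifest = build_clone_pvc_manifest(
        source_pvc="home-alice",
        clone_name="home-alice-job-1",
        namespace="dev",
        storage="10Gi",
        storage_class="standard",
    )
    print(json.dumps(manifest, indent=2))
```

The secondary pod (app server or scheduled job) would then mount the clone instead of the user's original home directory PVC. Where exactly this would be wired into JHub Apps or the Jupyter Scheduler integration, and how clones get cleaned up afterwards, is not covered by this sketch.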
Someone brought up the point that cloning the home dir is not super robust, because the user might have made weird changes to the env after they scheduled the notebook. Also, we may not have all of the user's envs available locally anymore.
Reference Issues or PRs
What does this implement/fix?
Put an x in the boxes that apply
Documentation
Testing
How to test this PR?
Any other comments?