Replies: 3 comments 1 reply
-
If you put all your eggs into one basket and that basket is full, recovering will probably mean wiping the whole basket. I would recommend separating concerns by sizing the parts of your disk appropriately.
And if you run out of space in a user volume, you can address that specifically for the particular workload.
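In Talos terms, that might look like capping the EPHEMERAL system volume so it can't consume the whole disk, leaving the remainder for user volumes. A sketch only — the `VolumeConfig` document exists in recent Talos releases, and the size here is purely illustrative:

```yaml
# Talos machine-config document (apply as a patch).
# Caps EPHEMERAL so the rest of the disk stays available for
# user volumes -- 100GB is illustrative, not a recommendation.
apiVersion: v1alpha1
kind: VolumeConfig
name: EPHEMERAL
provisioning:
  maxSize: 100GB
```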
-
After more playtesting and more data loss, I'm convinced this is a poor user experience. I created a storage class that pins volumes to the node they're on, with no replicas, then deployed Elasticsearch and dumped a ton of data into it. One of the nodes was a test server with a 2 TB drive, and I provisioned a 1 TB volume on it for Elasticsearch. That 1 TB volume somehow took up 2 TB of space on the node, the node got DiskPressure, and eventually I lost all the data again: I tried to recover the volume by moving it to a node with a 4 TB drive and by changing Longhorn's PriorityClass to 'system-node-critical', but DiskPressure had already evicted the Longhorn pods. I think this is mostly Longhorn's fault, because it uses a lot more space than you allocate (confirmed with the
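A single-replica, node-local storage class along the lines I described would look roughly like this (parameter names as documented by Longhorn; `strict-local` pins the sole replica to the node running the workload):

```yaml
# Sketch of a single-replica, node-local Longhorn StorageClass.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-local
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"
  dataLocality: "strict-local"   # requires numberOfReplicas: "1"
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
```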
-
I let Talos "do its thing" when setting up the boxes, so I don't think I have any user volumes. `talosctl get uservolumeconfig` says

After Talos was up, I installed Longhorn using the manifest defaults and told the Elastic Operator to use the 'longhorn' storage class. If I recall correctly, the default is three total copies of the data. I ended up creating a new storage class that keeps a single copy on the node using the volume; volumes still grow beyond what was allocated to them, but the problem is much less severe.

I guess I still need to do some digging into the different volumes I can create, and into how to say "use every volume available other than the installer USB stick and the system volume" during the install. I sort of expected there to be a 'uservolumeconfig' object I could see/edit/create.
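From what I can tell, user volumes in Talos are declared in the machine config rather than created imperatively, which would explain why nothing shows up until you add one. A hedged sketch — the volume name is hypothetical and the disk selector expression is illustrative; check your Talos version's `UserVolumeConfig` docs:

```yaml
# Illustrative Talos machine-config document; once applied, the
# volume should appear via `talosctl get uservolumeconfigs`.
apiVersion: v1alpha1
kind: UserVolumeConfig
name: longhorn-data              # hypothetical volume name
provisioning:
  diskSelector:
    match: "!system_disk"        # illustrative CEL: any non-system disk
  grow: true
filesystem:
  type: xfs
```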
-
I've been doing some playtesting with Talos over the last few weeks.
I have 14 identical machines: 8-core Atom boxes with 2x 256 GB SSDs in them.
I've tried spinning up various workloads and generally been having a fun time.
A few weeks ago I ran into a node complaining about DiskPressure. I rebooted it. No luck. The only pods that would launch were kube-flannel, kube-proxy, and spegel.
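I assume those three survive because they're DaemonSets, and the DaemonSet controller automatically adds a toleration for the disk-pressure taint, roughly:

```yaml
# Toleration added automatically to DaemonSet pods, letting them
# stay scheduled on a node tainted with DiskPressure.
tolerations:
  - key: node.kubernetes.io/disk-pressure
    operator: Exists
    effect: NoSchedule
```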
I did some digging and basically came to the conclusion that the cluster is expected to handle this automatically, so there are no options for dealing with the problem yourself.
I ended up draining and resetting the node, then I pre-provisioned it to bring it back online.
Fast forward to today, and I have a pretty involved test Elasticsearch cluster and suddenly a different node is out of space with only the same three pods running.
I had used this snippet (https://gist.github.com/rothgar/3f6a30d76300275d12044ce1d1210283) to get a sorta "root shell" on one of the nodes before to do some troubleshooting related to Longhorn...but I obviously can't launch it on the node due to DiskPressure.
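For context, that gist's approach is essentially a privileged pod with host access. A rough sketch of the general pattern — my paraphrase from memory, not the gist's exact manifest:

```yaml
# Rough sketch of a privileged "root shell" pod: host PID/network
# namespaces plus the host filesystem mounted at /host.
# Paraphrased -- not the gist's exact manifest.
apiVersion: v1
kind: Pod
metadata:
  name: root-shell
spec:
  nodeName: my-node            # hypothetical: pin to the node to inspect
  hostPID: true
  hostNetwork: true
  containers:
    - name: shell
      image: alpine
      command: ["sh"]
      stdin: true
      tty: true
      securityContext:
        privileged: true
      volumeMounts:
        - name: host-root
          mountPath: /host
  volumes:
    - name: host-root
      hostPath:
        path: /
```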
Since the node has nothing running (and Talos isn't a huge operating system), I can't imagine why it would be under DiskPressure when it has a 256 GB disk...and I suspect it's not related to spegel caching because none of the other nodes appear to be low on disk space...
How should I be troubleshooting and resolving this issue? Wiping the node doesn't seem like the right way to do it...