Replies: 3 comments 1 reply
-
If you put all your eggs into one basket and that basket is full, recovering will probably mean wiping the whole basket. I would recommend separating concerns by sizing the parts of your disk appropriately.
And if you run out of space in a user volume, you can address that specifically for the particular workload.
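In Talos terms, that might look like capping the EPHEMERAL system volume so it can't consume the whole disk, leaving the remainder for user volumes. A sketch only — the `VolumeConfig` document exists in recent Talos releases, and the size here is purely illustrative:

```yaml
# Talos machine-config document (apply as a patch).
# Caps EPHEMERAL so the rest of the disk stays available for
# user volumes -- 100GB is illustrative, not a recommendation.
apiVersion: v1alpha1
kind: VolumeConfig
name: EPHEMERAL
provisioning:
  maxSize: 100GB
```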
-
After more playtesting and more data loss, I'm convinced this is a poor user experience. I created a storage class that pins volumes to the node they're on, with no replicas, then deployed Elasticsearch and dumped a ton of data into it. One of the nodes was a test server with a 2 TB drive, and I provisioned a 1 TB volume on it for Elasticsearch. That 1 TB volume somehow took up 2 TB of space on the node, the node got DiskPressure, and eventually I lost all the data again: I tried to recover the volume by moving it to a node with a 4 TB drive and by changing Longhorn's PriorityClass to 'system-node-critical', but DiskPressure had already evicted the Longhorn pods. I think this is mostly Longhorn's fault, because it uses a lot more space than you allocate (confirmed with the
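A single-replica, node-local storage class along the lines I described would look roughly like this (parameter names as documented by Longhorn; `strict-local` pins the sole replica to the node running the workload):

```yaml
# Sketch of a single-replica, node-local Longhorn StorageClass.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-local
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"
  dataLocality: "strict-local"   # requires numberOfReplicas: "1"
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
```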
-
I let Talos "do its thing" when setting up the boxes, so I don't think I have any user volumes. `talosctl get uservolumeconfig` says

After Talos was up, I installed Longhorn using the manifest defaults and told the Elastic Operator to use the 'longhorn' storage class. If I recall correctly, the default is three total copies of the data. I ended up creating a new storage class that keeps a single copy on the node using the volume; volumes still grow beyond what was allocated to them, but the problem is much less severe.

I guess I still need to do some digging into the different volumes I can create, and into how to say "use every volume available other than the installer USB stick and the system volume" during the install. I sort of expected there to be a 'uservolumeconfig' object I could see/edit/create.
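From what I can tell, user volumes in Talos are declared in the machine config rather than created imperatively, which would explain why nothing shows up until you add one. A hedged sketch — the volume name is hypothetical and the disk selector expression is illustrative; check your Talos version's `UserVolumeConfig` docs:

```yaml
# Illustrative Talos machine-config document; once applied, the
# volume should appear via `talosctl get uservolumeconfigs`.
apiVersion: v1alpha1
kind: UserVolumeConfig
name: longhorn-data              # hypothetical volume name
provisioning:
  diskSelector:
    match: "!system_disk"        # illustrative CEL: any non-system disk
  grow: true
filesystem:
  type: xfs
```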
-
I've been doing some playtesting with Talos over the last few weeks.
I have 14 identical machines: 8-core Atom boxes with 2x 256 GB SSDs in them.
I've tried spinning up various workloads and generally been having a fun time.
A few weeks ago I ran into a node complaining about DiskPressure. I rebooted it. No luck. The only pods that would launch were kube-flannel, kube-proxy, and spegel.
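I assume those three survive because they're DaemonSets, and the DaemonSet controller automatically adds a toleration for the disk-pressure taint, roughly:

```yaml
# Toleration added automatically to DaemonSet pods, letting them
# stay scheduled on a node tainted with DiskPressure.
tolerations:
  - key: node.kubernetes.io/disk-pressure
    operator: Exists
    effect: NoSchedule
```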
I did some digging and basically came to the conclusion that the cluster is expected to handle this automatically, so there are no options for dealing with the problem yourself.
I ended up draining and resetting the node, then I pre-provisioned it to bring it back online.
Fast forward to today, and I have a pretty involved test Elasticsearch cluster and suddenly a different node is out of space with only the same three pods running.
I had used this snippet (https://gist.github.com/rothgar/3f6a30d76300275d12044ce1d1210283) to get a sorta "root shell" on one of the nodes before to do some troubleshooting related to Longhorn...but I obviously can't launch it on the node due to DiskPressure.
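For context, that gist's approach is essentially a privileged pod with host access. A rough sketch of the general pattern — my paraphrase from memory, not the gist's exact manifest:

```yaml
# Rough sketch of a privileged "root shell" pod: host PID/network
# namespaces plus the host filesystem mounted at /host.
# Paraphrased -- not the gist's exact manifest.
apiVersion: v1
kind: Pod
metadata:
  name: root-shell
spec:
  nodeName: my-node            # hypothetical: pin to the node to inspect
  hostPID: true
  hostNetwork: true
  containers:
    - name: shell
      image: alpine
      command: ["sh"]
      stdin: true
      tty: true
      securityContext:
        privileged: true
      volumeMounts:
        - name: host-root
          mountPath: /host
  volumes:
    - name: host-root
      hostPath:
        path: /
```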
Since the node has nothing running (and Talos isn't a huge operating system), I can't imagine why it would be under DiskPressure when it has a 256 GB disk...and I suspect it's not related to spegel caching because none of the other nodes appear to be low on disk space...
How should I be troubleshooting and resolving this issue? Wiping the node doesn't seem like the right way to do it...